1 | //===----------------------------------------------------------------------===// |
2 | // Clang Static Analyzer |
3 | //===----------------------------------------------------------------------===// |
4 | |
5 | = Library Structure = |
6 | |
7 | The analyzer library has two layers: a (low-level) static analysis |
8 | engine (GRExprEngine.cpp and friends), and some static checkers |
9 | (*Checker.cpp). The latter are built on top of the former via the |
10 | Checker and CheckerVisitor interfaces (Checker.h and |
11 | CheckerVisitor.h). The Checker interface is designed to be minimal |
12 | and simple for checker writers, and attempts to isolate them from much |
13 | of the gore of the internal analysis engine. |
14 | |
15 | = How It Works = |
16 | |
17 | The analyzer is inspired by several foundational research papers ([1], |
18 | [2]). (FIXME: kremenek to add more links) |
19 | |
20 | In a nutshell, the analyzer is basically a source code simulator that |
21 | traces out possible paths of execution. The state of the program |
22 | (values of variables and expressions) is encapsulated by the state |
23 | (ProgramState). A location in the program is called a program point |
24 | (ProgramPoint), and the combination of state and program point is a |
25 | node in an exploded graph (ExplodedGraph). The term "exploded" comes |
26 | from exploding the control-flow edges in the control-flow graph (CFG). |
27 | |
28 | Conceptually the analyzer does a reachability analysis through the |
29 | ExplodedGraph. We start at a root node, which has the entry program |
30 | point and initial state, and then simulate transitions by analyzing |
31 | individual expressions. The analysis of an expression can cause the |
32 | state to change, resulting in a new node in the ExplodedGraph with an |
33 | updated program point and an updated state. A bug is found by hitting |
34 | a node that satisfies some "bug condition" (basically a violation of a |
35 | checking invariant). |
36 | |
37 | The analyzer traces out multiple paths by reasoning about branches and |
38 | then bifurcating the state: on the true branch the conditions of the |
39 | branch are assumed to be true and on the false branch the conditions |
40 | of the branch are assumed to be false. Such "assumptions" create |
41 | constraints on the values of the program, and those constraints are |
42 | recorded in the ProgramState object (and are manipulated by the |
43 | ConstraintManager). If assuming the conditions of a branch would |
44 | cause the constraints to be unsatisfiable, the branch is considered |
45 | infeasible and that path is not taken. This is how we get |
46 | path-sensitivity. We reduce exponential blow-up by caching nodes. If |
47 | a new node with the same state and program point as an existing node |
48 | would get generated, the path "caches out" and we simply reuse the |
49 | existing node. Thus the ExplodedGraph is not a DAG; it can contain |
50 | cycles as paths loop back onto each other and cache out. |
51 | |
52 | ProgramState and ExplodedNodes are basically immutable once created. Once |
53 | one creates a ProgramState, you need to create a new one to get a new |
54 | ProgramState. This immutability is key since the ExplodedGraph represents |
55 | the behavior of the analyzed program from the entry point. To |
56 | represent these efficiently, we use functional data structures (e.g., |
57 | ImmutableMaps) which share data between instances. |
58 | |
59 | Finally, individual Checkers work by also manipulating the analysis |
60 | state. The analyzer engine talks to them via a visitor interface. |
61 | For example, the PreVisitCallExpr() method is called by GRExprEngine |
62 | to tell the Checker that we are about to analyze a CallExpr, and the |
63 | checker is asked to check for any preconditions that might not be |
64 | satisfied. The checker can do nothing, or it can generate a new |
65 | ProgramState and ExplodedNode which contains updated checker state. If it |
66 | finds a bug, it can tell the BugReporter object about the bug, |
67 | providing it an ExplodedNode which is the last node in the path that |
68 | triggered the problem. |
69 | |
70 | = Notes about C++ = |
71 | |
72 | Since now constructors are seen before the variable that is constructed |
73 | in the CFG, we create a temporary object as the destination region that |
74 | is constructed into. See ExprEngine::VisitCXXConstructExpr(). |
75 | |
76 | In ExprEngine::processCallExit(), we always bind the object region to the |
77 | evaluated CXXConstructExpr. Then in VisitDeclStmt(), we compute the |
78 | corresponding lazy compound value if the variable is not a reference, and |
79 | bind the variable region to the lazy compound value. If the variable |
80 | is a reference, just use the object region as the initializer value. |
81 | |
82 | Before entering a C++ method (or ctor/dtor), the 'this' region is bound |
83 | to the object region. In ctors, we synthesize 'this' region with |
84 | CXXRecordDecl*, which means we do not use type qualifiers. In methods, we |
85 | synthesize 'this' region with CXXMethodDecl*, which has getThisType() |
86 | taking type qualifiers into account. It does not matter we use qualified |
87 | 'this' region in one method and unqualified 'this' region in another |
88 | method, because we only need to ensure the 'this' region is consistent |
89 | when we synthesize it and create it directly from CXXThisExpr in a single |
90 | method call. |
91 | |
92 | = Working on the Analyzer = |
93 | |
94 | If you are interested in bringing up support for C++ expressions, the |
95 | best place to look is the visitation logic in GRExprEngine, which |
96 | handles the simulation of individual expressions. There are plenty of |
97 | examples there of how other expressions are handled. |
98 | |
99 | If you are interested in writing checkers, look at the Checker and |
100 | CheckerVisitor interfaces (Checker.h and CheckerVisitor.h). Also look |
101 | at the files named *Checker.cpp for examples on how you can implement |
102 | these interfaces. |
103 | |
104 | = Debugging the Analyzer = |
105 | |
106 | There are some useful command-line options for debugging. For example: |
107 | |
108 | $ clang -cc1 -help | grep analyze |
109 | -analyze-function <value> |
110 | -analyzer-display-progress |
111 | -analyzer-viz-egraph-graphviz |
112 | ... |
113 | |
114 | The first allows you to specify only analyzing a specific function. |
115 | The second prints to the console what function is being analyzed. The |
116 | third generates a graphviz dot file of the ExplodedGraph. This is |
117 | extremely useful when debugging the analyzer and viewing the |
118 | simulation results. |
119 | |
120 | Of course, viewing the CFG (Control-Flow Graph) is also useful: |
121 | |
122 | $ clang -cc1 -help | grep cfg |
123 | -cfg-add-implicit-dtors Add C++ implicit destructors to CFGs for all analyses |
124 | -cfg-add-initializers Add C++ initializers to CFGs for all analyses |
125 | -cfg-dump Display Control-Flow Graphs |
126 | -cfg-view View Control-Flow Graphs using GraphViz |
127 | -unoptimized-cfg Generate unoptimized CFGs for all analyses |
128 | |
129 | -cfg-dump dumps a textual representation of the CFG to the console, |
130 | and -cfg-view creates a GraphViz representation. |
131 | |
132 | = References = |
133 | |
134 | [1] Precise interprocedural dataflow analysis via graph reachability, |
135 | T Reps, S Horwitz, and M Sagiv, POPL '95, |
136 | http://portal.acm.org/citation.cfm?id=199462 |
137 | |
138 | [2] A memory model for static analysis of C programs, Z Xu, T |
139 | Kremenek, and J Zhang, http://lcs.ios.ac.cn/~xzx/memmodel.pdf |
140 | |