1 | ============ |
2 | Region Store |
3 | ============ |
4 | The analyzer "Store" represents the contents of memory regions. It is an opaque |
5 | functional data structure stored in each ``ProgramState``; the only class that |
6 | can modify the store is its associated StoreManager. |
7 | |
8 | Currently (Feb. 2013), the only StoreManager implementation being used is |
9 | ``RegionStoreManager``. This store records bindings to memory regions using a |
10 | "base region + offset" key. (This allows ``*p`` and ``p[0]`` to map to the same |
11 | location, among other benefits.) |
12 | |
13 | Regions are grouped into "clusters", which roughly correspond to "regions with |
14 | the same base region". This allows certain operations to be more efficient, |
15 | such as invalidation. |
16 | |
17 | Regions that do not have a known offset use a special "symbolic" offset. These |
18 | keys store both the original region, and the "concrete offset region" -- the |
19 | last region whose offset is entirely concrete. (For example, in the expression |
20 | ``foo.bar[1][i].baz``, the concrete offset region is the array ``foo.bar[1]``, |
21 | since that has a known offset from the start of the top-level ``foo`` struct.) |
22 | |
23 | |
24 | Binding Invalidation |
25 | -------------------- |
26 | |
27 | Supporting both concrete and symbolic offsets makes things a bit tricky. Here's |
28 | an example: |
29 | |
30 | .. code-block:: cpp |
31 | |
32 | foo[0] = 0; |
33 | foo[1] = 1; |
34 | foo[i] = i; |
35 | |
36 | After the third assignment, nothing can be said about the value of ``foo[0]``, |
37 | because ``foo[i]`` may have overwritten it! Thus, *binding to a region with a |
38 | symbolic offset invalidates the entire concrete offset region.* We know |
39 | ``foo[i]`` is somewhere within ``foo``, so we don't have to invalidate |
40 | anything else, but we do have to be conservative about all other bindings within |
41 | ``foo``. |
42 | |
43 | Continuing the example: |
44 | |
45 | .. code-block:: cpp |
46 | |
47 | foo[i] = i; |
48 | foo[0] = 0; |
49 | |
50 | After this latest assignment, nothing can be said about the value of ``foo[i]``, |
51 | because ``foo[0]`` may have overwritten it! *Binding to a region R with a |
52 | concrete offset invalidates any symbolic offset bindings whose concrete offset |
53 | region is a super-region **or** sub-region of R.* All we know about ``foo[i]`` |
54 | is that it is somewhere within ``foo``, so changing *anything* within ``foo`` |
55 | might change ``foo[i]``, and changing *all* of ``foo`` (or its base region) will |
56 | *definitely* change ``foo[i]``. |
57 | |
58 | This logic could be improved by using the current constraints on ``i``, at the |
59 | cost of speed. The latter case could also be improved by matching region kinds, |
60 | i.e. changing ``foo[0].a`` is unlikely to affect ``foo[i].b``, no matter what |
61 | ``i`` is. |
62 | |
63 | For more detail, read through ``RegionStoreManager::removeSubRegionBindings`` in |
64 | RegionStore.cpp. |
65 | |
66 | |
67 | ObjCIvarRegions |
68 | --------------- |
69 | |
70 | Objective-C instance variables require a bit of special handling. Like struct |
71 | fields, they are not base regions, and when their parent object region is |
72 | invalidated, all the instance variables must be invalidated as well. However, |
73 | they have no concrete compile-time offsets (in the modern, "non-fragile" |
74 | runtime), and so cannot easily be represented as an offset from the start of |
75 | the object in the analyzer. Moreover, this means that invalidating a single |
76 | instance variable should *not* invalidate the rest of the object, since unlike |
77 | struct fields or array elements there is no way to perform pointer arithmetic |
78 | to access another instance variable. |
79 | |
80 | Consequently, although the base region of an ObjCIvarRegion is the entire |
81 | object, RegionStore offsets are computed from the start of the instance |
82 | variable. Thus it is not valid to assume that all bindings with non-symbolic |
83 | offsets start from the base region! |
84 | |
85 | |
86 | Region Invalidation |
87 | ------------------- |
88 | |
89 | Unlike binding invalidation, region invalidation occurs when the entire |
90 | contents of a region may have changed---say, because it has been passed to a |
91 | function the analyzer can model, like memcpy, or because its address has |
92 | escaped, usually as an argument to an opaque function call. In these cases we |
93 | need to throw away not just all bindings within the region itself, but within |
94 | its entire cluster, since neighboring regions may be accessed via pointer |
95 | arithmetic. |
96 | |
97 | Region invalidation typically does even more than this, however. Because it |
98 | usually represents the complete escape of a region from the analyzer's model, |
99 | its *contents* must also be transitively invalidated. (For example, if a region |
100 | ``p`` of type ``int **`` is invalidated, the contents of ``*p`` and ``**p`` may |
101 | have changed as well.) The algorithm that traverses this transitive closure of |
102 | accessible regions is known as ClusterAnalysis, and is also used for finding |
103 | all live bindings in the store (in order to throw away the dead ones). The name |
104 | "ClusterAnalysis" predates the cluster-based organization of bindings, but |
105 | refers to the same concept: during invalidation and liveness analysis, all |
106 | bindings within a cluster must be treated in the same way for a conservative |
107 | model of program behavior. |
108 | |
109 | |
110 | Default Bindings |
111 | ---------------- |
112 | |
113 | Most bindings in RegionStore are simple scalar values -- integers and pointers. |
114 | These are known as "Direct" bindings. However, RegionStore supports a second |
115 | type of binding called a "Default" binding. These are used to provide values to |
116 | all the elements of an aggregate type (struct or array) without having to |
117 | explicitly specify a binding for each individual element. |
118 | |
119 | When there is no Direct binding for a particular region, the store manager |
120 | looks at each super-region in turn to see if there is a Default binding. If so, |
121 | this value is used as the value of the original region. The search ends when |
122 | the base region is reached, at which point the RegionStore will pick an |
123 | appropriate default value for the region (usually a symbolic value, but |
124 | sometimes zero, for static data, or "uninitialized", for stack variables). |
125 | |
126 | .. code-block:: cpp |
127 | |
128 | int manyInts[10]; |
129 | manyInts[1] = 42; // Creates a Direct binding for manyInts[1]. |
130 | print(manyInts[1]); // Retrieves the Direct binding for manyInts[1]; |
131 | print(manyInts[0]); // There is no Direct binding for manyInts[0]. |
132 | // Is there a Default binding for the entire array? |
133 | // There is not, but it is a stack variable, so we use |
134 | // "uninitialized" as the default value (and emit a |
135 | // diagnostic!). |
136 | |
137 | NOTE: The fact that bindings are stored as a base region plus an offset limits |
138 | the Default Binding strategy, because in C aggregates can contain other |
139 | aggregates. In the current implementation of RegionStore, there is no way to |
140 | distinguish a Default binding for an entire aggregate from a Default binding |
141 | for the sub-aggregate at offset 0. |
142 | |
143 | |
144 | Lazy Bindings (LazyCompoundVal) |
145 | ------------------------------- |
146 | |
147 | RegionStore implements an optimization for copying aggregates (structs and |
148 | arrays) called "lazy bindings", implemented using a special SVal called |
149 | LazyCompoundVal. When the store is asked for the "binding" for an entire |
150 | aggregate (i.e. for an lvalue-to-rvalue conversion), it returns a |
151 | LazyCompoundVal instead. When this value is then stored into a variable, it is |
152 | bound as a Default value. This makes copying arrays and structs much cheaper |
153 | than if they had required memberwise access. |
154 | |
155 | Under the hood, a LazyCompoundVal is implemented as a uniqued pair of (region, |
156 | store), representing "the value of the region during this 'snapshot' of the |
157 | store". This has important implications for any sort of liveness or |
158 | reachability analysis, which must take the bindings in the old store into |
159 | account. |
160 | |
161 | Retrieving a value from a lazy binding happens in the same way as any other |
162 | Default binding: since there is no direct binding, the store manager falls back |
163 | to super-regions to look for an appropriate default binding. LazyCompoundVal |
164 | differs from a normal default binding, however, in that it contains several |
165 | different values, instead of one value that will appear several times. Because |
166 | of this, the store manager has to reconstruct the subregion chain on top of the |
167 | LazyCompoundVal region, and look up *that* region in the previous store. |
168 | |
169 | Here's a concrete example: |
170 | |
171 | .. code-block:: cpp |
172 | |
173 | CGPoint p; |
174 | p.x = 42; // A Direct binding is made to the FieldRegion 'p.x'. |
175 | CGPoint p2 = p; // A LazyCompoundVal is created for 'p', along with a |
176 | // snapshot of the current store state. This value is then |
177 | // used as a Default binding for the VarRegion 'p2'. |
178 | return p2.x; // The binding for FieldRegion 'p2.x' is requested. |
179 | // There is no Direct binding, so we look for a Default |
180 | // binding to 'p2' and find the LCV. |
181 | // Because it's a LCV, we look at our requested region |
182 | // and see that it's the '.x' field. We ask for the value |
183 | // of 'p.x' within the snapshot, and get back 42. |
184 | |