1 | DataFlowSanitizer Design Document |
2 | ================================= |
3 | |
4 | This document sets out the design for DataFlowSanitizer, a general |
5 | dynamic data flow analysis. Unlike other Sanitizer tools, this tool is |
6 | not designed to detect a specific class of bugs on its own. Instead, |
7 | it provides a generic dynamic data flow analysis framework to be used |
8 | by clients to help detect application-specific issues within their |
9 | own code. |
10 | |
11 | DataFlowSanitizer is a program instrumentation which can associate |
12 | a number of taint labels with any data stored in any memory region |
13 | accessible by the program. The analysis is dynamic, which means that |
14 | it operates on a running program, and tracks how the labels propagate |
15 | through that program. The tool shall support a large (>100) number |
16 | of labels, such that programs which operate on large numbers of data |
17 | items may be analysed with each data item being tracked separately. |
18 | |
19 | Use Cases |
20 | --------- |
21 | |
22 | This instrumentation can be used as a tool to help monitor how data |
23 | flows from a program's inputs (sources) to its outputs (sinks). |
24 | This has applications from a privacy/security perspective in that |
25 | one can audit how a sensitive data item is used within a program and |
26 | ensure it isn't exiting the program anywhere it shouldn't be. |
27 | |
28 | Interface |
29 | --------- |
30 | |
31 | A number of functions are provided which will create taint labels, |
32 | attach labels to memory regions and extract the set of labels |
33 | associated with a specific memory region. These functions are declared |
34 | in the header file ``sanitizer/dfsan_interface.h``. |
35 | |
36 | .. code-block:: c |
37 | |
38 | /// Creates and returns a base label with the given description and user data. |
39 | dfsan_label dfsan_create_label(const char *desc, void *userdata); |
40 | |
41 | /// Sets the label for each address in [addr,addr+size) to \c label. |
42 | void dfsan_set_label(dfsan_label label, void *addr, size_t size); |
43 | |
44 | /// Sets the label for each address in [addr,addr+size) to the union of the |
45 | /// current label for that address and \c label. |
46 | void dfsan_add_label(dfsan_label label, void *addr, size_t size); |
47 | |
48 | /// Retrieves the label associated with the given data. |
49 | /// |
50 | /// The type of 'data' is arbitrary. The function accepts a value of any type, |
51 | /// which can be truncated or extended (implicitly or explicitly) as necessary. |
52 | /// The truncation/extension operations will preserve the label of the original |
53 | /// value. |
54 | dfsan_label dfsan_get_label(long data); |
55 | |
56 | /// Retrieves a pointer to the dfsan_label_info struct for the given label. |
57 | const struct dfsan_label_info *dfsan_get_label_info(dfsan_label label); |
58 | |
59 | /// Returns whether the given label label contains the label elem. |
60 | int dfsan_has_label(dfsan_label label, dfsan_label elem); |
61 | |
62 | /// If the given label label contains a label with the description desc, returns |
63 | /// that label, else returns 0. |
64 | dfsan_label dfsan_has_label_with_desc(dfsan_label label, const char *desc); |
65 | |
66 | Taint label representation |
67 | -------------------------- |
68 | |
69 | As stated above, the tool must track a large number of taint |
70 | labels. This poses an implementation challenge, as most multiple-label |
71 | tainting systems assign one label per bit to shadow storage, and |
72 | union taint labels using a bitwise or operation. This will not scale |
73 | to clients which use hundreds or thousands of taint labels, as the |
74 | label union operation becomes O(n) in the number of supported labels, |
75 | and data associated with it will quickly dominate the live variable |
76 | set, causing register spills and hampering performance. |
77 | |
78 | Instead, a low overhead approach is proposed which is best-case O(log\ |
79 | :sub:`2` n) during execution. The underlying assumption is that |
80 | the required space of label unions is sparse, which is a reasonable |
81 | assumption to make given that we are optimizing for the case where |
82 | applications mostly copy data from one place to another, without often |
83 | invoking the need for an actual union operation. The representation |
84 | of a taint label is a 16-bit integer, and new labels are allocated |
85 | sequentially from a pool. The label identifier 0 is special, and means |
86 | that the data item is unlabelled. |
87 | |
88 | When a label union operation is requested at a join point (any |
89 | arithmetic or logical operation with two or more operands, such as |
90 | addition), the code checks whether a union is required, whether the |
91 | same union has been requested before, and whether one union label |
92 | subsumes the other. If so, it returns the previously allocated union |
93 | label. If not, it allocates a new union label from the same pool used |
94 | for new labels. |
95 | |
96 | Specifically, the instrumentation pass will insert code like this |
97 | to decide the union label ``lu`` for a pair of labels ``l1`` |
98 | and ``l2``: |
99 | |
100 | .. code-block:: c |
101 | |
102 | if (l1 == l2) |
103 | lu = l1; |
104 | else |
105 | lu = __dfsan_union(l1, l2); |
106 | |
107 | The equality comparison is outlined, to provide an early exit in |
108 | the common cases where the program is processing unlabelled data, or |
109 | where the two data items have the same label. ``__dfsan_union`` is |
110 | a runtime library function which performs all other union computation. |
111 | |
112 | Further optimizations are possible, for example if ``l1`` is known |
113 | at compile time to be zero (e.g. it is derived from a constant), |
114 | ``l2`` can be used for ``lu``, and vice versa. |
115 | |
116 | Memory layout and label management |
117 | ---------------------------------- |
118 | |
119 | The following is the current memory layout for Linux/x86\_64: |
120 | |
121 | +---------------+---------------+--------------------+ |
122 | | Start | End | Use | |
123 | +===============+===============+====================+ |
124 | | 0x700000008000|0x800000000000 | application memory | |
125 | +---------------+---------------+--------------------+ |
126 | | 0x200200000000|0x700000008000 | unused | |
127 | +---------------+---------------+--------------------+ |
128 | | 0x200000000000|0x200200000000 | union table | |
129 | +---------------+---------------+--------------------+ |
130 | | 0x000000010000|0x200000000000 | shadow memory | |
131 | +---------------+---------------+--------------------+ |
132 | | 0x000000000000|0x000000010000 | reserved by kernel | |
133 | +---------------+---------------+--------------------+ |
134 | |
135 | Each byte of application memory corresponds to two bytes of shadow |
136 | memory, which are used to store its taint label. As for LLVM SSA |
137 | registers, we have not found it necessary to associate a label with |
138 | each byte or bit of data, as some other tools do. Instead, labels are |
139 | associated directly with registers. Loads will result in a union of |
140 | all shadow labels corresponding to bytes loaded (which most of the |
141 | time will be short circuited by the initial comparison) and stores will |
142 | result in a copy of the label to the shadow of all bytes stored to. |
143 | |
144 | Propagating labels through arguments |
145 | ------------------------------------ |
146 | |
147 | In order to propagate labels through function arguments and return values, |
148 | DataFlowSanitizer changes the ABI of each function in the translation unit. |
149 | There are currently two supported ABIs: |
150 | |
151 | * Args -- Argument and return value labels are passed through additional |
152 | arguments and by modifying the return type. |
153 | |
154 | * TLS -- Argument and return value labels are passed through TLS variables |
155 | ``__dfsan_arg_tls`` and ``__dfsan_retval_tls``. |
156 | |
157 | The main advantage of the TLS ABI is that it is more tolerant of ABI mismatches |
158 | (TLS storage is not shared with any other form of storage, whereas extra |
159 | arguments may be stored in registers which under the native ABI are not used |
160 | for parameter passing and thus could contain arbitrary values). On the other |
161 | hand the args ABI is more efficient and allows ABI mismatches to be more easily |
162 | identified by checking for nonzero labels in nominally unlabelled programs. |
163 | |
164 | Implementing the ABI list |
165 | ------------------------- |
166 | |
167 | The `ABI list <DataFlowSanitizer.html#abi-list>`_ provides a list of functions |
168 | which conform to the native ABI, each of which is callable from an instrumented |
169 | program. This is implemented by replacing each reference to a native ABI |
170 | function with a reference to a function which uses the instrumented ABI. |
171 | Such functions are automatically-generated wrappers for the native functions. |
172 | For example, given the ABI list example provided in the user manual, the |
173 | following wrappers will be generated under the args ABI: |
174 | |
175 | .. code-block:: llvm |
176 | |
177 | define linkonce_odr { i8*, i16 } @"dfsw$malloc"(i64 %0, i16 %1) { |
178 | entry: |
179 | %2 = call i8* @malloc(i64 %0) |
180 | %3 = insertvalue { i8*, i16 } undef, i8* %2, 0 |
181 | %4 = insertvalue { i8*, i16 } %3, i16 0, 1 |
182 | ret { i8*, i16 } %4 |
183 | } |
184 | |
185 | define linkonce_odr { i32, i16 } @"dfsw$tolower"(i32 %0, i16 %1) { |
186 | entry: |
187 | %2 = call i32 @tolower(i32 %0) |
188 | %3 = insertvalue { i32, i16 } undef, i32 %2, 0 |
189 | %4 = insertvalue { i32, i16 } %3, i16 %1, 1 |
190 | ret { i32, i16 } %4 |
191 | } |
192 | |
193 | define linkonce_odr { i8*, i16 } @"dfsw$memcpy"(i8* %0, i8* %1, i64 %2, i16 %3, i16 %4, i16 %5) { |
194 | entry: |
195 | %labelreturn = alloca i16 |
196 | %6 = call i8* @__dfsw_memcpy(i8* %0, i8* %1, i64 %2, i16 %3, i16 %4, i16 %5, i16* %labelreturn) |
197 | %7 = load i16* %labelreturn |
198 | %8 = insertvalue { i8*, i16 } undef, i8* %6, 0 |
199 | %9 = insertvalue { i8*, i16 } %8, i16 %7, 1 |
200 | ret { i8*, i16 } %9 |
201 | } |
202 | |
203 | As an optimization, direct calls to native ABI functions will call the |
204 | native ABI function directly and the pass will compute the appropriate label |
205 | internally. This has the advantage of reducing the number of union operations |
206 | required when the return value label is known to be zero (i.e. ``discard`` |
207 | functions, or ``functional`` functions with known unlabelled arguments). |
208 | |
209 | Checking ABI Consistency |
210 | ------------------------ |
211 | |
212 | DFSan changes the ABI of each function in the module. This makes it possible |
213 | for a function with the native ABI to be called with the instrumented ABI, |
214 | or vice versa, thus possibly invoking undefined behavior. A simple way |
215 | of statically detecting instances of this problem is to prepend the prefix |
216 | "dfs$" to the name of each instrumented-ABI function. |
217 | |
218 | This will not catch every such problem; in particular function pointers passed |
219 | across the instrumented-native barrier cannot be used on the other side. |
220 | These problems could potentially be caught dynamically. |
221 | |