1 | =============================================================== |
2 | Tutorial for building tools using LibTooling and LibASTMatchers |
3 | =============================================================== |
4 | |
5 | This document is intended to show how to build a useful source-to-source |
6 | translation tool based on Clang's `LibTooling <LibTooling.html>`_. It is |
7 | explicitly aimed at people who are new to Clang, so all you should need |
8 | is a working knowledge of C++ and the command line. |
9 | |
10 | In order to work on the compiler, you need some basic knowledge of the |
11 | abstract syntax tree (AST). To this end, the reader is incouraged to |
12 | skim the :doc:`Introduction to the Clang |
13 | AST <IntroductionToTheClangAST>` |
14 | |
15 | Step 0: Obtaining Clang |
16 | ======================= |
17 | |
18 | As Clang is part of the LLVM project, you'll need to download LLVM's |
19 | source code first. Both Clang and LLVM are in the same git repository, |
20 | under different directories. For further information, see the `getting |
21 | started guide <https://llvm.org/docs/GettingStarted.html>`_. |
22 | |
23 | .. code-block:: console |
24 | |
25 | cd ~/clang-llvm |
26 | git clone https://github.com/llvm/llvm-project.git |
27 | |
28 | Next you need to obtain the CMake build system and Ninja build tool. |
29 | |
30 | .. code-block:: console |
31 | |
32 | cd ~/clang-llvm |
33 | git clone https://github.com/martine/ninja.git |
34 | cd ninja |
35 | git checkout release |
36 | ./bootstrap.py |
37 | sudo cp ninja /usr/bin/ |
38 | |
39 | cd ~/clang-llvm |
40 | git clone git://cmake.org/stage/cmake.git |
41 | cd cmake |
42 | git checkout next |
43 | ./bootstrap |
44 | make |
45 | sudo make install |
46 | |
47 | Okay. Now we'll build Clang! |
48 | |
49 | .. code-block:: console |
50 | |
51 | cd ~/clang-llvm |
52 | mkdir build && cd build |
53 | cmake -G Ninja ../llvm -DLLVM_ENABLE_PROJECTS=clang -DLLVM_BUILD_TESTS=ON # Enable tests; default is off. |
54 | ninja |
55 | ninja check # Test LLVM only. |
56 | ninja clang-test # Test Clang only. |
57 | ninja install |
58 | |
59 | And we're live. |
60 | |
61 | All of the tests should pass. |
62 | |
63 | Finally, we want to set Clang as its own compiler. |
64 | |
65 | .. code-block:: console |
66 | |
67 | cd ~/clang-llvm/build |
68 | ccmake ../llvm |
69 | |
70 | The second command will bring up a GUI for configuring Clang. You need |
71 | to set the entry for ``CMAKE_CXX_COMPILER``. Press ``'t'`` to turn on |
72 | advanced mode. Scroll down to ``CMAKE_CXX_COMPILER``, and set it to |
73 | ``/usr/bin/clang++``, or wherever you installed it. Press ``'c'`` to |
74 | configure, then ``'g'`` to generate CMake's files. |
75 | |
76 | Finally, run ninja one last time, and you're done. |
77 | |
78 | Step 1: Create a ClangTool |
79 | ========================== |
80 | |
81 | Now that we have enough background knowledge, it's time to create the |
82 | simplest productive ClangTool in existence: a syntax checker. While this |
83 | already exists as ``clang-check``, it's important to understand what's |
84 | going on. |
85 | |
86 | First, we'll need to create a new directory for our tool and tell CMake |
87 | that it exists. As this is not going to be a core clang tool, it will |
88 | live in the ``tools/extra`` repository. |
89 | |
90 | .. code-block:: console |
91 | |
92 | cd ~/clang-llvm/llvm/tools/clang |
93 | mkdir tools/extra/loop-convert |
94 | echo 'add_subdirectory(loop-convert)' >> tools/extra/CMakeLists.txt |
95 | vim tools/extra/loop-convert/CMakeLists.txt |
96 | |
97 | CMakeLists.txt should have the following contents: |
98 | |
99 | :: |
100 | |
101 | set(LLVM_LINK_COMPONENTS support) |
102 | |
103 | add_clang_executable(loop-convert |
104 | LoopConvert.cpp |
105 | ) |
106 | target_link_libraries(loop-convert |
107 | PRIVATE |
108 | clangTooling |
109 | clangBasic |
110 | clangASTMatchers |
111 | ) |
112 | |
113 | With that done, Ninja will be able to compile our tool. Let's give it |
114 | something to compile! Put the following into |
115 | ``tools/extra/loop-convert/LoopConvert.cpp``. A detailed explanation of |
116 | why the different parts are needed can be found in the `LibTooling |
117 | documentation <LibTooling.html>`_. |
118 | |
119 | .. code-block:: c++ |
120 | |
121 | // Declares clang::SyntaxOnlyAction. |
122 | #include "clang/Frontend/FrontendActions.h" |
123 | #include "clang/Tooling/CommonOptionsParser.h" |
124 | #include "clang/Tooling/Tooling.h" |
125 | // Declares llvm::cl::extrahelp. |
126 | #include "llvm/Support/CommandLine.h" |
127 | |
128 | using namespace clang::tooling; |
129 | using namespace llvm; |
130 | |
131 | // Apply a custom category to all command-line options so that they are the |
132 | // only ones displayed. |
133 | static llvm::cl::OptionCategory MyToolCategory("my-tool options"); |
134 | |
135 | // CommonOptionsParser declares HelpMessage with a description of the common |
136 | // command-line options related to the compilation database and input files. |
137 | // It's nice to have this help message in all tools. |
138 | static cl::extrahelp CommonHelp(CommonOptionsParser::HelpMessage); |
139 | |
140 | // A help message for this specific tool can be added afterwards. |
141 | static cl::extrahelp MoreHelp("\nMore help text...\n"); |
142 | |
143 | int main(int argc, const char **argv) { |
144 | CommonOptionsParser OptionsParser(argc, argv, MyToolCategory); |
145 | ClangTool Tool(OptionsParser.getCompilations(), |
146 | OptionsParser.getSourcePathList()); |
147 | return Tool.run(newFrontendActionFactory<clang::SyntaxOnlyAction>().get()); |
148 | } |
149 | |
150 | And that's it! You can compile our new tool by running ninja from the |
151 | ``build`` directory. |
152 | |
153 | .. code-block:: console |
154 | |
155 | cd ~/clang-llvm/build |
156 | ninja |
157 | |
158 | You should now be able to run the syntax checker, which is located in |
159 | ``~/clang-llvm/build/bin``, on any source file. Try it! |
160 | |
161 | .. code-block:: console |
162 | |
163 | echo "int main() { return 0; }" > test.cpp |
164 | bin/loop-convert test.cpp -- |
165 | |
166 | Note the two dashes after we specify the source file. The additional |
167 | options for the compiler are passed after the dashes rather than loading |
168 | them from a compilation database - there just aren't any options needed |
169 | right now. |
170 | |
171 | Intermezzo: Learn AST matcher basics |
172 | ==================================== |
173 | |
174 | Clang recently introduced the :doc:`ASTMatcher |
175 | library <LibASTMatchers>` to provide a simple, powerful, and |
176 | concise way to describe specific patterns in the AST. Implemented as a |
177 | DSL powered by macros and templates (see |
178 | `ASTMatchers.h <../doxygen/ASTMatchers_8h_source.html>`_ if you're |
179 | curious), matchers offer the feel of algebraic data types common to |
180 | functional programming languages. |
181 | |
182 | For example, suppose you wanted to examine only binary operators. There |
183 | is a matcher to do exactly that, conveniently named ``binaryOperator``. |
184 | I'll give you one guess what this matcher does: |
185 | |
186 | .. code-block:: c++ |
187 | |
188 | binaryOperator(hasOperatorName("+"), hasLHS(integerLiteral(equals(0)))) |
189 | |
190 | Shockingly, it will match against addition expressions whose left hand |
191 | side is exactly the literal 0. It will not match against other forms of |
192 | 0, such as ``'\0'`` or ``NULL``, but it will match against macros that |
193 | expand to 0. The matcher will also not match against calls to the |
194 | overloaded operator ``'+'``, as there is a separate ``operatorCallExpr`` |
195 | matcher to handle overloaded operators. |
196 | |
197 | There are AST matchers to match all the different nodes of the AST, |
198 | narrowing matchers to only match AST nodes fulfilling specific criteria, |
199 | and traversal matchers to get from one kind of AST node to another. For |
200 | a complete list of AST matchers, take a look at the `AST Matcher |
201 | References <LibASTMatchersReference.html>`_ |
202 | |
203 | All matcher that are nouns describe entities in the AST and can be |
204 | bound, so that they can be referred to whenever a match is found. To do |
205 | so, simply call the method ``bind`` on these matchers, e.g.: |
206 | |
207 | .. code-block:: c++ |
208 | |
209 | variable(hasType(isInteger())).bind("intvar") |
210 | |
211 | Step 2: Using AST matchers |
212 | ========================== |
213 | |
214 | Okay, on to using matchers for real. Let's start by defining a matcher |
215 | which will capture all ``for`` statements that define a new variable |
216 | initialized to zero. Let's start with matching all ``for`` loops: |
217 | |
218 | .. code-block:: c++ |
219 | |
220 | forStmt() |
221 | |
222 | Next, we want to specify that a single variable is declared in the first |
223 | portion of the loop, so we can extend the matcher to |
224 | |
225 | .. code-block:: c++ |
226 | |
227 | forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl())))) |
228 | |
229 | Finally, we can add the condition that the variable is initialized to |
230 | zero. |
231 | |
232 | .. code-block:: c++ |
233 | |
234 | forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( |
235 | hasInitializer(integerLiteral(equals(0)))))))) |
236 | |
237 | It is fairly easy to read and understand the matcher definition ("match |
238 | loops whose init portion declares a single variable which is initialized |
239 | to the integer literal 0"), but deciding that every piece is necessary |
240 | is more difficult. Note that this matcher will not match loops whose |
241 | variables are initialized to ``'\0'``, ``0.0``, ``NULL``, or any form of |
242 | zero besides the integer 0. |
243 | |
244 | The last step is giving the matcher a name and binding the ``ForStmt`` |
245 | as we will want to do something with it: |
246 | |
247 | .. code-block:: c++ |
248 | |
249 | StatementMatcher LoopMatcher = |
250 | forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( |
251 | hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop"); |
252 | |
253 | Once you have defined your matchers, you will need to add a little more |
254 | scaffolding in order to run them. Matchers are paired with a |
255 | ``MatchCallback`` and registered with a ``MatchFinder`` object, then run |
256 | from a ``ClangTool``. More code! |
257 | |
258 | Add the following to ``LoopConvert.cpp``: |
259 | |
260 | .. code-block:: c++ |
261 | |
262 | #include "clang/ASTMatchers/ASTMatchers.h" |
263 | #include "clang/ASTMatchers/ASTMatchFinder.h" |
264 | |
265 | using namespace clang; |
266 | using namespace clang::ast_matchers; |
267 | |
268 | StatementMatcher LoopMatcher = |
269 | forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( |
270 | hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop"); |
271 | |
272 | class LoopPrinter : public MatchFinder::MatchCallback { |
273 | public : |
274 | virtual void run(const MatchFinder::MatchResult &Result) { |
275 | if (const ForStmt *FS = Result.Nodes.getNodeAs<clang::ForStmt>("forLoop")) |
276 | FS->dump(); |
277 | } |
278 | }; |
279 | |
280 | And change ``main()`` to: |
281 | |
282 | .. code-block:: c++ |
283 | |
284 | int main(int argc, const char **argv) { |
285 | CommonOptionsParser OptionsParser(argc, argv, MyToolCategory); |
286 | ClangTool Tool(OptionsParser.getCompilations(), |
287 | OptionsParser.getSourcePathList()); |
288 | |
289 | LoopPrinter Printer; |
290 | MatchFinder Finder; |
291 | Finder.addMatcher(LoopMatcher, &Printer); |
292 | |
293 | return Tool.run(newFrontendActionFactory(&Finder).get()); |
294 | } |
295 | |
296 | Now, you should be able to recompile and run the code to discover for |
297 | loops. Create a new file with a few examples, and test out our new |
298 | handiwork: |
299 | |
300 | .. code-block:: console |
301 | |
302 | cd ~/clang-llvm/llvm/llvm_build/ |
303 | ninja loop-convert |
304 | vim ~/test-files/simple-loops.cc |
305 | bin/loop-convert ~/test-files/simple-loops.cc |
306 | |
307 | Step 3.5: More Complicated Matchers |
308 | =================================== |
309 | |
310 | Our simple matcher is capable of discovering for loops, but we would |
311 | still need to filter out many more ourselves. We can do a good portion |
312 | of the remaining work with some cleverly chosen matchers, but first we |
313 | need to decide exactly which properties we want to allow. |
314 | |
315 | How can we characterize for loops over arrays which would be eligible |
316 | for translation to range-based syntax? Range based loops over arrays of |
317 | size ``N`` that: |
318 | |
319 | - start at index ``0`` |
320 | - iterate consecutively |
321 | - end at index ``N-1`` |
322 | |
323 | We already check for (1), so all we need to add is a check to the loop's |
324 | condition to ensure that the loop's index variable is compared against |
325 | ``N`` and another check to ensure that the increment step just |
326 | increments this same variable. The matcher for (2) is straightforward: |
327 | require a pre- or post-increment of the same variable declared in the |
328 | init portion. |
329 | |
330 | Unfortunately, such a matcher is impossible to write. Matchers contain |
331 | no logic for comparing two arbitrary AST nodes and determining whether |
332 | or not they are equal, so the best we can do is matching more than we |
333 | would like to allow, and punting extra comparisons to the callback. |
334 | |
335 | In any case, we can start building this sub-matcher. We can require that |
336 | the increment step be a unary increment like this: |
337 | |
338 | .. code-block:: c++ |
339 | |
340 | hasIncrement(unaryOperator(hasOperatorName("++"))) |
341 | |
342 | Specifying what is incremented introduces another quirk of Clang's AST: |
343 | Usages of variables are represented as ``DeclRefExpr``'s ("declaration |
344 | reference expressions") because they are expressions which refer to |
345 | variable declarations. To find a ``unaryOperator`` that refers to a |
346 | specific declaration, we can simply add a second condition to it: |
347 | |
348 | .. code-block:: c++ |
349 | |
350 | hasIncrement(unaryOperator( |
351 | hasOperatorName("++"), |
352 | hasUnaryOperand(declRefExpr()))) |
353 | |
354 | Furthermore, we can restrict our matcher to only match if the |
355 | incremented variable is an integer: |
356 | |
357 | .. code-block:: c++ |
358 | |
359 | hasIncrement(unaryOperator( |
360 | hasOperatorName("++"), |
361 | hasUnaryOperand(declRefExpr(to(varDecl(hasType(isInteger()))))))) |
362 | |
363 | And the last step will be to attach an identifier to this variable, so |
364 | that we can retrieve it in the callback: |
365 | |
366 | .. code-block:: c++ |
367 | |
368 | hasIncrement(unaryOperator( |
369 | hasOperatorName("++"), |
370 | hasUnaryOperand(declRefExpr(to( |
371 | varDecl(hasType(isInteger())).bind("incrementVariable")))))) |
372 | |
373 | We can add this code to the definition of ``LoopMatcher`` and make sure |
374 | that our program, outfitted with the new matcher, only prints out loops |
375 | that declare a single variable initialized to zero and have an increment |
376 | step consisting of a unary increment of some variable. |
377 | |
378 | Now, we just need to add a matcher to check if the condition part of the |
379 | ``for`` loop compares a variable against the size of the array. There is |
380 | only one problem - we don't know which array we're iterating over |
381 | without looking at the body of the loop! We are again restricted to |
382 | approximating the result we want with matchers, filling in the details |
383 | in the callback. So we start with: |
384 | |
385 | .. code-block:: c++ |
386 | |
387 | hasCondition(binaryOperator(hasOperatorName("<")) |
388 | |
389 | It makes sense to ensure that the left-hand side is a reference to a |
390 | variable, and that the right-hand side has integer type. |
391 | |
392 | .. code-block:: c++ |
393 | |
394 | hasCondition(binaryOperator( |
395 | hasOperatorName("<"), |
396 | hasLHS(declRefExpr(to(varDecl(hasType(isInteger()))))), |
397 | hasRHS(expr(hasType(isInteger()))))) |
398 | |
399 | Why? Because it doesn't work. Of the three loops provided in |
400 | ``test-files/simple.cpp``, zero of them have a matching condition. A |
401 | quick look at the AST dump of the first for loop, produced by the |
402 | previous iteration of loop-convert, shows us the answer: |
403 | |
404 | :: |
405 | |
406 | (ForStmt 0x173b240 |
407 | (DeclStmt 0x173afc8 |
408 | 0x173af50 "int i = |
409 | (IntegerLiteral 0x173afa8 'int' 0)") |
410 | <<>> |
411 | (BinaryOperator 0x173b060 '_Bool' '<' |
412 | (ImplicitCastExpr 0x173b030 'int' |
413 | (DeclRefExpr 0x173afe0 'int' lvalue Var 0x173af50 'i' 'int')) |
414 | (ImplicitCastExpr 0x173b048 'int' |
415 | (DeclRefExpr 0x173b008 'const int' lvalue Var 0x170fa80 'N' 'const int'))) |
416 | (UnaryOperator 0x173b0b0 'int' lvalue prefix '++' |
417 | (DeclRefExpr 0x173b088 'int' lvalue Var 0x173af50 'i' 'int')) |
418 | (CompoundStatement ... |
419 | |
420 | We already know that the declaration and increments both match, or this |
421 | loop wouldn't have been dumped. The culprit lies in the implicit cast |
422 | applied to the first operand (i.e. the LHS) of the less-than operator, |
423 | an L-value to R-value conversion applied to the expression referencing |
424 | ``i``. Thankfully, the matcher library offers a solution to this problem |
425 | in the form of ``ignoringParenImpCasts``, which instructs the matcher to |
426 | ignore implicit casts and parentheses before continuing to match. |
427 | Adjusting the condition operator will restore the desired match. |
428 | |
429 | .. code-block:: c++ |
430 | |
431 | hasCondition(binaryOperator( |
432 | hasOperatorName("<"), |
433 | hasLHS(ignoringParenImpCasts(declRefExpr( |
434 | to(varDecl(hasType(isInteger())))))), |
435 | hasRHS(expr(hasType(isInteger()))))) |
436 | |
437 | After adding binds to the expressions we wished to capture and |
438 | extracting the identifier strings into variables, we have array-step-2 |
439 | completed. |
440 | |
441 | Step 4: Retrieving Matched Nodes |
442 | ================================ |
443 | |
444 | So far, the matcher callback isn't very interesting: it just dumps the |
445 | loop's AST. At some point, we will need to make changes to the input |
446 | source code. Next, we'll work on using the nodes we bound in the |
447 | previous step. |
448 | |
449 | The ``MatchFinder::run()`` callback takes a |
450 | ``MatchFinder::MatchResult&`` as its parameter. We're most interested in |
451 | its ``Context`` and ``Nodes`` members. Clang uses the ``ASTContext`` |
452 | class to represent contextual information about the AST, as the name |
453 | implies, though the most functionally important detail is that several |
454 | operations require an ``ASTContext*`` parameter. More immediately useful |
455 | is the set of matched nodes, and how we retrieve them. |
456 | |
457 | Since we bind three variables (identified by ConditionVarName, |
458 | InitVarName, and IncrementVarName), we can obtain the matched nodes by |
459 | using the ``getNodeAs()`` member function. |
460 | |
461 | In ``LoopConvert.cpp`` add |
462 | |
463 | .. code-block:: c++ |
464 | |
465 | #include "clang/AST/ASTContext.h" |
466 | |
467 | Change ``LoopMatcher`` to |
468 | |
469 | .. code-block:: c++ |
470 | |
471 | StatementMatcher LoopMatcher = |
472 | forStmt(hasLoopInit(declStmt( |
473 | hasSingleDecl(varDecl(hasInitializer(integerLiteral(equals(0)))) |
474 | .bind("initVarName")))), |
475 | hasIncrement(unaryOperator( |
476 | hasOperatorName("++"), |
477 | hasUnaryOperand(declRefExpr( |
478 | to(varDecl(hasType(isInteger())).bind("incVarName")))))), |
479 | hasCondition(binaryOperator( |
480 | hasOperatorName("<"), |
481 | hasLHS(ignoringParenImpCasts(declRefExpr( |
482 | to(varDecl(hasType(isInteger())).bind("condVarName"))))), |
483 | hasRHS(expr(hasType(isInteger())))))).bind("forLoop"); |
484 | |
485 | And change ``LoopPrinter::run`` to |
486 | |
487 | .. code-block:: c++ |
488 | |
489 | void LoopPrinter::run(const MatchFinder::MatchResult &Result) { |
490 | ASTContext *Context = Result.Context; |
491 | const ForStmt *FS = Result.Nodes.getNodeAs<ForStmt>("forLoop"); |
492 | // We do not want to convert header files! |
493 | if (!FS || !Context->getSourceManager().isWrittenInMainFile(FS->getForLoc())) |
494 | return; |
495 | const VarDecl *IncVar = Result.Nodes.getNodeAs<VarDecl>("incVarName"); |
496 | const VarDecl *CondVar = Result.Nodes.getNodeAs<VarDecl>("condVarName"); |
497 | const VarDecl *InitVar = Result.Nodes.getNodeAs<VarDecl>("initVarName"); |
498 | |
499 | if (!areSameVariable(IncVar, CondVar) || !areSameVariable(IncVar, InitVar)) |
500 | return; |
501 | llvm::outs() << "Potential array-based loop discovered.\n"; |
502 | } |
503 | |
504 | Clang associates a ``VarDecl`` with each variable to represent the variable's |
505 | declaration. Since the "canonical" form of each declaration is unique by |
506 | address, all we need to do is make sure neither ``ValueDecl`` (base class of |
507 | ``VarDecl``) is ``NULL`` and compare the canonical Decls. |
508 | |
509 | .. code-block:: c++ |
510 | |
511 | static bool areSameVariable(const ValueDecl *First, const ValueDecl *Second) { |
512 | return First && Second && |
513 | First->getCanonicalDecl() == Second->getCanonicalDecl(); |
514 | } |
515 | |
516 | If execution reaches the end of ``LoopPrinter::run()``, we know that the |
517 | loop shell that looks like |
518 | |
519 | .. code-block:: c++ |
520 | |
521 | for (int i= 0; i < expr(); ++i) { ... } |
522 | |
523 | For now, we will just print a message explaining that we found a loop. |
524 | The next section will deal with recursively traversing the AST to |
525 | discover all changes needed. |
526 | |
527 | As a side note, it's not as trivial to test if two expressions are the same, |
528 | though Clang has already done the hard work for us by providing a way to |
529 | canonicalize expressions: |
530 | |
531 | .. code-block:: c++ |
532 | |
533 | static bool areSameExpr(ASTContext *Context, const Expr *First, |
534 | const Expr *Second) { |
535 | if (!First || !Second) |
536 | return false; |
537 | llvm::FoldingSetNodeID FirstID, SecondID; |
538 | First->Profile(FirstID, *Context, true); |
539 | Second->Profile(SecondID, *Context, true); |
540 | return FirstID == SecondID; |
541 | } |
542 | |
543 | This code relies on the comparison between two |
544 | ``llvm::FoldingSetNodeIDs``. As the documentation for |
545 | ``Stmt::Profile()`` indicates, the ``Profile()`` member function builds |
546 | a description of a node in the AST, based on its properties, along with |
547 | those of its children. ``FoldingSetNodeID`` then serves as a hash we can |
548 | use to compare expressions. We will need ``areSameExpr`` later. Before |
549 | you run the new code on the additional loops added to |
550 | test-files/simple.cpp, try to figure out which ones will be considered |
551 | potentially convertible. |
552 | |