.. raw:: html

   <style type="text/css">
     .none { background-color: #FFCCCC }
     .partial { background-color: #FFFF99 }
     .good { background-color: #CCFF99 }
   </style>

.. role:: none
.. role:: partial
.. role:: good

.. contents::
   :local:

==================
OpenMP Support
==================

Clang supports the following OpenMP 5.0 features:

* The `reduction`-based clauses in the `task` and `target`-based directives.

* Support for the relational operator `!=` (not-equal) as one of the
  canonical forms of a random access iterator loop.

* Support for mapping of lambdas in target regions.

* Parsing/semantic analysis for the `requires` directive.

* Nested `declare target` directives.

* Implicit mapping of the `this` pointer as `map(this[:1])`.

* The `close` *map-type-modifier*.

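As an illustration of the `!=` canonical form, OpenMP 5.0 accepts a loop like
the following (a minimal sketch; the loop and reduction are illustrative, not
taken from the Clang sources):

.. code-block:: c

   #include <stdio.h>

   int main(void) {
     int sum = 0;
     /* OpenMP 5.0 accepts != as the loop test in a canonical loop form;
        earlier versions required <, <=, >, or >=. */
     #pragma omp parallel for reduction(+ : sum)
     for (int i = 0; i != 8; ++i)
       sum += i;
     printf("%d\n", sum); /* prints 28 */
     return 0;
   }

Compiled without `-fopenmp` the pragma is simply ignored and the loop runs
sequentially with the same result.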
Clang fully supports OpenMP 4.5. Clang supports offloading to X86_64, AArch64,
and PPC64[LE], and has `basic support for Cuda devices`_.

* #pragma omp declare simd: :partial:`Partial`. We support parsing/semantic
  analysis and generation of the special attributes for the X86 target, but
  the LLVM vectorization pass that consumes them is still missing.

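For example, a `declare simd` function like the following is parsed and
annotated for X86, though no vectorized variants are emitted yet (the
function itself is hypothetical, for illustration only):

.. code-block:: c

   #include <stdio.h>

   /* Clang parses the directive and attaches the X86 vector-ABI
      attributes, but the LLVM pass that would emit the SIMD variants is
      still missing, so only the scalar version is generated. */
   #pragma omp declare simd
   static float axpy(float a, float x, float y) { return a * x + y; }

   int main(void) {
     printf("%.1f\n", axpy(2.0f, 3.0f, 4.0f)); /* prints 10.0 */
     return 0;
   }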
In addition, the LLVM OpenMP runtime `libomp` supports the OpenMP Tools
Interface (OMPT) on x86, x86_64, AArch64, and PPC64 on Linux, Windows, and macOS.

General improvements
--------------------
- New collapse clause scheme to avoid expensive remainder operations.
  After collapsing a loop nest via the collapse clause, the loop index
  variables are now computed with multiplications and additions instead
  of the expensive remainder operation.

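The difference can be sketched in plain C (illustrative only; the actual
transformation happens inside Clang's code generation):

.. code-block:: c

   #include <stdio.h>

   /* Recovering (i, j) from the flattened counter t of a collapse(2)
      nest over a 3 x 4 iteration space.  The old scheme computed
      i = t / 4 and j = t % 4 on every iteration; the new scheme
      carries the indices forward with additions and a wrap. */
   int main(void) {
     enum { M = 3, N = 4 };
     int i = 0, j = 0;
     for (int t = 0; t < M * N; ++t) {
       printf("(%d,%d)%c", i, j, t == M * N - 1 ? '\n' : ' ');
       if (++j == N) { j = 0; ++i; }   /* increment and wrap, no div/mod */
     }
     return 0;
   }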
- The default schedules for the `distribute` and `for` constructs in a
  parallel region and in SPMD mode have changed to ensure coalesced
  accesses. For the `distribute` construct, a static schedule is used
  with a chunk size equal to the number of threads per team (the default
  number of threads, or the value of the `thread_limit` clause if
  present). For the `for` construct, the schedule is static with a chunk
  size of one.

- Simplified SPMD code generation for `distribute parallel for` when
  the new default schedules are applicable.

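Spelled out explicitly, the new defaults correspond roughly to the schedules
below (a sketch; the chunk size 128 merely stands in for whatever
threads-per-team value the runtime picks):

.. code-block:: c

   #include <stdio.h>

   enum { N = 4, M = 4 };

   int main(void) {
     int a[N * M];
     /* dist_schedule(static, <threads-per-team>) on distribute and
        schedule(static, 1) on the inner for: consecutive threads touch
        consecutive iterations, giving coalesced memory accesses. */
     #pragma omp target teams distribute dist_schedule(static, 128) map(from: a)
     for (int i = 0; i < N; ++i) {
       #pragma omp parallel for schedule(static, 1)
       for (int j = 0; j < M; ++j)
         a[i * M + j] = i + j;
     }
     int total = 0;
     for (int k = 0; k < N * M; ++k)
       total += a[k];
     printf("%d\n", total); /* prints 48 */
     return 0;
   }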
.. _basic support for Cuda devices:

Cuda devices support
====================

Directives execution modes
--------------------------

Clang code generation for target regions supports two modes: the SPMD and
non-SPMD modes. Clang chooses one of these two modes automatically based on the
way directives and clauses on those directives are used. The SPMD mode uses a
simplified set of runtime functions, thus increasing performance at the cost of
not supporting some OpenMP features. The non-SPMD mode is the most generic mode
and supports all currently available OpenMP features. The compiler will always
attempt to use the SPMD mode wherever possible. SPMD mode will not be used if:

- The target region contains an `if()` clause that refers to a `parallel`
  directive.

- The target region contains a `parallel` directive with a `num_threads()`
  clause.

- The target region contains user code (other than OpenMP-specific
  directives) in between the `target` and the `parallel` directives.

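For instance (a sketch, not taken from the Clang sources), the first region
below qualifies for SPMD mode while the second does not:

.. code-block:: c

   #include <stdio.h>

   int main(void) {
     int a[8], b[8];

     /* SPMD: the parallel directive immediately follows target, with no
        num_threads() clause and no if() clause referring to it. */
     #pragma omp target parallel for map(from: a)
     for (int i = 0; i < 8; ++i)
       a[i] = 2 * i;

     /* Non-SPMD: user code runs between target and parallel, so the
        region needs the sequential part of the generic scheme. */
     #pragma omp target map(from: b)
     {
       int scale = 3;               /* executed before the parallel part */
       #pragma omp parallel for
       for (int i = 0; i < 8; ++i)
         b[i] = scale * i;
     }

     printf("%d %d\n", a[7], b[7]); /* prints 14 21 */
     return 0;
   }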
Data-sharing modes
------------------

Clang supports two data-sharing models for Cuda devices: `Generic` and `Cuda`
modes. The default mode is `Generic`. `Cuda` mode can yield additional
performance and can be activated using the `-fopenmp-cuda-mode` flag. In
`Generic` mode, all local variables that can be shared in parallel regions
are stored in global memory. In `Cuda` mode, local variables are not shared
between the threads, and it is the user's responsibility to share the required
data between the threads in the parallel regions.

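The difference matters for code like the following sketch, where a variable
declared in the sequential part of a target region is read inside a parallel
region:

.. code-block:: c

   #include <stdio.h>

   int main(void) {
     int result = 0;
     #pragma omp target map(tofrom: result)
     {
       /* 'base' lives in the sequential part of the target region but is
          read by every thread of the parallel region.  In Generic mode
          Clang stores such variables in global memory so all threads can
          see them; under -fopenmp-cuda-mode they stay thread-local and
          sharing them correctly becomes the user's responsibility. */
       int base = 5;
       int sum = 0;
       #pragma omp parallel for reduction(+ : sum)
       for (int i = 0; i < 4; ++i)
         sum += base + i;
       result = sum;
     }
     printf("%d\n", result); /* prints 26 */
     return 0;
   }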
Collapsed loop nest counter
---------------------------

When using the collapse clause on a loop nest, the default behavior is to
automatically extend the representation of the loop counter to 64 bits for
the cases where the sizes of the collapsed loops are not known at compile
time. To prevent this conservative choice and use at most 32 bits,
compile your program with the `-fopenmp-optimistic-collapse` flag.

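The 64-bit default guards against overflow of the combined trip count, which
is the product of the individual loop counts. A quick arithmetic check with
illustrative values:

.. code-block:: c

   #include <stdint.h>
   #include <stdio.h>

   int main(void) {
     /* Two collapsed loops of 70000 iterations each: the combined trip
        count is 4.9e9, which no longer fits in a signed 32-bit counter
        (INT32_MAX is about 2.1e9). */
     int64_t trips = (int64_t)70000 * 70000;
     printf("%lld %d\n", (long long)trips, trips > INT32_MAX);
     return 0;
   }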

Features not supported or with limited support for Cuda devices
---------------------------------------------------------------

- Cancellation constructs are not supported.

- Doacross loop nests are not supported.

- User-defined reductions are supported only for trivial types.

- Nested parallelism: inner parallel regions are executed sequentially.

- Static linking of libraries containing device code is not supported yet.

- Automatic translation of math functions in target regions to device-specific
  math functions is not implemented yet.

- Debug information for OpenMP target regions is supported, but it may
  sometimes be necessary to manually specify the address class of the
  inspected variables. In some cases local variables are actually allocated
  in global memory, but the debug info may not reflect this.