Programming with OpenMP

The Intel® Fortran Compiler accepts a Fortran program containing OpenMP directives as input and produces a multithreaded version of the code. When the parallel program begins execution, a single thread exists. This thread is called the master thread. The master thread will continue to process serially until it encounters a parallel region.

Parallel Region

A parallel region is a block of code that must be executed by a team of threads in parallel. In the OpenMP Fortran API, a parallel construct is defined by placing OpenMP directives PARALLEL at the beginning and END PARALLEL at the end of the code segment. Code segments thus bounded can be executed in parallel.

A structured block of code is a collection of one or more executable statements with a single point of entry at the top and a single point of exit at the bottom.

The Intel Fortran Compiler supports worksharing and synchronization constructs. Each of these constructs consists of one or two specific OpenMP directives and sometimes the enclosed or following structured block of code. For complete definitions of constructs, see the OpenMP Fortran version 2.0 specifications.

At the end of the parallel region, threads wait until all team members have arrived. The team is logically disbanded (but may be reused in the next parallel region), and the master thread continues serial execution until it encounters the next parallel region.
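As a minimal sketch (the program and variable names below are illustrative, not taken from the Intel documentation), a parallel region looks like this:

```fortran
      program region_example
      use omp_lib            ! standard OpenMP runtime library module
      implicit none
      integer :: tid
! The master thread executes serially up to this point.
!$omp parallel private(tid)
      tid = omp_get_thread_num()
      print *, 'Hello from thread ', tid
!$omp end parallel
! The team has synchronized and logically disbanded; the master
! thread continues serial execution from here.
      end program region_example
```

Each member of the team executes the print statement once.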

Worksharing Construct

A worksharing construct divides the execution of the enclosed code among the members of the team created on entering the enclosing parallel region. When the master thread enters a parallel region, a team of threads is formed. From the beginning of the parallel region, code is replicated (executed by all team members) until a worksharing construct is encountered; at that point, the enclosed work is divided among the members of the team.

The OpenMP SECTIONS and DO constructs are defined as worksharing constructs because they distribute the enclosed work among the threads of the current team. A worksharing construct distributes work only if it is encountered during dynamic execution of a parallel region. If the worksharing construct occurs lexically inside the parallel region, then it is always executed by distributing the work among the team members. If the worksharing construct is not lexically (explicitly) enclosed by a parallel region (that is, it is orphaned), then it is distributed among the team members of the closest dynamically enclosing parallel region, if one exists. Otherwise, it is executed serially.

When a thread reaches the end of a worksharing construct, it may wait until all team members within that construct have completed their work. When all of the work defined by the worksharing construct is finished, the team exits the worksharing construct and continues executing the code that follows.
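A sketch of a DO worksharing construct inside a parallel region (the subroutine and array names are illustrative):

```fortran
      subroutine scale_array(a, n)
      implicit none
      integer, intent(in) :: n
      real, intent(inout) :: a(n)
      integer :: i
!$omp parallel shared(a, n) private(i)
! Statements placed here would be replicated (executed by every
! team member) until the worksharing construct below is reached.
!$omp do
      do i = 1, n
         a(i) = 2.0 * a(i)     ! iterations are divided among the team
      end do
!$omp end do
! Implied barrier here; it can be removed with the NOWAIT clause.
!$omp end parallel
      end subroutine scale_array
```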

A combined parallel/worksharing construct denotes a parallel region that contains only one worksharing construct.

Parallel Processing Directive Groups

The parallel processing directives include the following groups:

Parallel Region

Worksharing Construct

Combined Parallel/Worksharing Constructs

The combined parallel/worksharing constructs provide an abbreviated way to specify a parallel region that contains a single worksharing construct. The combined parallel/worksharing constructs are PARALLEL DO, PARALLEL SECTIONS, and PARALLEL WORKSHARE.
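For example, a PARALLEL DO directive is equivalent to a parallel region whose only content is a DO construct (the array and variable names below are illustrative):

```fortran
!$omp parallel do private(i) shared(a, n)
      do i = 1, n
         a(i) = a(i) + 1.0
      end do
!$omp end parallel do
```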

Synchronization and MASTER

Synchronization is the interthread communication that ensures the consistency of shared data and coordinates parallel execution among threads. Shared data is consistent within a team of threads when all threads obtain the identical value when the data is accessed. A synchronization construct is used to ensure this consistency of the shared data.
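A sketch combining the MASTER, CRITICAL, and BARRIER synchronization directives (the function partial_sum is hypothetical):

```fortran
      real :: total
      total = 0.0
!$omp parallel shared(total)
!$omp master
      print *, 'executed once, by the master thread only'
!$omp end master
!$omp critical
      total = total + partial_sum()   ! one thread at a time updates total
!$omp end critical
!$omp barrier
! After the barrier, every thread sees a consistent value of total.
!$omp end parallel
```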

See the list of OpenMP Directives and Clauses.

Data Sharing

Data sharing is specified at the start of a parallel region or worksharing construct by using the SHARED and PRIVATE clauses. All variables in the SHARED clause are shared among the members of a team; all variables in the PRIVATE clause are private to each team member. It is the application's responsibility to synchronize access to shared variables whenever the order of accesses matters.

In addition to SHARED and PRIVATE variables, individual variables and entire COMMON blocks can be privatized using the THREADPRIVATE directive.
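For example (the common-block and variable names are illustrative), each thread receives its own copy of a privatized common block:

```fortran
      subroutine count_calls
      implicit none
      integer :: ncalls
      common /stats/ ncalls
!$omp threadprivate(/stats/)
      ncalls = ncalls + 1    ! increments this thread's private copy
      end subroutine count_calls
```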

Orphaned Directives

OpenMP contains a feature called orphaning, which dramatically increases the expressiveness of parallel directives. Orphaning is a situation in which directives related to a parallel region are not required to occur lexically within a single program unit. Directives such as CRITICAL, BARRIER, SECTIONS, SINGLE, MASTER, and DO can occur by themselves in a program unit, dynamically "binding" to the enclosing parallel region at run time.

Orphaned directives enable parallelism to be inserted into existing code with a minimum of code restructuring. Orphaning can also improve performance by enabling a single parallel region to bind with multiple DO directives located within called subroutines. Consider the following code segment:

!$omp parallel
      call phase1
      call phase2
!$omp end parallel

      subroutine phase1
!$omp do private(i) shared(n)
      do i = 1, n
         call some_work(i)
      end do
!$omp end do
      end subroutine phase1

      subroutine phase2
!$omp do private(j) shared(n)
      do j = 1, n
         call more_work(j)
      end do
!$omp end do
      end subroutine phase2

The following usage rules apply to orphaned directives:

      1. An orphaned worksharing construct (SECTIONS, SINGLE, or DO) that is encountered outside the dynamic extent of a parallel region is executed by a team consisting of one thread, that is, serially.

      2. Any collective operation (worksharing construct or BARRIER) executed inside another worksharing construct is illegal.

      3. It is illegal to execute a collective operation (worksharing construct or BARRIER) from within a synchronization region (CRITICAL or ORDERED).

Preparing Code for OpenMP Processing

The following are the major stages and steps of preparing your code for using OpenMP. Typically, the first two stages can be done on uniprocessor or multiprocessor systems; later stages are typically done only on multiprocessor systems.

Before Inserting OpenMP Directives

Before inserting any OpenMP parallel directives, verify that your code is safe for parallel execution by doing the following:


Analysis includes the following major actions:


To restructure your program for successful OpenMP implementation, you can perform some or all of the following actions:

      1. If a chosen loop can execute its iterations in parallel, introduce a PARALLEL DO construct around the loop.

      2. Try to remove any cross-iteration dependencies by rewriting the algorithm.

      3. Synchronize the remaining cross-iteration dependencies by placing CRITICAL constructs around the uses of, and assignments to, the variables involved in the dependencies.

      4. List the variables that are present in the loop within the appropriate SHARED, PRIVATE, LASTPRIVATE, FIRSTPRIVATE, or REDUCTION clauses.
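The steps above can be sketched on a dot-product loop (the function and array names are illustrative); here the cross-iteration dependency on the accumulator is handled with a REDUCTION clause rather than a CRITICAL construct:

```fortran
      real function dot(a, b, n)
      implicit none
      integer, intent(in) :: n
      real, intent(in)    :: a(n), b(n)
      real    :: s
      integer :: i
      s = 0.0
! i is PRIVATE; a, b, and n are SHARED; s is a REDUCTION variable,
! so each thread accumulates privately and the partial sums are
! combined at the end of the construct.
!$omp parallel do private(i) shared(a, b, n) reduction(+:s)
      do i = 1, n
         s = s + a(i) * b(i)
      end do
!$omp end parallel do
      dot = s
      end function dot
```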


The tuning process should include minimizing the amount of sequential code in critical sections and load balancing by using the SCHEDULE clause or the OMP_SCHEDULE environment variable.
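For example (the routine name is illustrative), SCHEDULE(RUNTIME) defers the scheduling decision to the OMP_SCHEDULE environment variable, which is convenient while tuning:

```fortran
! Load-balance iterations of varying cost. With SCHEDULE(RUNTIME), the
! schedule is read from OMP_SCHEDULE at run time, for example:
!     setenv OMP_SCHEDULE "dynamic,4"
!$omp parallel do schedule(runtime) private(i) shared(n)
      do i = 1, n
         call uneven_work(i)   ! hypothetical routine; cost varies per i
      end do
!$omp end parallel do
```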

This step is typically performed on a multiprocessor system.