Quarterly Bulletin
Parallel processing improves performance

OpenCL and AMD Embedded G-Series APUs


“OpenCL is especially beneficial when the amount of data is large, the computations are complex, and you are able to reformulate your algorithm. In most cases you have to think of your algorithm and data a lot differently. It is a new way of thinking about your problem,” says Todd Roberts, S/W Engineering and Technical Marketing Manager at AMD.

Parallel processing is not new. Dual, quad and multicore processors have existed on the market for quite some time now. The new aspect is using all processing units in the system, each with its specialized capabilities. Using the combination of CPUs and GPUs together in concert offers the opportunity for major improvements in performance.

Great examples are found in the AMD Embedded G-Series APUs. They are used, for instance, in the Hectronic H6059 Qseven module and the Hectronic H6813 Mini-ITX motherboard. The APU options combine single- or dual-core x86 CPUs with powerful on-die AMD Radeon HD 6310 or 6250 GPUs.

Making wider use of CPU and GPU

The OpenCL standard greatly simplifies programming these processor platforms to utilize the single/dual core CPUs and AMD Radeon GPUs. The standard helps in preserving the investments in software development for parallel processing across product lines and through generation shifts.


Todd Roberts describes in practice how to write software applications based on the OpenCL standard in a white paper further down on this page. The white paper targets software engineers interested in a step-by-step practical introduction.



But first, we’ll focus on where to use OpenCL. I ask Todd Roberts to identify the application types with the potential to improve performance by using the GPU for processing data, apart from the more obvious task of graphics. This is sometimes called General-Purpose computing on Graphics Processing Units, GPGPU.
“Image processing and machine vision applications are really close to what GPUs were designed to do. I know there has been a lot of work around image processing in the medical arena,” he says.


Evaluating network packet processing

Data communications is another area of interest. There has also been a lot of work around using OpenCL for network packet inspection and packet processing applications. In comparison with image and video processing, which is spot on for the GPU, the amount of data is smaller and the computations are less complex.
“We have found that if not handled properly the latency of getting the data into and out of the GPU is significant in these applications. You have to be very clever about how to use OpenCL and GPGPU for it to make sense.”

With current architectures, the latency of moving data to and from the GPU is the price, so to speak, of using OpenCL and GPGPU. Applications with less data and less complex computations need clever arrangements and deep insight into the technical details to make the development effort worthwhile. In most cases, using OpenCL in real-world applications involves integrating it into existing software stacks. For example, in the packet processing case, you have to get inside the network stack, pick the data off at some point, send it to the GPU for processing, and then put it back into the stream. That complicates things quite a bit.


Adapting and tuning the software to the unique combination of CPU and GPU is crucial for optimal performance. OpenCL is in itself a programming standard independent of processor platform, but it’s one thing to merely have the application up and running and another thing to leverage the full potential of a specific platform and achieve the maximum performance.  AMD provides a set of tools to accelerate development on AMD platforms. There are tools for debugging, profiling, and analyzing the application to help engineers quickly realize their performance goals.


OpenCL optimized FFT and math libraries available

The parallel functionality of the GPU shows similarities with the characteristics of DSPs. According to Todd Roberts, AMD has put a lot of effort into supporting engineers that would like to replace DSPs with GPGPU and OpenCL.  “We’ve actually created various libraries like FFT libraries and math libraries that use OpenCL to perform various DSP and complex mathematics operations.”


AMD’s vision of the future is to a large extent reflected in the initiative to create the HSA Foundation. The letters stand for Heterogeneous System Architecture, and the foundation was formed in June 2012. The vision is one of future computer systems made up of a variety of different processing units, each with its dedicated types of tasks and capabilities.
“It largely revolves around the way data moves back and forth between processing units and the ability to put different processing units in a system and have a common set of interconnects and methods for sending work to them in a standardized way,” says Todd Roberts.


OpenCL is one step in the direction towards standardization of parallel processing in embedded PC systems. Software engineers that would like to investigate the possibilities a bit more in detail may find the following white paper interesting and informative.
“It provides an overview of the steps that you have to go through, such as setting up the queues, creating the kernel, compiling the kernel and feeding it to the GPU,” says Todd Roberts, author of the following white paper.


Introduction to OpenCL Programming - White Paper

OpenCL has a flexible execution model that incorporates both task and data parallelism. Tasks themselves are composed of data-parallel kernels, which apply a single function over a range of data elements in parallel. Data movements between the host and compute devices, as well as OpenCL tasks, are coordinated via command queues. Where the concept of a kernel usually refers to the fundamental level of an operating system, here the term identifies a piece of code that executes on a given processing element.

An OpenCL command queue is created by the developer through an API call, and associated with a specific compute device. To execute a kernel, the kernel is pushed onto a particular command queue. Enqueueing a kernel can be done asynchronously, so that the host program may enqueue many different kernels without waiting for any of them to complete. When enqueueing a kernel, the developer optionally specifies a list of events that must occur before the kernel executes. If a developer wishes to target multiple OpenCL compute devices simultaneously, the developer would create multiple command queues.


Specifying a dependence graph

Command queues provide a general way of specifying relationships between tasks, ensuring that tasks are executed in an order that satisfies the natural dependencies in the computation. The OpenCL runtime is free to execute tasks in parallel if their dependencies are satisfied, which provides a general-purpose task parallel execution model.

Events are generated by kernel completion, as well as memory read, write and copy commands.  This allows the developer to specify a dependence graph between kernel executions and memory transfers in a particular command queue or between command queues themselves, which the OpenCL runtime will traverse during execution. Figure 1 shows a task graph illustrating the power of this approach, where arrows indicate dependencies between tasks. For example, Kernel A will not execute until Write A and Write B have finished, and Kernel D will not execute until Kernel B and Kernel C have finished.


Click to enlarge the picture

Figure 1
Task parallelism within a command queue.

The ability to construct arbitrary task graphs is a powerful way of constructing task-parallel applications. The OpenCL runtime has the freedom to execute the task graph in parallel, as long as it respects the dependencies encoded in the task graph. Task graphs are general enough to represent the kinds of parallelism useful across the spectrum of hardware architectures, from CPUs to GPUs.
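As a rough illustration (plain Python, not the OpenCL API), the launch rule behind such a task graph can be sketched as: a task becomes ready only once every task it depends on has completed. The edges into Kernel B and Kernel C below are assumptions added for the sketch; the text above only specifies the edges into Kernel A and Kernel D.

```python
# Sketch (plain Python, not the OpenCL API) of traversing the Figure 1
# dependence graph: a task may start only when all tasks it depends on
# have completed, mirroring OpenCL event wait lists.
DEPS = {
    "Write A": [],
    "Write B": [],
    "Kernel A": ["Write A", "Write B"],    # from Figure 1
    "Kernel B": ["Kernel A"],              # assumed edge, for illustration
    "Kernel C": ["Kernel A"],              # assumed edge, for illustration
    "Kernel D": ["Kernel B", "Kernel C"],  # from Figure 1
}

def execution_order(deps):
    """Return one valid order; tasks whose dependencies are all satisfied
    become ready together, and a runtime is free to run those in parallel."""
    done, order = set(), []
    while len(done) < len(deps):
        ready = [t for t in deps
                 if t not in done and all(d in done for d in deps[t])]
        order.extend(sorted(ready))   # deterministic order for the example
        done.update(ready)
    return order

print(execution_order(DEPS))
```

A real runtime would launch all ready tasks concurrently; the sketch merely serializes one order that respects the graph.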

Using both data and task parallelism

Besides the task parallel constructs provided in OpenCL, which allow synchronization and communication between kernels, OpenCL supports local barrier synchronizations within a work group. This mechanism allows work items to coordinate and share data in the local memory space using only very lightweight and efficient barriers. Work items in different work groups should never try to synchronize or share data, since the runtime provides no guarantee that all work items are concurrently executing, and such synchronization easily introduces deadlocks.
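As a rough analogy in plain Python (not OpenCL C): the threads below stand in for the work items of a single work group, a shared list stands in for local memory, and threading.Barrier plays the role of OpenCL's barrier() in a tree-style sum.

```python
import threading

def workgroup_sum(values):
    """Simulate ONE work group reducing `values` to a sum. The shared
    `local` list stands in for __local memory; the barrier guarantees
    every work item has finished the previous step before any proceeds.
    len(values) must be a power of two."""
    n = len(values)
    local = list(values)                    # shared "local memory"
    barrier = threading.Barrier(n)

    def work_item(lid):
        stride = n // 2
        while stride >= 1:
            barrier.wait()                  # all work items must arrive
            if lid < stride:
                local[lid] += local[lid + stride]
            stride //= 2

    threads = [threading.Thread(target=work_item, args=(lid,))
               for lid in range(n)]
    for t in threads: t.start()
    for t in threads: t.join()
    return local[0]                         # the reduced result

print(workgroup_sum([1, 2, 3, 4, 5, 6, 7, 8]))   # prints 36
```

The barrier is what makes the tree reduction safe: each step reads only values written before the previous barrier, which is exactly the discipline OpenCL's local barrier enforces within a work group.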

Developers are also free to construct multiple command queues, either for parallelizing an application across multiple compute devices, or for expressing more parallelism via completely independent streams of computation. OpenCL’s ability to use both data and task parallelism simultaneously is a great benefit to parallel application developers, regardless of their intended hardware target.

As mentioned, OpenCL kernels provide data parallelism. The kernel execution model is based on a hierarchical abstraction of the computation being performed. OpenCL kernels are executed over an index space, which can be 1, 2 or 3 dimensional. In Figure 2, we see an example of a 2-dimensional index space, which has Gx * Gy elements. For every element of the kernel index space, a work item will be executed. All work items execute the same program, although their execution may differ due to branching based on data characteristics or the index assigned to each work item.


Click to enlarge the picture

Figure 2
Executing kernels - work groups and work items.


The index space is regularly subdivided into work groups, which are tilings of the entire index space. In Figure 2, we see a work group of size Sx * Sy elements. Each work item in the work group receives a work group ID, labeled (wx, wy) in the figure, as well as a local ID, labeled (sx, sy) in the figure. Each work item also receives a global ID, which can be derived from its work group and local IDs.
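A minimal sketch in Python of that derivation (the function name and tuple interface are mine, not OpenCL's): per dimension, the global ID is the work group ID times the work group size plus the local ID.

```python
def global_id(work_group_id, local_id, work_group_size):
    """Derive a work item's global ID from its work group ID and local ID,
    per dimension: g = w * S + s, in the Figure 2 notation (wx, wy),
    (sx, sy), and work group size Sx * Sy."""
    return tuple(w * S + s
                 for w, S, s in zip(work_group_id, work_group_size, local_id))

# Work item (sx, sy) = (3, 1) in work group (wx, wy) = (2, 0),
# with work groups of size (Sx, Sy) = (8, 8):
print(global_id((2, 0), (3, 1), (8, 8)))   # prints (19, 1)
```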

Work items in different work groups may coordinate execution through the use of atomic memory transactions, which are an OpenCL extension supported by some OpenCL runtimes. For example, work items may append variable numbers of results to a shared queue in global memory. However, it is good practice for work items in different work groups not to communicate directly: without careful design, scalability problems and deadlocks can become difficult to avoid. The hierarchy of synchronization and communication provided by OpenCL is a good fit for many of today’s parallel architectures, while still providing developers the ability to write efficient code, even for parallel computations with non-trivial synchronization and communication patterns.


Communication and synchronization

The work items may only communicate and synchronize locally, within a work group, via a barrier mechanism. This provides scalability, traditionally the bane of parallel programming. Because communication and synchronization at the finest granularity are restricted in scope, the OpenCL runtime has great freedom in how work items are scheduled and executed.

As already discussed, the core programming goal of OpenCL is to provide programmers with a data-parallel execution model. In practical terms this means that programmers can define a set of instructions that will be executed on a large number of data items at the same time. The most obvious example is to replace loops with functions (kernels) executing at each point in a problem domain.

Referring to Figures 3 and 4, let’s say you wanted to process a 1024 x 1024 image (your global problem dimension). You would initiate one kernel execution per pixel (1024 x 1024 = 1,048,576 kernel executions).

Figure 3 shows sample scalar code for processing an image. If you were writing very simple C code you would write a simple for loop, and in this for loop you would go from 1 to N and then perform your computation.


Click to enlarge the picture

Figure 3
Example of traditional loop (scalar).

An alternate way to do this would be in a data parallel fashion (Figure 4). In this case you logically read one element of a (*a) in parallel, multiply it by the corresponding element of b in parallel, and write the result to your output. You’ll notice that in Figure 4 there is no for loop: you get an ID value, read a value from a, multiply by a value from b and then write the output.

Click to enlarge the picture

Figure 4
Data parallel OpenCL.
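The contrast between Figures 3 and 4 can be sketched in plain Python (the actual figures use C and OpenCL C): a scalar loop versus a per-index kernel that a stand-in "launch" helper applies over the whole index space.

```python
def scalar_multiply(a, b):
    """Figure 3 style: one thread of control walks the whole range."""
    c = [0] * len(a)
    for i in range(len(a)):
        c[i] = a[i] * b[i]
    return c

def mul_kernel(gid, a, b, c):
    """Figure 4 style: no loop; each work item handles only the element
    at its own global ID."""
    c[gid] = a[gid] * b[gid]

def launch(kernel, global_size, *args):
    """Stand-in for enqueueing a kernel over a 1-D index space: one
    logical kernel instance per index (a real runtime runs them in
    parallel; this sketch just iterates)."""
    for gid in range(global_size):
        kernel(gid, *args)

a, b = [1, 2, 3, 4], [10, 20, 30, 40]
c = [0] * 4
launch(mul_kernel, 4, a, b, c)
print(c)                        # prints [10, 40, 90, 160]
print(scalar_multiply(a, b))    # same result, computed with the loop
```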

As stated above, a properly written OpenCL application will operate correctly on a wide range of systems. While this is true, it should be noted that each system and compute device available to OpenCL may have different resources and characteristics that allow and sometimes require some level of tuning to achieve optimal performance. For example, OpenCL memory object types and sizes can impact performance. In most cases key parameters can be gathered from the OpenCL runtime to tune the operation of the application. In addition, each vendor may choose to provide extensions that provide for more options to tune your application. In most cases these are parameters used with the OpenCL API and should not require extensive rewrite of the algorithms.

Developing an application using OpenCL

An OpenCL application is built by first querying the runtime to determine which platforms are present. There can be any number of different OpenCL implementations installed on a single system. The desired OpenCL platform can be selected by matching the platform vendor string to the desired vendor name, such as “Advanced Micro Devices, Inc.” The next step is to create a context. An OpenCL context has associated with it a number of compute devices (for example, CPU or GPU devices). Within a context, OpenCL guarantees a relaxed consistency between these devices. This means that memory objects, such as buffers or images, are allocated per context; but changes made by one device are only guaranteed to be visible by another device at well-defined synchronization points. For this, OpenCL provides events with the ability to synchronize on a given event to enforce the correct order of execution.

Most OpenCL programs follow the same pattern: given a specific platform, select a device or devices, create a context, allocate memory, create device-specific command queues, and perform data transfers and computations. Generally, the platform is the gateway to accessing specific devices; given these devices and a corresponding context, the rest of the application is independent of the platform. Given a context, the application can:

• Create one or more command queues.

• Create programs to run on one or more associated devices.

• Create kernels within those programs.

• Allocate memory buffers or images, either on the host or on the device(s). Memory can be copied between the host and device.

• Write data to the device.

• Submit the kernel (with appropriate arguments) to the command queue for execution.

• Read data back to the host from the device.


The relationship between context(s), device(s), buffer(s), program(s), kernel(s) and command queue(s) is best seen by looking at sample code.


Simple Buffer Write – OpenCL application example

Here is a simple programming example—a simple buffer write—with explanatory comments.

This code sample shows a minimalist OpenCL C program that sets a given buffer to some value. It illustrates the basic programming steps with a minimum amount of code. This sample contains no error checks and the code is not generalized. Yet, many simple test programs might look very similar. The entire code for this sample is provided in Code Block 1.


Click to enlarge the picture

Code Block 1


1. The host program must select a platform, which is an abstraction for a given OpenCL implementation. Implementations by multiple vendors can coexist on a host, and the sample uses the first one available.


2. A device ID for a GPU device is requested. A CPU device could be requested by using CL_DEVICE_TYPE_CPU instead. The device can be a physical device, such as a given GPU, or an abstracted device, such as the collection of all CPU cores on the host.

3. On the selected device, an OpenCL context is created. A context ties together a device, memory buffers related to that device, OpenCL programs and command queues. Note that buffers related to a device can reside on either the host or the device. Many OpenCL programs have only a single context, program and command queue.

4. Before an OpenCL kernel can be launched, its program source is compiled, and a handle to the kernel is created.

5. A memory buffer is allocated on the device.

6. The kernel is launched. While it is necessary to specify the global work size, OpenCL determines a good local work size for this device. Since the kernel was launched asynchronously, clFinish() is used to wait for completion.

7. The data is mapped to the host for examination. Calling clEnqueueMapBuffer ensures the visibility of the buffer on the host, which in this case probably includes a physical transfer. Alternatively, we could use clEnqueueReadBuffer(), which requires a pre-allocated host-side buffer.

OpenCL affords developers an elegant, non-proprietary programming platform to accelerate parallel processing performance for compute-intensive applications. With the ability to develop and maintain a single source code base that can be applied to CPUs, GPUs and APUs with equal ease, developers can achieve significant programming efficiency gains, reduce development costs, and speed their time-to-market.


Bits & Pieces bulletin

Technical articles, inspiring case studies, product news and updates. B&P is distributed quarterly to registered users.


Hectronic AB | Phone: +46 18 660 700 | E-mail: info@hectronic.se
© 2019 Hectronic