First Operation: Map
Our approach to data parallelization starts with the map, filter and reduce operations over arrays. Most common operations over large data sets can be expressed with these three operations, each applied with some function and chained together, as in the example below.
headlines = news.filter(_.section == "main")
.map( _.title ).reduce( _ + "<br />" + _)
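For comparison, the same chain can be sketched in plain Java with the standard streams API. The `NewsItem` class and the sample data are made up for illustration; only the filter, map and reduce steps mirror the snippet above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for one entry of the `news` collection.
class NewsItem {
    final String section;
    final String title;

    NewsItem(String section, String title) {
        this.section = section;
        this.title = title;
    }
}

class HeadlinesExample {
    // filter -> map -> reduce, in the same order as the chain above.
    static Optional<String> headlines(List<NewsItem> news) {
        return news.stream()
                .filter(n -> n.section.equals("main"))   // keep the main section
                .map(n -> n.title)                        // NewsItem -> String
                .reduce((a, b) -> a + "<br />" + b);      // join the titles
    }

    public static void main(String[] args) {
        List<NewsItem> news = Arrays.asList(
                new NewsItem("main", "GPUs get faster"),
                new NewsItem("sports", "Local team wins"),
                new NewsItem("main", "New language released"));
        System.out.println(headlines(news).orElse(""));
    }
}
```

The reduce step returns an `Optional` because the filtered list may be empty, in which case there is nothing to join.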
Of those three operations, we’ll start with map, which converts each input element into an output element, possibly of a different type, according to a lambda function.
Example of Map usage in ÆminiumParallel that just doubles the values in the input array:
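As a CPU-only stand-in, the doubling can be sketched with `java.util.stream`. This is not the ÆminiumParallel API (its classes are not shown here); the `DoubleValues` class is a hypothetical name used only to show what the map computes.

```java
import java.util.Arrays;

class DoubleValues {
    // Applies the lambda x -> x * 2 to every element and returns a new array.
    static int[] doubled(int[] input) {
        return Arrays.stream(input).map(x -> x * 2).toArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(doubled(new int[] {1, 2, 3, 4})));
    }
}
```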
When the map method is called, it creates an instance of the Map operation that, like any other GPU-powered operation, goes through the following steps:
- Preparing Source
- Preparing Buffers
- Executing
- Retrieving results
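The four steps can be pictured as a small template. The sketch below runs entirely on the CPU, and every class and method name in it is hypothetical; it only shows the order in which the steps happen, not how the real operation talks to the GPU.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical skeleton of a GPU-style operation; names are illustrative only.
abstract class GpuOperationSketch {
    abstract void prepareSource();
    abstract void prepareBuffers();
    abstract void execute();
    abstract int[] retrieveResults();

    // Runs the four steps in the order listed above.
    final int[] run() {
        prepareSource();
        prepareBuffers();
        execute();
        return retrieveResults();
    }
}

// CPU-only stand-in for the Map operation over int arrays.
class CpuMapSketch extends GpuOperationSketch {
    private final int[] input;
    private final IntUnaryOperator lambda;
    private IntUnaryOperator kernel;
    private int[] output;

    CpuMapSketch(int[] input, IntUnaryOperator lambda) {
        this.input = input;
        this.lambda = lambda;
    }

    @Override
    void prepareSource() {
        // A GPU version would translate the lambda to OpenCL here.
        kernel = lambda;
    }

    @Override
    void prepareBuffers() {
        // A GPU version would allocate device buffers and copy the input.
        output = new int[input.length];
    }

    @Override
    void execute() {
        for (int i = 0; i < input.length; i++) {
            output[i] = kernel.applyAsInt(input[i]);
        }
    }

    @Override
    int[] retrieveResults() {
        // A GPU version would copy the device buffer back to the host.
        return output.clone();
    }
}
```

With this sketch, `new CpuMapSketch(new int[] {1, 2, 3}, x -> x * 2).run()` goes through all four steps and returns the doubled values.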
Preparing Source
Right now we are exploring the possibility of using Java reflection on the lambda object to extract its body and translate it to OpenCL. However, we may go back to the compiler approach later on.
Status: Work in Progress
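To make the target of that translation concrete, here is a minimal sketch of generating OpenCL C source for an element-wise map over ints. It assumes the lambda body has already been recovered as a string expression over a variable `x`; the reflection step itself is not shown, and `KernelSourceSketch` and `mapKernel` are hypothetical names.

```java
class KernelSourceSketch {
    // Builds OpenCL C source for an element-wise map over int arrays.
    // `body` is an expression over the variable `x`, for example "x * 2".
    static String mapKernel(String kernelName, String body) {
        return "__kernel void " + kernelName + "(__global const int* input,\n"
             + "                     __global int* output) {\n"
             + "    int i = get_global_id(0);\n"   // one work-item per element
             + "    int x = input[i];\n"
             + "    output[i] = " + body + ";\n"
             + "}\n";
    }

    public static void main(String[] args) {
        System.out.println(mapKernel("double_all", "x * 2"));
    }
}
```

A real translator would also have to reject bodies that use types or operators OpenCL does not support.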
Preparing Buffers
In order to use the GPU, we must allocate read-only, read/write or write-only buffers in GPU memory and fill them with the initial values.
Status: Done; support for more types still needs to be added.
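The host side of this step can be sketched with standard `java.nio` direct buffers, which live outside the Java heap and are the usual way to hand data to native code. The device-side allocation is not shown, and `HostBufferSketch` is a hypothetical name rather than part of the library.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

class HostBufferSketch {
    // Allocates a direct int buffer in the platform's native byte order.
    static IntBuffer outputBuffer(int length) {
        return ByteBuffer.allocateDirect(length * Integer.BYTES)
                .order(ByteOrder.nativeOrder())
                .asIntBuffer();
    }

    // Same allocation, filled with the initial values of the input.
    static IntBuffer inputBuffer(int[] values) {
        IntBuffer buf = outputBuffer(values.length);
        buf.put(values);
        buf.rewind();   // reset the position so readers start at element 0
        return buf;
    }

    public static void main(String[] args) {
        IntBuffer in = inputBuffer(new int[] {1, 2, 3});
        System.out.println(in.remaining() + " ints staged, direct=" + in.isDirect());
    }
}
```

Supporting another element type would mean a matching pair of methods, for example built on `asFloatBuffer()`.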
Executing
In this step the kernel prepared in step 1 is run with the buffers set up in step 2. Arguments are set and the operation is scheduled on the device’s queue, with a reference to the resulting future event being saved.
Status: Done
Retrieving Results
After execution, the CPU must wait for the event to complete. Then it copies the result back to the host’s memory. We then create a new PList (GPU-backed dynamic array) based on the computed results.
Status: Done
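The last two steps follow a schedule-then-wait pattern. The sketch below imitates it on the CPU: a `CompletableFuture` plays the role of the saved event and a plain `List` stands in for PList. All names are hypothetical, and the doubling is hard-coded only to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class ExecuteAndRetrieveSketch {
    // "Executing": schedule the work asynchronously and keep the handle,
    // which plays the role of the future event.
    static CompletableFuture<int[]> enqueue(int[] input) {
        return CompletableFuture.supplyAsync(() -> {
            int[] out = new int[input.length];
            for (int i = 0; i < input.length; i++) {
                out[i] = input[i] * 2;
            }
            return out;
        });
    }

    // "Retrieving results": block until the event completes, then copy
    // the values into a host-side list (a stand-in for PList).
    static List<Integer> retrieve(CompletableFuture<int[]> event) {
        int[] raw = event.join();
        List<Integer> result = new ArrayList<>(raw.length);
        for (int v : raw) {
            result.add(v);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(retrieve(enqueue(new int[] {1, 2, 3})));
    }
}
```

Keeping the event around instead of waiting immediately leaves room for the host to do other work while the device is busy.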
What’s next?
- Add the other types.
- I’d like to explore the use of reflection for generating the OpenCL source code (this will take some time).
Initial Architecture
Æminium’s current architecture is split in two: a compiler and a runtime. I am one of the initial developers of the runtime. The compiler makes use of the Plaid framework and is currently under development. It is expected to generate valid Java code that makes calls to the Æminium runtime, a set of libraries that schedules tasks on the CPUs according to the dependencies between them. The runtime works on its own, but will probably need modifications once we get Æminium code running there.
Since the compiler is still under development, it was decided to work on the Java side for now. So, in order to check whether the code can run on the GPU [1], a pre-compiler will convert Java to Java, replacing some lambdas with OpenCL code where possible.
For its implementation we chose to use Polyglot 1 since we needed Java 1.5’s generics support. It allows me to make changes to the AST to replace certain chunks of code.

That code will be scheduled at runtime using a library that will not only provide the bindings to the OpenCL driver, but also offer a set of high-level programming primitives so that programmers can use data-parallel features at a higher level.
For this library, I am using JavaCL, which provides low-level access to the drivers. It allows me to run a string of OpenCL code and manage the memory transfers between the host (CPU) and the GPUs.
[1] OpenCL has some restrictions on the available datatypes and operators, which have to be checked.
Introduction
Hello!
I am Alcides Fonseca and I will be your host. I will use this tumblr as the weekly log for the development of my Master’s thesis on bringing the computational power of GPUs to the Æminium programming language.
About me
I’m in the second (and last) year of my Master’s in Informatics Engineering at the University of Coimbra. I’ve been programming for a long time, and Compilers and Programming Language Design are one of my main interests in Computer Engineering (the other being HCI; weird combination, I know).
About Æminium
Æminium is both the name of the research project I am part of, and the name of the language that is supposed to come out. The idea came from Sven and it’s part of his PhD Thesis.
The main idea of Æminium is to have concurrency by default. By that I mean that everything that can be done in parallel will be (well, only if it makes it faster). Take for example the following code:
Lines 1 and 2 can be run at the same time, and when they are both done, line 3 can be executed. That’s the main idea. “Well, what if long_function and another_long_function both change some global variable?”, you might ask. I didn’t say it was easy. The language has the concepts of data groups and permissions, which make this possible and make it easier for programmers to write concurrent code.
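Written by hand in plain Java, the schedule that Æminium is meant to derive automatically looks roughly like the sketch below. The two function names come from the question above; their bodies, the return values and the final sum are invented for illustration.

```java
import java.util.concurrent.CompletableFuture;

class ImplicitParallelismSketch {
    // Placeholder bodies; the real functions would do expensive work.
    static int longFunction() {
        return 20;
    }

    static int anotherLongFunction() {
        return 22;
    }

    static int combined() {
        // The two calls do not depend on each other, so both are started at once.
        CompletableFuture<Integer> a =
                CompletableFuture.supplyAsync(ImplicitParallelismSketch::longFunction);
        CompletableFuture<Integer> b =
                CompletableFuture.supplyAsync(ImplicitParallelismSketch::anotherLongFunction);
        // The third step needs both results, so it runs only when both are done.
        return a.thenCombine(b, Integer::sum).join();
    }

    public static void main(String[] args) {
        System.out.println(combined());
    }
}
```

In Java the programmer has to spell out the dependencies with futures; the point of concurrency by default is that they are inferred from the code instead.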
About My Thesis
The main goal of my thesis is to extend Æminium to exploit the computational power of GPUs. The parallelism in Æminium is at the level of operations and expressions, while GPUs are good at data parallelism (a lot of data going through the same kind of operation).
Stay tuned if you’re interested in this topic.