Introducing Wrek - A Miniature Erlang Graph Engine

Richard Kallos 26.02.2018

Wrek is an Erlang library I wrote for concurrently executing task dependency graphs. It's intended purpose
is to run a set of pre-defined tasks that have a partial ordering between them. In this post, I explain why I
wrote wrek, what it can be used for, and how to use it.

Motivation

Wrek emerged as the result of two distinct forces. First, my amateurish enjoyment of graph theory made me
try to see graphs wherever I could, laying the conceptual foundation for this library. Second, I realized that a
project I was working on at Adgear would benefit from such a library, causing me to finally start writing
Wrek.

Conceptual

The graph is a pervasive data structure in computing. When I learned about graph algorithms in school, I
was amazed by their broad applications. Graphs play essential roles in compilers and build systems.
Various graph algorithms form the backbone of how we communicate with each other over the internet.

Our everyday lives tend to be filled with lists. We have to-do lists, shopping lists, recipes, checklists,
instructions, and much, much more. One day, I realized that some of these lists are deceiving. Some of
these lists are actually graphs hiding in list's clothing; these lists are topological orderings of directed acyclic
graphs
. Two of the more obvious cases of this are to-do lists and recipes. You can't send a letter that you
haven't written yet, nor can you cook spaghetti if you haven't boiled a pot of water.

Once the sneaky graphs were discovered, I spent some time thinking about how to benefit from treating
various lists like the DAGs that they secretly are. The clearest analogies between these lists and DAGs
were to-do lists and recipes. These very closely resembled dependency graphs. Unlike in their list
representation, these dependency graphs show which vertices (individual tasks) can be executed
concurrently. Sometimes this is obvious (e.g: I can chop vegetables while I wait for the pot of water to boil! I
can begin preparing a second baking sheet of cookies while the first batch is in the oven!), but I had my
hopes that the act of reformulating lists as dependency graphs could expose more opportunities for
concurrency.

          Boil water -- Add pasta -- Cook pasta --.
                                                                          \
Purée tomatoes --.                                              \
                               \                                              \
Chop vegetables -- Combine in saucepan -- Simmer sauce -- Combine pasta and sauce -- Serve
                         /
Add spices --'

Edges between vertices in the above graph express a partial ordering. Vertices with no path between them
can be executed concurrently. The relationship between vertices in the graph reflects the truth in the
kitchen. We can cook the pasta while our sauce is simmering. It doesn't matter whether we chop vegetables
or purée tomatoes first; they both need to be done if we're going to make any sauce.

Despite my best efforts, I am still a disaster in the kitchen. As a means of changing this inconvenient fact, I
thought it would be fun to try representing recipes as directed graphs, with extra data like how long each
step can be expected to take. This sat in my list of 'someday projects' for a long time, and I was only
reminded of it when I begin to think about a project I was working on at $JOB.

Practical

One of the projects I accomplished last year at Adgear was removing an expensive computation that was
taking place on each of our already maxed-out edge servers, and replaced it with a system that performed
the expensive computation on a single machine, then distributed the result. The system worked when
nobody touched it, but it was the programming equivalent of twigs and duct tape; cron jobs and shell
scripts. This system continued to be painful to use, and more examples of expensive computations being
done on CPU-starved machines were cropping up. At this point, it made sense start writing a more robust
system.

These expensive computations decomposed nicely into lists of steps to be performed. Fetch this data,
transform it, transform it some more, send some of the data to this group of servers, send this other data to
some other group of servers. Shell scripts are pretty good at encoding these pipelines, so it was an
acceptable choice at the time of implementation. After a while, it dawned on me that these lists of steps
were not really lists; they were dependency graphs. I tried to expose the latent concurrency in the shell
scripts by spicing them up with some background jobs and waitpid , but it was decided that it would
make more sense to switch to Erlang and benefit from all that OTP has to offer.

Let's Get 'Wrek' ed (I'm sorry. I couldn't resist.)

Wrek accepts dependency graphs like the above spaghetti recipe as input, and executes each vertex. The
nature of concurrently executing actions as soon as they are able to execute is generic to any problem that
can be represented by a dependency graph. The structure of a dependency graph, as well as the tasks
involved in each of the graph's vertices are specific to the user's wishes. Following the same
generic/specific split as OTP; the generic portion of this class of problems is solved by wrek and
wrek_vert modules. The specific portion is provided by the user in the form of an Erlang map
describing each vertex in the graph, and a set of callback modules that implement the wrek_vert
behaviour.

The wrek_vert behaviour consists of a single callback run/2, where the first argument is a list of
arguments to be sent to the callback function, and the second argument is the ID of a process which can
provide information generated by other vertices. The expected result of this callback function is either
{ok, Any :: any()} or {error, Reason :: any()} . If the callback function succeeds, Any
will be taken by wrek and made available to other wrek_vert processes. If the callback function
crashes or returns an error, the entire graph will be shut down.

Making Erlang Pasta

Following the contrived example above, let's set about using wrek to make pasta. Of course, our program
won't really make pasta, but it's output ought to fool somebody.

Looking at the above graph, each step seems to do one of three things: 1. Add an ingredient 2. Combine
ingredients in a vessel 3. Do something to ingredients in a vessel

Let's go ahead write some wrek_vert s for each of those actions. If you don't feel like following along in
a text editor, the full code can be found here.

-module(cook_add).
-behaviour(wrek_vert).
-export([run/2]).
run([Ingredient, Quantity], _Pid) ->
io:format("adding ~s. amount: ~s.~n", [Ingredient, Quantity]),
{ok, #{added => [{Ingredient, Quantity}]}}.

That's all for cook_add . It prints a message, then produces a map with a key added whose value is a
proplist with a single pair.

-module(cook_heat).
-behaviour(wrek_vert).
-export([run/2]).
run([Verb, Noun], _Pid) ->
io:format("~ping ~p.~n", [Verb, Noun]),
{ok, #{}}.

cook_heat is also pretty short. It's also very abstract. It could be used to print a message about Verb ing any Noun , not just cooking ingredients!

Our final callback module is a little longer, because it does a little bit more than printing a message.

-module(cook_combine).
-behaviour(wrek_vert).
-export([run/2]).
run([Ingredients, Vessel], Pid) ->
Fun = fun(Step, Acc) ->
Stuff = wrek_vert:get(Pid, Step, added),
io:format("combining ~p with ~p in ~p.~n", [Stuff, Acc, Vessel]),
Stuff ++ Acc
end,
Stuff = lists:foldl(Fun, [], Ingredients),
io:format("~p now contains: ~p.~n", [Vessel, Stuff]),
{ok, #{added => Stuff}}.

Ingredients is expected to be a list of vertex names. We finally use our parent process's Pid as an
argument to wrek_vert:get/3. This lets us consume the data produced by the cook_add callback
module. After combining everything, we returns a new collection of ingredients.

Alright! We're nearly done describing the specific portion of our problem! The last step is to represent our
dependency graph in terms of these callback modules and the arguments we want to pass to them.

-module(wrek_example).
-export([make_pasta/0]).
make_pasta() ->
application:ensure_all_started(wrek),
Graph = #{
tomatoes => #{
module => cook_add,
args => ["pureed tomatoes", "1 can"],
deps => []
},
vegetables => #{
module => cook_add,
args => ["chopped vegetables", "lots"],
deps => []
},
spices => #{
module => cook_add,
args => ["spices", "to taste"],
deps => []
},
saucepan => #{
module => cook_combine,
args => [[tomatoes, vegetables, spices], saucepan],
deps => [tomatoes, vegetables, spices]
},
simmer_sauce => #{
module => cook_heat,
args => [simmer, sauce],
deps => [saucepan]
},
boil_water => #{
module => cook_heat,
args => [boil, water],
deps => []
},
add_pasta => #{
module => cook_add,
args => ["pasta", "1 handful"],
deps => [boil_water]
},
cook_pasta => #{
module => cook_heat,
args => [cook, pasta],
deps => [add_pasta]
},
mix_pasta_with_sauce => #{
module => cook_combine,
args => [[saucepan, add_pasta], saucepan],
deps => [simmer_sauce, cook_pasta]
}
},
wrek:start(Graph).

That's an eyeful! What we've done here is created an Erlang map whose keys represent the names of
vertices in our original dependency graph, and whose values are maps that specify a callback module,
arguments to pass to the callback module, as well as any dependencies the vertex may have. I
congratulate those of you who noticed that we are never straining the pasta; I did confess to being a
disaster in the kitchen. I promise I learned my lesson.

Alright, we're done coding! Let's start up a shell and make some pasta!

$ rebar3 shell
[...]
1> wrek_example:make_pasta().
adding spices. amount: to taste.
adding puréed tomatoes. amount: 1 can.
adding chopped vegetables. amount: lots.
adding pasta. amount: 1 handful.
{ok,<0.133.0>}
2> boiling water.
cooking pasta.
combining [{"puréed tomatoes","1 can"}] with [] in saucepan.
combining [{"chopped vegetables","lots"}] with [{"puréed tomatoes","1 can"}] in sa
ucepan.
combining [{"spices","to taste"}] with [{"chopped vegetables","lots"},
{"puréed tomatoes","1 can"}] in saucepan.
saucepan now contains: [{"spices","to taste"},
{"chopped vegetables","lots"},
{"puréed tomatoes","1 can"}].
simmering sauce.
combining [{"spices","to taste"},
{"chopped vegetables","lots"},
{"puréed tomatoes","1 can"}] with [] in saucepan.
combining [{"pasta","1 handful"}] with [{"spices","to taste"},
{"chopped vegetables","lots"},
{"puréed tomatoes","1 can"}] in saucepan.
saucepan now contains: [{"pasta","1 handful"},
{"spices","to taste"},
{"chopped vegetables","lots"},
{"puréed tomatoes","1 can"}].

It isn't pretty, but neither is most of my cooking. However, we've just used wrek to teach Erlang how to
concurrently prepare pasta! If you run wrek_example:make_pasta() over and over, you may notice
that the order of the output changes, but never enough that the steps followed in order don't produce valid
pasta.

Uses for wrek, additional features, and plans for the future

I believe that Wrek is useful for more than pretending to make pasta. I used Wrek at $JOB to concurrently execute various small shell scripts, collect their output, and pass up to Erlang through and down to other
small shell scripts.

Wrek also contains features that I haven't covered in this article. For instance, you can pass a
gen_event process to wrek:start/2 to get Wrek to call gen_event:notify/2 when various
events take place. This gives users a way to monitor the execution of their graphs. Combined with Sergey
Aleynikov's erlexec library, you can collect stdout and stderr from any shell command you run
in a vertex using wrek_exec:exec/4 . I invite interested readers to take a peek at the source code for
Wrek.

Wrek is still a very young project, and I am still inexperienced at writing libraries. I am virtually certain that
the API is going to go through significant changes in the near future. I believe this is for the best, because
the whole process of specifying callback modules and passing data between vertices can be improved
dramatically. Here are some of the ideas I have for improving Wrek:

  • Add support for limiting the number of active wrek_vert processes
  • Avoid forced namespacing by vertex names, perhaps through the use of ETS tables as a tuple space
  • (experimental) Describing dependency graphs by annotating vertices by the data they produce and
  • consume, rather than by explicitly naming other vertices as dependencies.

Conclusion

Join Richard Kallos, developer at AdGear, at CodeBEAM SF where he will be giving a talk on this subject.

Richard Kallos
Author

Richard Kallos

By day, Richard Kallos writes code for AdGear. By night, he works towards his Master's degree in Computer Science. Richard's professional career started in the summer of 2016, where he was hired as an intern at AdGear to work on optimizing an in-house compiler. He continues to work at AdGear as a member of the backend team surrounded by brilliant and supportive coworkers.