Optimizing Tasks in F#
I recently went through an optimization exercise. As part of Darklang's rewrite in F#, I needed to make sure that my new code was as fast as the old code - or at least, not horrendously slow.
A quick load test revealed the worst: the new code was absolutely, pathetically slow. The major difference—apart from the old code being in OCaml and the new code being in F#—was that the new code was async, using .NET Tasks. Clearly I was using them wrong. So I spent a lot of time learning about Tasks, and how to optimize computation using them. That's what this post is about. If you're into numbers and benchmarks, there's a lot here.
Quick context: Darklang allows you to build backends without any infra or deployment complexity. It does so by combining a programming language, a cloud, and an editor. As a result, you don't need to provision servers, DBs or queues, and deployment to Dark is instant.
The part of Dark that I'm discussing today is the interpreter, so really the heart of the language. The interpreter is about 500 lines, and has Tasks fully baked through it.
Were Tasks the cause of the slowness?
As a quick aside though, Tasks were not actually to blame. I made a benchmark which showed that the OCaml version was finishing in 1ms, while the new F# version was finishing in 900ms on average - which is why I suspected Tasks. But the actual culprit was much more benign.
You see, Dark automatically traces all impure function calls in your programs, saving them to inspect in the editor. I accidentally marked all functions as impure, so every call to, for example, `==` or `%` was causing DB calls. Whoops!
Fixing this got me a 100x improvement, and the F# version now runs in 9ms. But, since I initially thought the problem was Tasks, I learned a lot about Tasks, so here we go.
Overview of Tasks in F#
A Task is how async is done in .NET. It is used to implement the C# `async` keyword, which is extremely similar to the `async` keywords in JS, Python, Rust, etc. Whenever you've got some IO that needs to be done, `async` pauses the current execution while the IO runs, allowing the CPU to be used elsewhere.

Note that F# has a construct called `async`, and I'm not talking about that at all. Anywhere in this post, even if I use the word async, I'm referring to Tasks.

Using Tasks is pretty easy: they're provided by lots of different APIs. To use them in C#, you use the `await` keyword. To use them in F#, you use the `task` computation expression. This means your code looks like this:
```fsharp
let rec evalAsync (env : Env) (st : Symtable) (e : Expr) : Task<Dval> =
  task {
    match e with
    | EInt i -> return DInt i
    | EString s -> return DString s
    | ELet (lhs, rhs, body) ->
      let! rhs = evalAsync env st rhs
      let st = st.Add(lhs, rhs)
      return! evalAsync env st body
    | EFnCall (desc, args) ->
      match tryFindFn desc with
      | Some fn ->
        let! args = List.mapAsync args (evalAsync env st)
        return! callFnAsync env fn args
      | None -> return (err (NotAFunction desc))
  }
```
The `task` construct introduces `let!`, `return` and `return!`. `let! x = myTask` gets compiled into something morally equivalent to `myTask.Bind(fun completed -> ...)`, so it's functionally the same concept as `await`. `return` wraps a non-Task in a Task, and `return!` just passes an existing Task onward [1].

Most `await` implementations automatically turn straight-line-looking code into optimized versions of it; for example, C#'s `await` rewrites the method into a state machine. The `task` CE is a similar idea, rewriting the code using callbacks and continuations [2].
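To make that concrete, here's a rough sketch of the shape this rewriting takes. This is a simplification, not the actual codegen: `ContinueWith` stands in for the generated continuation machinery, and I'm using the `task` CE just for illustration.

```fsharp
open System.Threading.Tasks

// What you write with the CE:
let viaCE (myTask : Task<int>) : Task<int> =
  task {
    let! x = myTask   // "await" the task
    return x + 1      // wrap the result back up in a Task
  }

// Morally what it becomes: a continuation attached to the task.
// (Simplified: the real codegen also handles exceptions,
// already-completed tasks, sync contexts, etc.)
let viaContinuation (myTask : Task<int>) : Task<int> =
  myTask.ContinueWith(fun (t : Task<int>) -> t.Result + 1)
```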
Wrap everything in Tasks
To get a sense of the performance cost of Tasks, I benchmarked an implementation with no Tasks against one with Tasks. The benchmark was one I made a few months ago to ensure that the Dark rewrite would be fast enough. It's a cut-down version of the Darklang interpreter, running a simple FizzBuzz program:
```fsharp
List.range 1 100
|> List.map
     (fun v ->
       if v % 15 = 0 then "FizzBuzz"
       elif v % 3 = 0 then "Fizz"
       elif v % 5 = 0 then "Buzz"
       else toString v)
```
Feel free to read the actual benchmarks, which have lots of different versions of the interpreter, with different attempts at optimization, most of which are discussed below.
Here's the implementation with no Tasks, versus a naive task implementation:
| | Requests/sec | Avg req duration | % of base case |
| --- | --- | --- | --- |
| No async | 14824.56 | 7.00ms | 100% |
| Tasks | 6561.37 | 15.44ms | 44% |
As you can see, this is pretty bad!
Avoiding creating unnecessary Tasks
But of course, there's no need to create Tasks when we don't need them. You can see from the code above that I'm wrapping everything in Tasks: integers, strings, etc., even if the code I'm executing has no async at all. In Dark, the only things that need async are making HTTP calls, touching the DB, and sending things to a queue. So we should find ways to avoid creating Tasks when we don't need them.
Optimization 1: sync functions
The first optimization to look at is the "sync caller" optimization. If we're calling a function which needs async, we do the full async thing. If we don't, no need to create a Task only to unwrap it later. Since the benchmark is mostly synchronous functions like `%` or `==` or `toString`, this should have a big effect.
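The shape of the optimization, as a minimal sketch (the `Fn` record and its fields here are hypothetical stand-ins, not Dark's actual types):

```fsharp
open System.Threading.Tasks

// Hypothetical stand-ins for illustration, not Dark's actual types
type Dval = DInt of int
type Fn =
  { isAsync : bool
    syncImpl : Dval list -> Dval           // pure/synchronous functions
    asyncImpl : Dval list -> Task<Dval> }  // IO-performing functions

// The "sync caller" optimization: only go through Task machinery
// when the called function actually needs async; otherwise compute
// the value directly and wrap it once with Task.FromResult.
let callFn (fn : Fn) (args : Dval list) : Task<Dval> =
  if fn.isAsync then fn.asyncImpl args
  else Task.FromResult(fn.syncImpl args)
```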
| | Requests/sec | Avg req duration | % of base case |
| --- | --- | --- | --- |
| No async | 14824.56 | 7.00ms | 100% |
| Tasks | 6561.37 | 15.44ms | 44% |
| Tasks w/ sync caller opt | 7557.61 | 13.26ms | 51% |
A solid improvement, but still not great. Can we do even better?
Optimization 2: Hidden DTask
If it's expensive to wrap every `Dval` (Dark's value type) in a Task, let's try the other way around. Let's just add a Task to the `Dval` instead!
```fsharp
type Dval =
  | DInt of int
  | DStr of string
  | DFloat of float
  // etc
  | DTask of Task<Dval>
```
With this version, we only need to actually use a Task when we really use a Task. Everything else should perform at the same speed. With this, we don't even need the sync caller optimization.
| | Requests/sec | Avg req duration | % of base case |
| --- | --- | --- | --- |
| No async | 14824.56 | 7.00ms | 100% |
| Tasks | 6561.37 | 15.44ms | 44% |
| Tasks w/ sync caller opt | 7557.61 | 13.26ms | 51% |
| Hidden DTask | 13082.62 | 7.98ms | 88% |
Wow! That's nearly as fast as not having Tasks at all! Which makes sense considering the implementation.
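Concretely, the cheap path stays Task-free, and only consumers that actually hit a `DTask` pay the async cost. Here's a minimal sketch of the forcing helper a caller would need (hypothetical, not Dark's actual code):

```fsharp
open System.Threading.Tasks

// Cut-down Dval with the hidden Task case, for illustration
type Dval =
  | DInt of int
  | DStr of string
  | DTask of Task<Dval>

// Any caller that needs a concrete value must remember to force a
// possible DTask first - forgetting to do so is exactly the
// type-safety hole with this design.
let rec force (dv : Dval) : Task<Dval> =
  match dv with
  | DTask t ->
    task {
      let! inner = t
      return! force inner // a DTask may wrap another DTask
    }
  | value -> Task.FromResult value
```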
However, this optimization kinda sucks in other ways. Firstly, it's part of the `Dval`, so it's not generalizable - I can't use it to optimize other parts of the codebase that don't use a `Dval` directly. Secondly, it's not type-safe - it would be easy to accidentally lose a Task somewhere, and not get its value.
This is particularly a problem because I actually want to ensure that all Tasks complete before new code is executed. If you have two DB calls, let's say a `DB::set` and then a `DB::get`, I want to make sure the `DB::set` completes before I start the `DB::get`. With the `DTask` implementation, that's extremely hard to get right, and it would require checking almost every usage in the codebase.
Optimization 2b: TaskOrValue
Ok, can we do this in a type-safe way? Yes we can, with this simple F# construct:
```fsharp
type TaskOrValue<'a> =
  | Value of 'a
  | Task of Task<'a>
```
Perfect! If there's a simple value, we just keep it with no Task overhead. And if there truly is a Task, it can be passed down to whatever is using it.
And to make this even better, I made a `taskv` construct to make this seamless. This allowed me to use `TaskOrValue`s everywhere safely, just by replacing `task` with `taskv`. I'm not going to go into how `taskv` was implemented, except to say that it uses F#'s Computation Expressions, in a very similar way to `task`. The implementation is in the Dark repo.
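The core of a `taskv`-style CE is a bind that stays on the cheap path until a real Task appears. Here's a minimal sketch (my reconstruction, not Dark's actual implementation):

```fsharp
open System.Threading.Tasks

type TaskOrValue<'a> =
  | Value of 'a
  | Task of Task<'a>

// Bind only allocates Task machinery when a real Task is present;
// chains of plain Values never touch a Task at all.
let bind (f : 'a -> TaskOrValue<'b>) (tv : TaskOrValue<'a>) : TaskOrValue<'b> =
  match tv with
  | Value v -> f v
  | Task t ->
    Task(
      task {
        let! v = t
        match f v with
        | Value result -> return result
        | Task t2 -> return! t2
      })
```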
| | Requests/sec | Avg req duration | % of base case |
| --- | --- | --- | --- |
| No async | 14824.56 | 7.00ms | 100% |
| Tasks | 6561.37 | 15.44ms | 44% |
| Tasks w/ sync caller opt | 7557.61 | 13.26ms | 51% |
| Hidden DTask | 13082.62 | 7.98ms | 88% |
| TaskOrValue | 5081.87 | 19.58ms | 34% |
Um, that's worse! Shit, that's a lot worse! And it's worse than the `Task` version. Maybe the sync caller optimization works here?
| | Requests/sec | Avg req duration | % of base case |
| --- | --- | --- | --- |
| No async | 14824.56 | 7.00ms | 100% |
| Tasks | 6561.37 | 15.44ms | 44% |
| Tasks w/ sync caller opt | 7557.61 | 13.26ms | 51% |
| Hidden DTask | 13082.62 | 7.98ms | 88% |
| TaskOrValue | 5081.87 | 19.58ms | 34% |
| TaskOrValue w/ sync caller opt | 8620.01 | 11.87ms | 58% |
OK, now we're getting somewhere. But it's still not much of an improvement.
Optimization 2c: ValueTask
It turns out .NET has a built-in version of my `TaskOrValue`, called a `ValueTask`. There's a lot to be said about ValueTasks, but I'll give you the basics:

- They are the same concept as a `TaskOrValue`
- They allow wrapping a non-Task very cheaply
- They're structs, so they go on the stack, not the heap
- They reduce garbage collection overhead
- BUT: you can only use them once!

Let's compare TaskOrValues to ValueTasks quickly:

- ValueTasks are on the stack, TaskOrValues are on the heap (though I briefly tried moving them to the stack using the `Struct` annotation, I don't remember what the outcome was)
- TaskOrValues can be used multiple times, ValueTasks can be used only once
Only using them once is an odd—and somewhat unfortunate—restriction: if you do this wrong, it will fail at run-time. This means that you can't do this:
```fsharp
task {
  let myTask = MethodThatReturnsAValueTask()
  let! y = myTask
  ... // do some other shit
  let! z = myTask // BAD THINGS HAPPEN HERE
  ...
}
```
What would happen? Well, when you await `myTask` the first time, it returns the value but also gives up the underlying memory, which may be reused before the second usage.
(I understand that sort of thing makes sense in C#, but when you're trying to use a language with immutable values like F#, this kinda sucks. Especially because there's no type checker warning you about this, though I wonder if it would be possible to create an F# Analyzer that detects this.)
Fortunately, if you're using an F# computation expression, it's actually pretty hard to use a Task twice.
Actually using ValueTasks in F#
The biggest challenge I had with using ValueTasks is that the documentation is, well, extremely challenging to understand, so here's what I learned (please correct me if I'm wrong):
- The highest performing Task CEs for F# are in Ply.
- If you just want to use regular Tasks in F#, use Ply's `task` computation expression. If you want to put a value in a Task, you can do `task { return value }` or `Task.FromResult value`. This is an actual .NET Task, so nothing new here. (This is what I used in the benchmarks so far.)
- If you want it to be as fast as possible, use the `uply` CE, which uses a `Ply<'a>` type. A Ply is similar to a Task but lower overhead. To put a value in a Ply outside a `uply` block, you call `Ply.Ply value`. A Ply is similar to a ValueTask: do not use it twice [3]. You can convert a Ply to a Task using `Ply.TplPrimitives.runPlyAsTask`, or by doing `task { return! uply { return 1 } }`. Tasks can also be used directly in a `uply` using the `let!` syntax.
- If you want to stay a little closer to the main path, but want to use a ValueTask, use `vtask` or `uvtask`. The difference between the two is that the second does not use an execution bubble, which gives you access to Async-local variables. If you don't need that, as I understand it, it seems safe to just use `uvtask` [4]. Remember that it's a ValueTask: do not use it twice. You can convert ValueTasks to full Tasks using `.AsTask()`.
- Native built-in Tasks are coming in F# 6 (note that these benchmarks do NOT use the new F# 6 Task CEs).
Let's measure:
| | Requests/sec | Avg req duration | % of base case |
| --- | --- | --- | --- |
| No async | 14824.56 | 7.00ms | 100% |
| Tasks | 6561.37 | 15.44ms | 44% |
| *Tasks | 7557.61 | 13.26ms | 51% |
| Hidden DTask | 13082.62 | 7.98ms | 88% |
| TaskOrValue | 5081.87 | 19.58ms | 34% |
| *TaskOrValue | 8620.01 | 11.87ms | 58% |
| *ValueTask (vtask) | 8252.49 | 12.82ms | 55% |
| *Unsafe ValueTask (uvtask) | 11780.29 | 8.63ms | 79% |
| *Ply (uply) | 13530.62 | 7.68ms | 91% |
| Ply (uply) | 13568.88 | 7.54ms | 91.5% |

* indicates using the sync caller optimization
So we can see here that removing the execution bubble compounds with the move to a ValueTask, and going to Ply gets us nearly to the performance of not having async at all. It also doesn't have the safety problem of the Hidden DTask, while being simpler and more performant!

An interesting point here is that with Ply, there is no longer a benefit to using the sync caller optimization. Creating a Ply must be so cheap that having a separate code path to avoid it isn't worth it.
Disclaimer
Let me finish by saying that these benchmarks are at best directional. I didn't run them a million times, remove top and bottom results, make sure my laptop wasn't running a video, etc. Also, they were run in Docker on mac, and virtualization isn't great for benchmarks. I observed different runs being 10% off frequently. I even reported to two decimal digits results that absolutely do not represent that level of accuracy! So don't pin too much on these.
Sprinkle on some flame bait
And so with that disclaimer, let's compare this to Rust and OCaml! To make these even more unsound, I ran those benchmarks on my Mac a while ago instead of in my Docker container today, so I took the old results and scaled them. There wasn't a great way to scale them, given that the "scaling factors" of the 3 benchmarks that ran in both datasets were 51%, 63% and 64%, so I averaged them to 59%. So the reqs/s presented for OCaml and Rust here are the old numbers, scaled by 59%. Meanwhile, the % of base case uses the reqs/s divided by the base case reported in the old results:
| | Requests/sec | Avg req duration | % of base case |
| --- | --- | --- | --- |
| No async | 14824.56 | 7.00ms | 100% |
| Tasks | 6561.37 | 15.44ms | 44% |
| *Tasks | 7557.61 | 13.26ms | 51% |
| Hidden DTask | 13082.62 | 7.98ms | 88% |
| TaskOrValue | 5081.87 | 19.58ms | 34% |
| *TaskOrValue | 8620.01 | 11.87ms | 58% |
| *ValueTask (vtask) | 8252.49 | 12.82ms | 55% |
| *Unsafe ValueTask (uvtask) | 11780.29 | 8.63ms | 79% |
| *Ply (uply) | 13530.62 | 7.68ms | 91% |
| Ply (uply) | 13568.88 | 7.54ms | 91.5% |
| OCaml (sync-only) | 7252 | ?? | 53% |
| OCaml async | 5803 | ?? | 42.3% |
| Rust sync only | 7000 | ?? | 51% |

* indicates using the sync caller optimization
I mentioned above that the OCaml code runs in 1ms and the F# code runs in 9ms. Those aren't apples to apples comparisons. That said, neither are these benchmarks, as the OCaml implementation is single-threaded, while the F# implementation is not (and the benchmark machine has 8 cores).
I'm sure the Rust and OCaml numbers could improve if people spent as much time on them as I did on the F# numbers (please, feel free!), especially since F# is on the latest .NET 6 and the others are on old, no doubt less optimized, versions.
Ad
You should try Darklang.
Additional reading
Want to learn more about Tasks? Here are some things I read to find my way here:
- It seems that Tasks are coming natively to F# 6! Check out the RFC and the recently merged PR
- The source code for Ply
- Info about ValueTasks
- I didn't actually use this, but I enjoyed reading it! Writing High Performance F# code
Nino Floris, who wrote Ply, took the time to write detailed notes, which are so good I'm going to include them verbatim (I've also updated the copy where relevant):
[1] 'Technically' it binds to the task and stores any result as its own, passing itself onward.
[2] C# rewrites an entire method with N yield points into one big state machine (in F#, limited by CE codegen, these are instead N continuations + allocs, extra method call overhead, etc.). F# 6 will see similar functionality as C# through the new resumable code functionality.
[3] In this case it's really not advised to use it twice, as it's not wrapping a Task but an awaitable, a somewhat lower-level thing meant to represent a single fiber of computation plus associated state. Reusing a constant Ply.Ply value is fine though.
[4] it has to do with how things like AsyncLocals (and some other things like sync contexts) are scoped. A normal C# async await method sets up a scope/bubble such that any changes are reverted once that async method completes. Contrast that with normal sync methods where changes to an AsyncLocal will leak and stick around.
The problem is, for us to opt into that behavior we have to pass the CE's run function to a constructor which forces the function to become an actual function value (FSharpFunc derived class) allocated on the heap. All the 'u' versions of the CEs avoid that particular allocation with the tradeoff that any async local changes before the first yield now leak like a normal sync method.
You can see the difference in this tweet by David Fowler https://twitter.com/davidfowl/status/1296975203403042816
The benchmarks
- No async
- Tasks
- Tasks w/ sync opt
- Hidden DTask
- TaskOrValue
- TaskOrValue w/ sync opt
- ValueTask (vtask) w/ sync opt
- Unsafe ValueTask (uvtask) w/ sync opt
- Ply (uply) w/ sync opt
- Ply (uply)
- Rust (sync only)
- OCaml (sync)
- OCaml async
Another ad
You should try Darklang. I wouldn't say it's exceptionally good yet, but it's got a bunch of ideas you don't see anywhere else in programming. Also you'll end up on the mailing list so I can send you more blog posts like this one.
Thanks to Phillip Carter and Nino Floris for providing feedback on drafts of this post!