Optimizing Tasks in F#
I recently went through an optimization exercise. As part of Darklang's rewrite in F#, I needed to make sure that my new code was as fast as the old code - or at least, not horrendously slow.
A quick load test revealed the worst: the new code was absolutely, pathetically slow. The major difference—apart from the old code being in OCaml and the new code being in F#—was that the new code was async, using .NET Tasks. Clearly I was using them wrong. So I spent a lot of time learning about Tasks, and how to optimize computation using them. That's what this post is about. If you're into numbers and benchmarks, there's a lot here.
Quick context: Darklang allows you to build backends without any infra or deployment complexity. It does so by combining a programming language, a cloud, and an editor. As a result, you don't need to provision servers, DBs or queues, and deployment to Dark is instant.
The part of Dark that I'm discussing today is the interpreter, so really the heart of the language. The interpreter is about 500 lines, and has Tasks fully baked through it.
Were Tasks the cause of the slowness?
As a quick aside though, Tasks were not actually to blame. I made a benchmark which showed that the OCaml version was finishing in 1ms, while the new F# version was finishing in 900ms on average - which is why I suspected Tasks. But the actual culprit was much more benign.
You see, Dark automatically traces all impure function calls in your programs, saving them to inspect in the editor. I accidentally marked all functions as impure, so every call to, for example, `==` or `%` was causing DB calls. Whoops!
Fixing this got me a 100x improvement, and the F# version now runs in 9ms. But, since I initially thought the problem was Tasks, I learned a lot about Tasks, so here we go.
Overview of Tasks in F#
A Task is how async is done in .NET. It is used to implement the C# `async` keyword, which is extremely similar to the `async` keywords in JS, Python, Rust, etc. Whenever you've got some IO that needs to be done, `async` pauses the current execution while the IO runs, allowing the CPU to be used elsewhere.

Note that F# has a construct called `async`, and I'm not talking about that at all. Anywhere in this post, even if I use the word async, I'm referring to Tasks.

Using Tasks is pretty easy: they're provided by lots of different APIs. To use them in C#, you use the `await` keyword. To use them in F#, you use the `task` computation expression. This means your code looks like this:
```fsharp
let rec evalAsync (env : Env) (st : Symtable) (e : Expr) : Task<Dval> =
  task {
    match e with
    | EInt i -> return DInt i
    | EString s -> return DString s
    | ELet (lhs, rhs, body) ->
      let! rhs = evalAsync env st rhs
      let st = st.Add(lhs, rhs)
      return! evalAsync env st body
    | EFnCall (desc, args) ->
      match tryFindFn desc with
      | Some fn ->
        let! args = List.mapAsync args (evalAsync env st)
        return! callFnAsync env fn args
      | None -> return (err (NotAFunction desc))
  }
```
The `task` construct introduces `let!`, `return` and `return!`. `let! x = myTask` gets compiled into something morally equivalent to `myTask.Bind(fun completed -> ...)`, so it's functionally the same concept as `await`. `return` wraps a non-Task in a Task, and `return!` just passes an existing Task onward [1].

Most `await` implementations automatically turn straight-line-looking code into optimized versions of it; for example, C#'s `await` rewrites the method into a state machine. The `task` CE is a similar idea, rewriting the code using callbacks and continuations [2].
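To make that concrete, here's a rough sketch of the shape this rewriting takes. This is a simplification, not the actual codegen: `ContinueWith` stands in for the generated continuation machinery, and I'm using the `task` CE just for illustration.

```fsharp
open System.Threading.Tasks

// What you write with the CE:
let viaCE (myTask : Task<int>) : Task<int> =
  task {
    let! x = myTask   // "await" the task
    return x + 1      // wrap the result back up in a Task
  }

// Morally what it becomes: a continuation attached to the task.
// (Simplified: the real codegen also handles exceptions,
// already-completed tasks, sync contexts, etc.)
let viaContinuation (myTask : Task<int>) : Task<int> =
  myTask.ContinueWith(fun (t : Task<int>) -> t.Result + 1)
```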
Wrap everything in Tasks
To get a sense of the performance cost of Tasks, I benchmarked an implementation with no Tasks against one with Tasks. The benchmark was one I made a few months ago to ensure that the Dark rewrite would be fast enough. It's a cut-down version of the Darklang interpreter, running a simple FizzBuzz program:
```fsharp
List.range 1 100
|> List.map
     (fun v ->
       if v % 15 = 0 then "FizzBuzz"
       elif v % 3 = 0 then "Fizz"
       elif v % 5 = 0 then "Buzz"
       else toString v)
```
Feel free to read the actual benchmarks, which have lots of different versions of the interpreter, with different attempts at optimization, most of which are discussed below.
Here's the implementation with no Tasks, versus a naive task implementation:
| | Requests/sec | Avg req duration | % of base case |
| --- | --- | --- | --- |
| No async | 14824.56 | 7.00ms | 100% |
| Tasks | 6561.37 | 15.44ms | 44% |
As you can see, this is pretty bad!
Avoiding creating unnecessary Tasks
But of course, there's no need to create Tasks when we don't need them. You can see from the code above that I'm wrapping everything in Tasks: integers, strings, etc., even if the code I'm executing has no async at all. In Dark, the only things that need async are making HTTP calls, touching the DB, and sending things to a queue. So we should find ways to avoid creating Tasks when we don't need them.
Optimization 1: sync functions
The first optimization to look at is the "sync caller" optimization. If we're calling a function which needs async, we do the full async thing. If we don't, no need to create a Task only to unwrap it later. Since the benchmark is mostly synchronous functions like `%` or `==` or `toString`, this should have a big effect.
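The shape of the optimization, as a minimal sketch (the `Fn` record and its fields here are hypothetical stand-ins, not Dark's actual types):

```fsharp
open System.Threading.Tasks

// Hypothetical stand-ins for illustration, not Dark's actual types
type Dval = DInt of int
type Fn =
  { isAsync : bool
    syncImpl : Dval list -> Dval           // pure/synchronous functions
    asyncImpl : Dval list -> Task<Dval> }  // IO-performing functions

// The "sync caller" optimization: only go through Task machinery
// when the called function actually needs async; otherwise compute
// the value directly and wrap it once with Task.FromResult.
let callFn (fn : Fn) (args : Dval list) : Task<Dval> =
  if fn.isAsync then fn.asyncImpl args
  else Task.FromResult(fn.syncImpl args)
```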
| | Requests/sec | Avg req duration | % of base case |
| --- | --- | --- | --- |
| No async | 14824.56 | 7.00ms | 100% |
| Tasks | 6561.37 | 15.44ms | 44% |
| Tasks w/ sync caller opt | 7557.61 | 13.26ms | 51% |
A solid improvement, but still not great. Can we do even better?
Optimization 2: Hidden DTask
If it's expensive to wrap every `Dval` (Dark's value type) in a Task, let's try the other way around. Let's just add a Task to the `Dval` instead!
```fsharp
type Dval =
  | DInt of int
  | DStr of string
  | DFloat of float
  // etc
  | DTask of Task<Dval>
```
With this version, we only need to actually use a Task when we really use a Task. Everything else should perform at the same speed. With this, we don't even need the sync caller optimization.
| | Requests/sec | Avg req duration | % of base case |
| --- | --- | --- | --- |
| No async | 14824.56 | 7.00ms | 100% |
| Tasks | 6561.37 | 15.44ms | 44% |
| Tasks w/ sync caller opt | 7557.61 | 13.26ms | 51% |
| Hidden DTask | 13082.62 | 7.98ms | 88% |
Wow! That's nearly as fast as not having Tasks at all! Which makes sense considering the implementation.
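Concretely, the cheap path stays Task-free, and only consumers that actually hit a `DTask` pay the async cost. Here's a minimal sketch of the forcing helper a caller would need (hypothetical, not Dark's actual code):

```fsharp
open System.Threading.Tasks

// Cut-down Dval with the hidden Task case, for illustration
type Dval =
  | DInt of int
  | DStr of string
  | DTask of Task<Dval>

// Any caller that needs a concrete value must remember to force a
// possible DTask first - forgetting to do so is exactly the
// type-safety hole with this design.
let rec force (dv : Dval) : Task<Dval> =
  match dv with
  | DTask t ->
    task {
      let! inner = t
      return! force inner // a DTask may wrap another DTask
    }
  | value -> Task.FromResult value
```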
However, this optimization kinda sucks in other ways. Firstly, it's part of the `Dval`, so it's not generalizable - I can't use it to optimize other parts of the codebase that don't use a `Dval` directly. Secondly, it's not type-safe - it would be easy to accidentally lose a Task somewhere, and not get its value.
This is particularly a problem because I actually want to ensure that all Tasks complete before new code is executed. If you have two DB calls, let's say a `DB::set` and then a `DB::get`, I want to make sure the `DB::set` completes before I start the `DB::get`. With the `DTask` implementation, that's extremely hard to get right, and it would require checking almost every usage in the codebase.
Optimization 2b: TaskOrValue
Ok, can we do this in a type-safe way? Yes we can, with this simple F# construct:
```fsharp
type TaskOrValue<'a> =
  | Value of 'a
  | Task of Task<'a>
```
Perfect! If there's a simple value, we just keep it with no Task overhead. And if there truly is a Task, it can be passed down to whatever is using it.
And to make this even better, I made a `taskv` construct to make this seamless. This allowed me to use `TaskOrValue`s everywhere safely, just by replacing `task` with `taskv`. I'm not going to go into how `taskv` was implemented, except to say that it uses F#'s Computation Expressions, in a very similar way to `task`. The implementation is in the Dark repo.
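The core of a `taskv`-style CE is a bind that stays on the cheap path until a real Task appears. Here's a minimal sketch (my reconstruction, not Dark's actual implementation):

```fsharp
open System.Threading.Tasks

type TaskOrValue<'a> =
  | Value of 'a
  | Task of Task<'a>

// Bind only allocates Task machinery when a real Task is present;
// chains of plain Values never touch a Task at all.
let bind (f : 'a -> TaskOrValue<'b>) (tv : TaskOrValue<'a>) : TaskOrValue<'b> =
  match tv with
  | Value v -> f v
  | Task t ->
    Task(
      task {
        let! v = t
        match f v with
        | Value result -> return result
        | Task t2 -> return! t2
      })
```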
| | Requests/sec | Avg req duration | % of base case |
| --- | --- | --- | --- |
| No async | 14824.56 | 7.00ms | 100% |
| Tasks | 6561.37 | 15.44ms | 44% |
| Tasks w/ sync caller opt | 7557.61 | 13.26ms | 51% |
| Hidden DTask | 13082.62 | 7.98ms | 88% |
| TaskOrValue | 5081.87 | 19.58ms | 34% |
Um, that's worse! Shit, that's a lot worse! And it's worse than the `Task` version. Maybe the sync caller optimization works here?
| | Requests/sec | Avg req duration | % of base case |
| --- | --- | --- | --- |
| No async | 14824.56 | 7.00ms | 100% |
| Tasks | 6561.37 | 15.44ms | 44% |
| Tasks w/ sync caller opt | 7557.61 | 13.26ms | 51% |
| Hidden DTask | 13082.62 | 7.98ms | 88% |
| TaskOrValue | 5081.87 | 19.58ms | 34% |
| TaskOrValue w/ sync caller opt | 8620.01 | 11.87ms | 58% |
OK, now we're getting somewhere. But it's still not much of an improvement.
Optimization 2c: ValueTask
It turns out .NET has a built-in version of my `TaskOrValue`, called a `ValueTask`. There's a lot to be said about ValueTasks, but I'll give you the basics:

- They are the same concept as a `TaskOrValue`
- They allow wrapping a non-Task very cheaply
- They're structs, so they go on the stack, not the heap
- They reduce garbage collection overhead
- BUT: you can only use them once!

Let's compare TaskOrValues to ValueTasks quickly:

- ValueTasks are on the stack, TaskOrValues are on the heap (though I briefly tried moving them to the stack using the `Struct` annotation, I don't remember what the outcome was)
- TaskOrValues can be used multiple times, ValueTasks can be used only once
Only using them once is an odd—and somewhat unfortunate—restriction: if you do this wrong, it will fail at run-time. This means that you can't do this:
```fsharp
task {
  let myTask = MethodThatReturnsAValueTask()
  let! y = myTask
  ... // do some other shit
  let! z = myTask // BAD THINGS HAPPEN HERE
  ...
}
```
What would happen? Well, when you await `myTask` the first time, it returns the value but also gives up the underlying memory, which may be reused before the second usage.
(I understand that sort of thing makes sense in C#, but when you're trying to use a language with immutable values like F#, this kinda sucks. Especially because there's no type checker warning you about this, though I wonder if it would be possible to create an F# Analyzer that detects this.)
Fortunately, if you're using an F# computation expression, it's actually pretty hard to use a Task twice.
Actually using ValueTasks in F#
The biggest challenge I had with using ValueTasks is that the documentation is, well, extremely challenging to understand, so here's what I learned (please correct me if I'm wrong):
- The highest performing Task CEs for F# are in Ply.
- If you just want to use regular Tasks in F#, use Ply's `task` computation expression. If you want to put a value in a Task, you can do `task { return value }` or `Task.FromResult value`. This is an actual .NET Task, so nothing new here. (This is what I used in the benchmarks so far.)
- If you want it to be as fast as possible, use the `uply` CE, which uses a `Ply<'a>` type. A Ply is similar to a Task but lower overhead. To put a value in a Ply outside a `uply` block, you call `Ply.Ply value`. A Ply is similar to a ValueTask: do not use it twice [3]. You can convert a Ply to a Task using `Ply.TplPrimitives.runPlyAsTask`, or by doing `task { return! uply { return 1 } }`. Tasks can also be used directly in a `uply` using the `let!` syntax.
- If you want to stay a little closer to the main path, but want to use a ValueTask, use `vtask` or `uvtask`. The difference between the two is that the second does not use an execution bubble, which gives you access to Async-local variables. If you don't need that, as I understand it, it seems safe to just use `uvtask` [4]. Remember that it's a ValueTask: do not use it twice. You can convert ValueTasks to full Tasks using `.AsTask()`.
- Native built-in Tasks are coming in F# 6 (note that these benchmarks do NOT use the new F# 6 Task CEs).
Let's measure:
| | Requests/sec | Avg req duration | % of base case |
| --- | --- | --- | --- |
| No async | 14824.56 | 7.00ms | 100% |
| Tasks | 6561.37 | 15.44ms | 44% |
| *Tasks | 7557.61 | 13.26ms | 51% |
| Hidden DTask | 13082.62 | 7.98ms | 88% |
| TaskOrValue | 5081.87 | 19.58ms | 34% |
| *TaskOrValue | 8620.01 | 11.87ms | 58% |
| *ValueTask (vtask) | 8252.49 | 12.82ms | 55% |
| *Unsafe ValueTask (uvtask) | 11780.29 | 8.63ms | 79% |
| *Ply (uply) | 13530.62 | 7.68ms | 91% |
| Ply (uply) | 13568.88 | 7.54ms | 91.5% |

* indicates using the sync caller optimization
So we can see here that removing the execution bubble compounds with the move to a ValueTask, and going to Ply gets us nearly to the performance of not having async at all. It also doesn't have the safety problem of the Hidden DTask, while being simpler and more performant!

An interesting point here is that with Ply, there is no longer a benefit to using the sync caller optimization. Creating a Ply must be so cheap that having a separate code path to avoid it isn't worth it.
Disclaimer
Let me finish by saying that these benchmarks are at best directional. I didn't run them a million times, remove top and bottom results, make sure my laptop wasn't running a video, etc. Also, they were run in Docker on mac, and virtualization isn't great for benchmarks. I observed different runs being 10% off frequently. I even reported to two decimal digits results that absolutely do not represent that level of accuracy! So don't pin too much on these.
Sprinkle on some flame bait
And so with that disclaimer, let's compare this to Rust and OCaml! To make these even more unsound, I ran those benchmarks on my Mac a while ago instead of in my Docker container today, so I took the old results and scaled them. There wasn't a great way to scale them, given that the "scaling factors" of the 3 benchmarks that ran in both datasets were 51%, 63% and 64%, so I averaged them to 59%. So the reqs/s presented for OCaml and Rust here are the old numbers, scaled by 59%. Meanwhile, the % of base case uses the reqs/s divided by the base case reported in the old results:
| | Requests/sec | Avg req duration | % of base case |
| --- | --- | --- | --- |
| No async | 14824.56 | 7.00ms | 100% |
| Tasks | 6561.37 | 15.44ms | 44% |
| *Tasks | 7557.61 | 13.26ms | 51% |
| Hidden DTask | 13082.62 | 7.98ms | 88% |
| TaskOrValue | 5081.87 | 19.58ms | 34% |
| *TaskOrValue | 8620.01 | 11.87ms | 58% |
| *ValueTask (vtask) | 8252.49 | 12.82ms | 55% |
| *Unsafe ValueTask (uvtask) | 11780.29 | 8.63ms | 79% |
| *Ply (uply) | 13530.62 | 7.68ms | 91% |
| Ply (uply) | 13568.88 | 7.54ms | 91.5% |
| OCaml (sync-only) | 7252 | ?? | 53% |
| OCaml async | 5803 | ?? | 42.3% |
| Rust sync only | 7000 | ?? | 51% |

* indicates using the sync caller optimization
I mentioned above that the OCaml code runs in 1ms and the F# code runs in 9ms. Those aren't apples to apples comparisons. That said, neither are these benchmarks, as the OCaml implementation is single-threaded, while the F# implementation is not (and the benchmark machine has 8 cores).
I'm sure the Rust and OCaml numbers could improve if people spent as much time on them as I did on the F# numbers (please, feel free!), especially since F# is on the latest .NET 6 and the others are on old, no doubt less optimized, versions.
Ad
You should try Darklang.
Additional reading
Want to learn more about Tasks? Here are some things I read to find my way here:
- It seems that Tasks are coming natively to F# 6! Check out the RFC and the recently merged PR
- The source code for Ply
- Info about ValueTasks
- I didn't actually use this, but I enjoyed reading it! Writing High Performance F# code
Nino Floris, who wrote Ply, took the time to write detailed notes, which are so good I'm going to include them verbatim (I've also updated the copy where relevant):
[1] 'Technically' it binds to the task and stores any result as its own, passing itself onward.
[2] C# rewrites an entire method with N yield points into one big state machine (in F#, limited by CE codegen, these are instead N continuations + allocs, extra method call overhead, etc.). F# 6 will see similar functionality as C# through the new resumable code functionality.
[3] In this case it's really not advised to use it twice, as it's not wrapping a Task but an awaitable, a somewhat lower-level thing meant to represent a single fiber of computation plus associated state. Reusing a constant Ply.Ply value is fine though.
[4] it has to do with how things like AsyncLocals (and some other things like sync contexts) are scoped. A normal C# async await method sets up a scope/bubble such that any changes are reverted once that async method completes. Contrast that with normal sync methods where changes to an AsyncLocal will leak and stick around.
The problem is, for us to opt into that behavior we have to pass the CE's run function to a constructor which forces the function to become an actual function value (FSharpFunc derived class) allocated on the heap. All the 'u' versions of the CEs avoid that particular allocation with the tradeoff that any async local changes before the first yield now leak like a normal sync method.
You can see the difference in this tweet by David Fowler https://twitter.com/davidfowl/status/1296975203403042816
The benchmarks
- No async
- Tasks
- Tasks w/ sync opt
- Hidden DTask
- TaskOrValue
- TaskOrValue w/ sync opt
- ValueTask (vtask) w/ sync opt
- Unsafe ValueTask (uvtask) w/ sync opt
- Ply (uply) w/ sync opt
- Ply (uply)
- Rust (sync only)
- OCaml (sync)
- OCaml async
Another ad
You should try Darklang. I wouldn't say it's exceptionally good yet, but it's got a bunch of ideas you don't see anywhere else in programming. Also you'll end up on the mailing list so I can send you more blog posts like this one.
Thanks to Phillip Carter and Nino Floris for providing feedback on drafts of this post!