Tuesday, 20 December 2011

Golang: goroutines performance

Intro


In this post I'll try to measure goroutine performance. Goroutines are something like lightweight threads: a built-in Go primitive that provides multitasking (together with channels).

The documentation tells us:

It is practical to create hundreds of thousands of goroutines in the same address space.

So the point of this post is to check this claim and to figure out how performance suffers from such a large number of concurrently running functions.

Memory


The size of a newly created goroutine is not documented; it is said to be about a few kilobytes. Tests on different machines narrow this number down to 4-4.5 KB. So 5 GB is more than enough to run 1 million goroutines.
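
For reference, here is a very rough way to estimate this number yourself. The sketch below is my own (it uses today's runtime.ReadMemStats API rather than the r60 one): it parks a large number of goroutines on a channel and divides the growth of memory obtained from the OS by their count.

package main

import (
        "fmt"
        "runtime"
)

func main() {
        const n = 100000 // number of goroutines to park

        block := make(chan struct{}) // never closed, so every goroutine stays parked

        var before, after runtime.MemStats
        runtime.GC()
        runtime.ReadMemStats(&before)

        for i := 0; i < n; i++ {
                go func() {
                        <-block // block forever
                }()
        }

        runtime.GC()
        runtime.ReadMemStats(&after)

        // Very rough: Sys includes heap, stacks and runtime structures.
        fmt.Printf("~%d bytes per goroutine\n", (after.Sys-before.Sys)/n)
}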

Performance


Let us figure out how much speed we lose when running a function in a goroutine. As you probably know, it is very easy: just add the go keyword before the function call:

go testFunc()
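
For completeness, here is a minimal runnable example of this pattern (the function and channel names are mine): the function runs in a goroutine and reports back over a channel, which is also how the test functions below work.

package main

import "fmt"

// testFunc prints a message and signals completion over the channel.
func testFunc(done chan<- bool) {
        fmt.Println("running in a goroutine")
        done <- true
}

func main() {
        done := make(chan bool)
        go testFunc(done) // the only difference from a normal call is the go keyword
        <-done            // wait for the goroutine to finish
}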

Goroutines are multiplexed onto OS threads. By default, if the GOMAXPROCS environment variable is not set, the program uses only one thread. To take advantage of all CPU cores you need to specify their number, for example:

export GOMAXPROCS=2

This value is read by the runtime, so there is no need to recompile the program every time you change it.
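
The same value can also be queried or changed from inside the program. A small illustration of my own (using the current runtime API): runtime.GOMAXPROCS returns the previous setting, and an argument of 0 only queries it.

package main

import (
        "fmt"
        "runtime"
)

func main() {
        fmt.Println("logical CPUs:", runtime.NumCPU())
        prev := runtime.GOMAXPROCS(2) // use two threads; returns the previous setting
        fmt.Println("previous GOMAXPROCS:", prev)
        fmt.Println("current GOMAXPROCS:", runtime.GOMAXPROCS(0)) // 0 only queries
}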

As far as I can tell, the time is spent mostly on goroutine creation and on switching between them, and sometimes also on moving goroutines to other threads and on communication between goroutines in different threads. To avoid the last two cases, let us start by using only one thread.

All actions were performed on my nettop:

  • Atom D525 Dual Core 1.8 GHz
  • 4 GB DDR3
  • Go r60.3
  • Arch Linux x86_64

Methodology


Here is a generator of test functions:

// genTest returns a test function that computes math.Sqrt(13) n times
// and then signals completion on the result channel (needs "math" imported).
func genTest(n int) func(res chan<- interface{}) {
        return func(res chan<- interface{}) {
                for i := 0; i < n; i++ {
                        math.Sqrt(13)
                }
                res <- true
        }
}

And this is how we get a set of functions calculating sqrt(13) 1, 10, 100, 1000 and 5000 times respectively:

testFuncs := []func(chan<- interface{}){
        genTest(1),
        genTest(10),
        genTest(100),
        genTest(1000),
        genTest(5000),
}

I run each function X times in a loop and then in X goroutines, and compare the results. Garbage collection should of course be kept in mind: to reduce its influence I explicitly call runtime.GC() after all goroutines finish, and only then note the finish time. For accuracy, each test is performed many times. The total run took about 16 hours.
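
The measurement code itself is not shown here, but the goroutine half of it looks roughly like this sketch (my own reconstruction, using the current time API rather than the r60 one): start X goroutines, wait for all of them on the result channel, force a GC and only then stop the clock.

import (
        "runtime"
        "time"
)

// benchGoroutines runs fn in x concurrent goroutines, waits for all of them
// to report back, forces a garbage collection and only then stops the clock.
func benchGoroutines(fn func(chan<- interface{}), x int) time.Duration {
        res := make(chan interface{}, x)
        start := time.Now()

        for i := 0; i < x; i++ {
                go fn(res)
        }
        for i := 0; i < x; i++ {
                <-res // wait until every goroutine has reported back
        }

        runtime.GC() // reduce the collector's influence before taking the time
        return time.Since(start)
}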

One thread


export GOMAXPROCS=1

[Chart: goroutines performance 1_1]

The graph shows that a function whose running time is approximately equal to a single sqrt() computation runs about 4 times slower in a goroutine.

Let us consider the 4 remaining functions in more detail:

[Chart: goroutines performance 1_2]

You can see that even 700 thousand simultaneously running goroutines don't reduce performance by more than 80%. And now the most impressive part: starting from sqrt() x 1000, the overhead is less than ~2%, and at 5000 times it is only 1%. It also looks like this number does not depend on the number of goroutines, so the only limitation is memory.

Summary:

If the run time of a piece of independent code is longer than calculating the square root 10 times, and you want to run it concurrently, do not hesitate to run it in a goroutine. Moreover, if you can easily collect 10 or even 100 such code parts together (see the sketch below), the loss of performance drops to only 20% or 2% respectively.
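
To illustrate the batching idea, here is a sketch of my own (not code from the benchmark): instead of one goroutine per piece of work, each goroutine handles a batch of 100 pieces, so the goroutine overhead is paid once per batch.

package main

import "math"

// work stands in for an independent piece of code that takes at least as
// long as ~10 square roots (the threshold mentioned above).
func work() {
        for i := 0; i < 10; i++ {
                math.Sqrt(13)
        }
}

func main() {
        const parts = 100000  // total pieces of independent work
        const batchSize = 100 // 100 parts per goroutine: ~2% loss per the numbers above
        done := make(chan bool, parts/batchSize)

        for i := 0; i < parts/batchSize; i++ {
                go func() {
                        for j := 0; j < batchSize; j++ {
                                work()
                        }
                        done <- true
                }()
        }
        for i := 0; i < parts/batchSize; i++ {
                <-done
        }
}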

Several threads


Now let us consider the situation where we want to use several processor cores. In my case there are two of them:

export GOMAXPROCS=2

Running our testing program again:

[Chart: goroutines performance 2_1]

Here you can see that even though the number of cores has doubled, the run time of the first two functions has increased! This is probably because moving them to another thread is more expensive than just executing them :) The current scheduler cannot detect such situations, but the Go authors assure us this will be improved in the future.

[Chart: goroutines performance 2_2]

As you can see, the last two functions use both cores to the fullest. On my nettop their execution times are ~45µs and ~230µs respectively.

Conclusion


Even despite the language's youth and the temporary scheduler implementation, goroutine performance feels amazing, especially when combined with Go's simplicity. I'm very impressed. Many thanks to the Go dev team!

I would also suggest thinking twice before running goroutines whose lifetime is less than 1µs, and not hesitating at all if it is more than 1ms :)
