Concurrent code slower than sequential code on parallel problem?

Issue

I wrote some code to execute Monte Carlo simulations. The first thing I wrote was this sequential version:

 func simulationSequential(experiment func() bool, numTrials int) float64 {
    ocurrencesEvent := 0

    for trial := 0; trial < numTrials; trial++ {
        eventHappend := experiment()
        if eventHappend {
            ocurrencesEvent++
        }
    }

    return float64(ocurrencesEvent) / float64(numTrials)
}

Then, I figured I could run some of the experiments concurrently and get a result faster using my laptop’s multiple cores. So, I wrote the following version:

func simulationConcurrent(experiment func() bool, numTrials, nGoroutines int) float64 {
    ch := make(chan int)
    var wg sync.WaitGroup

    // Launch work in multiple goroutines
    for i := 0; i < nGoroutines; i++ {
        wg.Add(1)
        go func() {
            localOcurrences := 0
            for j := 0; j < numTrials/nGoroutines; j++ {
                eventHappend := experiment()
                if eventHappend {
                    localOcurrences++
                }
            }
            ch <- localOcurrences
            wg.Done()
        }()
    }

    // Close the channel when all the goroutines are done
    go func() {
        wg.Wait()
        close(ch)
    }()

    // Acummulate the results of each experiment
    ocurrencesEvent := 0
    for localOcurrences := range ch {
        ocurrencesEvent += localOcurrences
    }

    return float64(ocurrencesEvent) / float64(numTrials)
}

To my surprise, when I run benchmarks on the two versions, I get that the sequential is faster than the concurrent one, with the concurrent version getting better as I decrease the number of goroutines. Why does this happen? I thought the concurrent version will be faster since this is a highly parallelizable problem.

Here is my benchmark’s code:

func tossEqualToSix() bool {
    // Simulate the toss of a six-sided die
    roll := rand.Intn(6) + 1

    if roll != 6 {
        return false
    }
    return true
}

const (
    numsSimBenchmark       = 1_000_000
    numGoroutinesBenckmark = 10
)

func BenchmarkSimulationSequential(b *testing.B) {
    for i := 0; i < b.N; i++ {
        simulationSequential(tossEqualToSix, numsSimBenchmark)
    }
}

func BenchmarkSimulationConcurrent(b *testing.B) {
    for i := 0; i < b.N; i++ {
        simulationConcurrent(tossEqualToSix, numsSimBenchmark, numGoroutinesBenckmark)
    }
}

And the results:

goos: linux
goarch: amd64
pkg: github.com/jpuriol/montecarlo
cpu: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
BenchmarkSimulationSequential-8               36          30453588 ns/op
BenchmarkSimulationConcurrent-8                9         117462720 ns/op
PASS
ok      github.com/jpuriol/montecarlo   2.478s

You can download my code from Github.

Solution

I thought I would elaborate on my comment and post it with code and benchmark result.

Examine function uses package level rand functions from rand package. These functions under the hood uses globalRand instance of rand.Rand. For example func Intn(n int) int { return globalRand.Intn(n) }. As the random number generator is not thread safe, the globalRand is instantiated in following way:

/*
 * Top-level convenience functions
 */

var globalRand = New(&lockedSource{src: NewSource(1).(*rngSource)})

type lockedSource struct {
    lk  sync.Mutex
    src *rngSource
}

func (r *lockedSource) Int63() (n int64) {
    r.lk.Lock()
    n = r.src.Int63()
    r.lk.Unlock()
    return
}
...

This means that all invocations of rand.Intn are guarded by the global lock. The consequence is that examine function "works sequentially", because of the lock. More specifically, each call to rand.Intn will not start generating a random number before the previous call completes.

Here is redesigned code. Each examine function has its own random generator. The assumption is that single examine function is used by one goroutine, so it does not require lock protection.

package main

import (
    "math/rand"
    "sync"
    "testing"
    "time"
)

func simulationSequential(experimentFuncFactory func() func() bool, numTrials int) float64 {
    experiment := experimentFuncFactory()
    ocurrencesEvent := 0

    for trial := 0; trial < numTrials; trial++ {
        eventHappend := experiment()
        if eventHappend {
            ocurrencesEvent++
        }
    }

    return float64(ocurrencesEvent) / float64(numTrials)
}

func simulationConcurrent(experimentFuncFactory func() func() bool, numTrials, nGoroutines int) float64 {
    ch := make(chan int)
    var wg sync.WaitGroup

    // Launch work in multiple goroutines
    for i := 0; i < nGoroutines; i++ {
        wg.Add(1)
        go func() {
            experiment := experimentFuncFactory()
            localOcurrences := 0
            for j := 0; j < numTrials/nGoroutines; j++ {
                eventHappend := experiment()
                if eventHappend {
                    localOcurrences++
                }
            }
            ch <- localOcurrences
            wg.Done()
        }()
    }

    // Close the channel when all the goroutines are done
    go func() {
        wg.Wait()
        close(ch)
    }()

    // Acummulate the results of each experiment
    ocurrencesEvent := 0
    for localOcurrences := range ch {
        ocurrencesEvent += localOcurrences
    }

    return float64(ocurrencesEvent) / float64(numTrials)
}

func tossEqualToSix() func() bool {
    prng := rand.New(rand.NewSource(time.Now().UnixNano()))
    return func() bool {
        // Simulate the toss of a six-sided die
        roll := prng.Intn(6) + 1

        if roll != 6 {
            return false
        }
        return true
    }
}

const (
    numsSimBenchmark       = 5_000_000
    numGoroutinesBenchmark = 10
)

func BenchmarkSimulationSequential(b *testing.B) {
    for i := 0; i < b.N; i++ {
        simulationSequential(tossEqualToSix, numsSimBenchmark)
    }
}

func BenchmarkSimulationConcurrent(b *testing.B) {
    for i := 0; i < b.N; i++ {
        simulationConcurrent(tossEqualToSix, numsSimBenchmark, numGoroutinesBenchmark)
    }
}

Benchmark result are as follows:

goos: darwin
goarch: arm64
pkg: scratchpad
BenchmarkSimulationSequential-8           20      55142896 ns/op
BenchmarkSimulationConcurrent-8           82      12944360 ns/op

Answered By – Jaroslaw

Answer Checked By – Pedro (GoLangFix Volunteer)

Leave a Reply

Your email address will not be published.