Loop through all files in all folders recursively as fast as possible in GOLANG

Issue

I’ve spent the whole day on forums and still can’t understand or solve this problem.

I wrote a function that walks a folder and all of its sub-folders. It does two things:
– For each file found, print the name of the file.
– For each folder found, call the same function again to find that folder’s files and sub-folders.

In short, the program recursively lists every file in a tree. My goal is to do this as fast as possible, so I start a new goroutine every time I come across a new folder.

PROBLEM:
When the tree is too large (many nested folders and sub-folders), the program creates too many threads and fails with an error. I raised the thread limit, but then the PC itself can no longer keep up :/

So my question is: how can I build a worker system (with a fixed pool size) that fits my code?
However much I search, I can’t see how to, for example, start new goroutines only up to a certain limit and then wait for the backlog to drain before starting more.


Source code:
https://github.com/LaM0uette/FilesDIR/tree/V0.5

main:

package main

import (
    "FilesDIR/globals"
    "FilesDIR/task"
    "fmt"
    "log"
    "runtime/debug"
    "sync"
    "time"
)

func main() {
    timeStart := time.Now()
    debug.SetMaxThreads(5 * 1000)

    var wg sync.WaitGroup

    // task.DrawStart()

    /*
        err := task.LoopDir(globals.SrcPath)
        if err != nil {
            log.Print(err.Error())
        }
    */

    err := task.LoopDirsFiles(globals.SrcPath, &wg) // globals.SrcPath = my path with ~2000000 files (a server at my company)
    if err != nil {
        log.Print(err.Error())
    }

    wg.Wait()

    fmt.Println("FINI: Nb Fichiers: ", task.Id)

    timeEnd := time.Since(timeStart)
    fmt.Println(timeEnd)
}

task:

package task

import (
    "fmt"
    "io/ioutil"
    "log"
    "os"
    "path/filepath"
    "strings"
    "sync"
    "time"
)

var Id = 0

// LoopDir TODO: code to delete
func LoopDir(path string) error {
    var wg sync.WaitGroup

    countDir := 0

    err := filepath.Walk(path, func(path string, info os.FileInfo, err error) error {
        if err != nil {
            return err
        }

        if info.IsDir() {
            wg.Add(1)
            countDir++

            go func() {
                err := loopFiles(path, &wg)
                if err != nil {
                    log.Println(err.Error())
                }
            }()
        }

        return nil
    })
    if err != nil {
        return err
    }

    wg.Wait()
    fmt.Println("Finished", countDir, Id)
    return nil
}

// loopFiles TODO: code to delete
func loopFiles(path string, wg *sync.WaitGroup) error {

    files, err := ioutil.ReadDir(path)
    if err != nil {
        wg.Done()
        return err
    }

    for _, file := range files {
        if !file.IsDir() {
            go fmt.Println(file.Name())
            Id++
        }
    }

    wg.Done()
    return nil
}

func LoopDirsFiles(path string, wg *sync.WaitGroup) error {
    wg.Add(1)
    defer wg.Done()

    files, err := ioutil.ReadDir(path)
    if err != nil {
        return err
    }

    for _, file := range files {
        if !file.IsDir() && !strings.Contains(file.Name(), "~") {
            fmt.Println(file.Name(), Id)
            Id++
        } else if file.IsDir() {
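            // Build the path outside the closure so the goroutine does not share the loop variable.
            sub := filepath.Join(path, file.Name())
            go func() {
                // A goroutine-local err avoids a data race on the outer variable.
                if err := LoopDirsFiles(sub, wg); err != nil {
                    log.Print(err)
                }
            }()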
            time.Sleep(20 * time.Millisecond)
        }
    }
    return nil
}

Solution

If you don’t want to use an external package, you can create a separate worker routine for file processing and start as many workers as you want. Then walk the tree recursively in your main goroutine and send the jobs out to the workers. Whenever a worker is free, it picks up the next job from the jobs channel and processes it.

var (
    // A value, not a pointer: a nil *sync.WaitGroup would panic on the first wg.Add.
    wg   sync.WaitGroup
    jobs = make(chan string)
)

func loopFilesWorker() {
    for path := range jobs {
        files, err := ioutil.ReadDir(path)
        if err != nil {
            // Log and move on: returning here would shut this worker down for good.
            log.Println(err)
            wg.Done()
            continue
        }

        for _, file := range files {
            if !file.IsDir() {
                fmt.Println(file.Name())
            }
        }
        wg.Done()
    }
}

func LoopDirsFiles(path string) error {
    files, err := ioutil.ReadDir(path)
    if err != nil {
        return err
    }
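    // Add this path as a job for the workers.
    // Call wg.Add before starting the goroutine, so that wg.Wait in main cannot return before the job is counted.
    // The send itself runs in a goroutine because it blocks while every worker is busy.
    wg.Add(1)
    go func() {
        jobs <- path
    }()
    for _, file := range files {
        if file.IsDir() {
            // Recurse further into the tree; log errors so one unreadable folder does not go unnoticed.
            if err := LoopDirsFiles(filepath.Join(path, file.Name())); err != nil {
                log.Println(err)
            }
        }
    }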
    return nil
}

func main() {
    // Start as many workers as you want; here, 10.
    for w := 1; w <= 10; w++ {
        go loopFilesWorker()
    }
    // Start the recursion.
    LoopDirsFiles(globals.SrcPath)
    // Wait until the workers have processed every queued folder.
    wg.Wait()
}

Answered By – Fenistil

Answer Checked By – Marilyn (GoLangFix Volunteer)
