Multiprocess N files: Pool


In this example we "analyze" files by counting their characters: the total, and how many of them are digits, letters, spaces, or anything else.

Analyze N files in parallel.
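To make the categories concrete, here is a minimal sketch of the per-character classification applied to a plain string instead of a file. The helper names `kind` and `count_kinds` are hypothetical, chosen only for this illustration; they are not part of the example script.

```python
from collections import Counter


def kind(char):
    """Return the category name for a single character (hypothetical helper)."""
    if char.isdigit():
        return "digits"
    if char.isalpha():
        return "letters"
    if char == " ":
        return "spaces"
    return "other"  # punctuation, newlines, tabs, ...


def count_kinds(text):
    """Count how many characters of each category appear in text."""
    counts = Counter(kind(char) for char in text)
    counts["total"] = len(text)
    return dict(counts)


print(count_kinds("abc 123!\n"))
# {'letters': 3, 'spaces': 1, 'digits': 3, 'other': 2, 'total': 9}
```

Note that the newline at the end of the string lands in the "other" category, just like the exclamation mark.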


examples/multiprocess/multiprocess_files.py
import multiprocessing as mp
import os
import sys

def analyze(filename):
    print("Process {:>5} analyzing {}".format(os.getpid(), filename))
    digits = 0
    letters = 0
    spaces = 0
    other = 0
    total = 0
    with open(filename) as fh:
        for line in fh:
            for char in line:
                total += 1
                # Put every character into exactly one category.
                if char.isdigit():
                    digits += 1
                elif char.isalpha():
                    letters += 1
                elif char == ' ':
                    spaces += 1
                else:
                    other += 1
    return {
        'filename': filename,
        'total': total,
        'digits': digits,
        'spaces': spaces,
        'letters': letters,
        'other': other,
    }

def main():
    if len(sys.argv) < 3:
        sys.exit(f"Usage: {sys.argv[0]} POOL_SIZE FILEs")
    size = int(sys.argv[1])
    files = sys.argv[2:]

    with mp.Pool(size) as pool:
        results = pool.map(analyze, files)
    for res in results:
        print(res)

if __name__ == '__main__':
    main()

$ python multiprocess_files.py 3 multiprocess*.py


Process 12093 analyzing multiprocess_files.py
Process 12093 analyzing multiprocess_pool_async.py
Process 12095 analyzing multiprocess_load.py
Process 12094 analyzing multiprocessing_and_logging.py
Process 12094 analyzing multiprocess_pool.py
{'filename': 'multiprocess_files.py', 'total': 47, 'digits': 0, 'spaces': 37, 'letters': 6, 'other': 4}
{'filename': 'multiprocessing_and_logging.py', 'total': 45, 'digits': 0, 'spaces': 27, 'letters': 11, 'other': 7}
{'filename': 'multiprocess_load.py', 'total': 32, 'digits': 0, 'spaces': 20, 'letters': 7, 'other': 5}
{'filename': 'multiprocess_pool_async.py', 'total': 30, 'digits': 0, 'spaces': 16, 'letters': 6, 'other': 8}
{'filename': 'multiprocess_pool.py', 'total': 21, 'digits': 0, 'spaces': 11, 'letters': 6, 'other': 4}

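We asked for 3 processes but passed 5 files, so the process IDs show that two of the workers each handled two files. The counts themselves depend on the contents of the files you pass in. The value returned by the worker function can be any picklable Python data structure; a dictionary is usually a good choice.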
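Because each worker returns a plain dictionary, the parent process can combine the per-file results once `pool.map` has returned. The following is only a sketch: `combine` is a hypothetical helper, and the sample dictionaries are made-up values that merely follow the shape returned by `analyze`.

```python
from collections import Counter


def combine(results):
    """Sum the numeric counters of several per-file result dictionaries."""
    totals = Counter()
    for res in results:
        for key, value in res.items():
            if key != "filename":  # the filename is a label, not a counter
                totals[key] += value
    return dict(totals)


# Made-up sample data in the same shape that analyze() returns.
sample = [
    {"filename": "a.txt", "total": 10, "digits": 2, "spaces": 3, "letters": 4, "other": 1},
    {"filename": "b.txt", "total": 5, "digits": 0, "spaces": 1, "letters": 3, "other": 1},
]
print(combine(sample))
# {'total': 15, 'digits': 2, 'spaces': 4, 'letters': 7, 'other': 2}
```

In the script, `combine(results)` could be called right after the `with mp.Pool(size) as pool:` block to print one summary line for all files.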