At one of my clients we had a Bash script that grepped a huge log file 20 times in order to generate a report. It created a lot of load on the server, as grep read the entire file 20 times.

As we were converting our shell scripts to Python anyway, I thought I could rewrite it in Python, go over the file once instead of 20 times, and use Python's regex engine to extract the same information.

The Python version should be faster, I reasoned; after all, we all know file I/O is way more expensive than in-memory operations.

Once I started the conversion, my assumption turned out to be incorrect. Our code became way slower. Let's look at a simulation of it.

Generate the big log file

In order to make it easy to reproduce the case I created a script that could create a big text file.

examples/python/create-big-file.py

import sys
import random

if len(sys.argv) != 4:
    exit(f"{sys.argv[0]} FILENAME NUMBER-OF-ROWS LENGTH-OF-ROWS")

_, filename, rows, length = sys.argv

line = "x" * int(length) + "\n"

# Pick the row that will contain the single "y"
match = random.randint(0, int(rows) - 1)

with open(filename, 'w') as fh:
    for i in range(int(rows)):
        if i == match:
            fh.write("x" * (int(length)-2) + "yx\n")
        else:
            fh.write(line)

We can run it like this, indicating the name of the file we would like to create, the number of rows and the length of rows.

python create-big-file.py FILENAME NUMBER-OF-ROWS LENGTH-OF-ROWS

It will create a file full of the character "x", with a single "y" somewhere.

I think this is going to be good enough for our simple example.

Using grep

In the original shell script we had some 20 different calls to grep, but to make it simpler I wrote this shell script that runs the same regex multiple times.

examples/grep_speed.sh

filename=$1
limit=$2

for ((i=1;i<=$limit;i++));
do
    grep y "$filename"
done

You can pass it the name of the data file and the number of times you'd like to run grep.

Grep with Python regexes

I have an implementation in Python as well.

examples/grep_speed.py

import sys
import re

if len(sys.argv) != 3:
    exit(f"{sys.argv[0]} FILENAME LIMIT")

_, filename, limit = sys.argv

with open(filename) as fh:
    for line in fh:
        for _ in range(int(limit)):
            if re.search(r'y', line):
                print(line, end="")  # the line already ends with a newline

I know that in the simple case of finding a single "y" character I could use the index method or the find method, and those would probably be faster, but in our real cases we had much more complex regexes.
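For the fixed-string case, a sketch of what that would look like (grep_fixed is a hypothetical variant of grep_speed.py, not code from the original report):

```python
# Hypothetical variant of grep_speed.py: for a fixed string such as "y",
# a plain substring test skips the regex engine entirely.
def grep_fixed(filename, limit, needle="y"):
    with open(filename) as fh:
        for line in fh:
            for _ in range(int(limit)):
                if needle in line:
                    print(line, end="")  # the line already ends with a newline
```

Note that precompiling the regex with re.compile would probably not change the picture much either, since the re module caches compiled patterns internally.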

Comparing the speed

python create-big-file.py a.txt 1000000 50

Verify the file:

$ wc a.txt
 1000000  1000000 51000000 a.txt

$ grep y a.txt
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyx

$ time bash examples/grep_speed.sh a.txt 20

real  0m0.227s
user  0m0.055s
sys   0m0.172s

$ time python examples/grep_speed.py a.txt 20

real  0m9.509s
user  0m9.477s
sys   0m0.032s

grep is about 50 times faster than Python, even though grep had to read the file 20 times while Python only read it once.
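Part of the difference is presumably that GNU grep scans large buffers with heavily optimized search loops, while the Python version pays interpreter overhead on every line. One way to reduce that per-line overhead, sketched below under my own assumptions (grep_chunked is a hypothetical helper, not from the original scripts), is to read and search the file in big chunks, carrying over any partial last line to the next chunk:

```python
import re

# Hypothetical chunked variant: scan big blocks instead of one line at a
# time, cutting per-line Python overhead. The unterminated tail of each
# chunk is carried over so a line split across chunks is still seen whole.
def grep_chunked(filename, pattern=r"y", chunksize=1 << 20):
    regex = re.compile(pattern)
    matches = []
    tail = ""
    with open(filename) as fh:
        while True:
            chunk = fh.read(chunksize)
            if not chunk:
                break
            chunk = tail + chunk
            # Keep everything up to the last newline; save the rest for later.
            chunk, _, tail = chunk.rpartition("\n")
            for line in chunk.split("\n"):
                if regex.search(line):
                    matches.append(line)
        if tail and regex.search(tail):
            matches.append(tail)
    return matches
```

I have not benchmarked this against the line-by-line version on the big file, so treat it as an idea to try rather than a proven speedup.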

More complex grep

In the previous case we used a very simple regex. Now let's change it to a slightly more complex expression: instead of only looking for a single character, we also want to make sure it is between two identical characters.

examples/grep_speed_oxo.sh

filename=$1
limit=$2

for ((i=1;i<=$limit;i++));
do
    grep '\(.\)y\1' "$filename"
done

More complex Python

examples/grep_speed_oxo.py

import sys
import re

if len(sys.argv) != 3:
    exit(f"{sys.argv[0]} FILENAME LIMIT")

_, filename, limit = sys.argv

with open(filename) as fh:
    for line in fh:
        for _ in range(int(limit)):
            if re.search(r'(.)y\1', line):
                print(line, end="")  # the line already ends with a newline

You can try it yourself:

grep '\(.\)y\1' a.txt

Comparing the speed of the more complex examples

$ time bash examples/grep_speed_oxo.sh a.txt 20

real   0m0.196s
user   0m0.035s
sys    0m0.161s

$ time python examples/grep_speed_oxo.py a.txt 20

real   0m25.067s
user   0m24.972s
sys    0m0.016s

The speed of grep did not change, but Python became even slower. This time grep is more than 100 times faster than Python.
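A plausible reason is that the backreference pattern gives the engine no literal prefix to scan for, so it has to attempt a match at many positions of every line. A rough micro-benchmark sketch on a single non-matching row (timings will vary by machine, so no numbers are claimed here):

```python
import re
import timeit

line = "x" * 49 + "\n"  # a non-matching row, like almost every row in a.txt

simple = re.compile(r"y")
backref = re.compile(r"(.)y\1")

# Time many searches of each pattern over the same line.
t_simple = timeit.timeit(lambda: simple.search(line), number=100_000)
t_backref = timeit.timeit(lambda: backref.search(line), number=100_000)

print(f"simple:  {t_simple:.3f}s")
print(f"backref: {t_backref:.3f}s")
```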

Version information

$ python -V
Python 3.8.2

$ grep -V
grep (GNU grep) 3.4

Other cases

The results are consistent with what I saw during my work, but I wonder what the results would be if the file were larger than the available memory of my computer.
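I have not tested that case, but one way to attempt it in Python would be to mmap the file and let the operating system page it in as needed; the re module accepts any bytes-like buffer, including an mmap object. A sketch under those assumptions (grep_mmap is a hypothetical helper):

```python
import mmap
import re

# Sketch: search the file through mmap so the OS pages data in and out,
# rather than reading it all into Python strings. Patterns must be bytes.
def grep_mmap(filename, pattern=rb"y"):
    regex = re.compile(pattern)
    matches = []
    with open(filename, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for m in regex.finditer(mm):
                # Recover the whole line around each match.
                start = mm.rfind(b"\n", 0, m.start()) + 1
                end = mm.find(b"\n", m.end())
                if end == -1:
                    end = len(mm)
                matches.append(mm[start:end])
    return matches
```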

Conclusion

grep is so much faster than Python's regex engine that even reading the whole file several times does not matter.

Or I made a mistake somewhere that impacts the results.

Oh, and one more thing: I also created a Perl version of the code, and Perl was much faster than Python, though still slower than grep.