Compare the speed of grep with Python regexes
At one of my clients we had a Bash script that grepped a huge log file 20 times in order to generate a report. It put a lot of load on the server, as grep was reading the entire file 20 times.
As we were converting our Shell scripts to Python anyway I thought I could rewrite it in Python and go over the file once instead of 20 times and use the Regex engine of Python to extract the same information.
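The single-pass idea can be sketched roughly like this. It is only an illustration: the pattern names and the regexes in `PATTERNS` are made up for the example, and the real report used about 20 different expressions.

```python
import re
from collections import Counter

# Hypothetical report patterns, invented for this sketch.
PATTERNS = {
    "errors": re.compile(r"ERROR"),
    "timeouts": re.compile(r"timed? ?out", re.IGNORECASE),
    "logins": re.compile(r"login from \d+\.\d+\.\d+\.\d+"),
}


def count_matches(lines, patterns=PATTERNS):
    """Go over the lines once and count how many lines match each pattern."""
    counts = Counter({name: 0 for name in patterns})
    for line in lines:
        for name, regex in patterns.items():
            if regex.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "2020-07-01 ERROR disk full\n",
        "2020-07-01 INFO login from 10.0.0.7\n",
        "2020-07-01 ERROR request timed out\n",
    ]
    print(count_matches(sample))
```

In a real script `lines` would be the open file handle, so the file is read only once while every pattern is tried on each line.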
I expected the Python version to be faster because, as we all know, file I/O is way more expensive than in-memory operations.
Once I started the conversion, that expectation turned out to be wrong: our code became way slower. Let's see a simulation of it.
Generate the big log file
To make the case easy to reproduce, I wrote a script that creates a big text file.
examples/python/create-big-file.py
import sys
import random

if len(sys.argv) != 4:
    exit(f"{sys.argv[0]} FILENAME NUMBER-OF-ROWS LENGTH-OF-ROWS")

_, filename, rows, length = sys.argv
rows = int(rows)
length = int(length)

line = "x" * length + "\n"
# Index of the one row that gets the single "y" (randint is inclusive on both ends).
match = random.randint(0, rows - 1)

with open(filename, 'w') as fh:
    for i in range(rows):
        if i == match:
            fh.write("x" * (length - 2) + "yx\n")
        else:
            fh.write(line)
We can run it like this, indicating the name of the file we would like to create, the number of rows and the length of rows.
python create-big-file.py FILENAME NUMBER-OF-ROWS LENGTH-OF-ROWS
It will create a file full of the character "x", with a single "y" somewhere.
I think this is going to be good enough for our simple example.
Using grep
In the original shell script we had some 20 different calls to grep, but to make it simpler I wrote this shell script that runs the same regex multiple times.
examples/grep_speed.sh
filename=$1
limit=$2

for ((i=1;i<=$limit;i++)); do
    grep y "$filename"
done
You can pass the name of the data file and the number of times you'd like to run grep.
Grep with Python regexes
I have an implementation in Python as well.
examples/grep_speed.py
import sys
import re

if len(sys.argv) != 3:
    exit(f"{sys.argv[0]} FILENAME LIMIT")

_, filename, limit = sys.argv

with open(filename) as fh:
    for line in fh:
        # Read the file once, but apply the regex LIMIT times to every line.
        for _ in range(int(limit)):
            if re.search(r'y', line):
                print(line)
I know that in the simple case of finding a single "y" character I could use the index method or the find method, and those would probably be faster, but in our case we really had more complex regexes.
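For comparison, the fixed-string alternative mentioned above might look like the sketch below. The function names are made up for the example, and no timing is claimed here. For a fixed string the `in` operator answers the same question as `find` or `index`.

```python
import re


def lines_with_substring(lines, needle):
    """Yield the lines that contain a fixed string, using the `in` operator."""
    for line in lines:
        if needle in line:
            yield line


def lines_with_regex(lines, pattern):
    """Yield the lines that match a regex, for patterns that are not fixed strings."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line


if __name__ == "__main__":
    sample = ["xxxx\n", "xxyx\n"]
    print(list(lines_with_substring(sample, "y")))
    print(list(lines_with_regex(sample, r"y")))
```

Both functions return the same lines for a pattern like `y`. Only the regex version can handle the more complex expressions of the real report.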
Comparing the speed
python create-big-file.py a.txt 1000000 50
Verify the file:
$ wc a.txt 1000000 1000000 51000000 a.txt
# grep y a.txt xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyx
$ time bash examples/grep_speed.sh a.txt 20

real    0m0.227s
user    0m0.055s
sys     0m0.172s

$ time python examples/grep_speed.py a.txt 20

real    0m9.509s
user    0m9.477s
sys     0m0.032s
grep is about 40 times faster than Python (0.227s versus 9.509s), even though grep had to read the file 20 times while Python only read it once.
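To get a feeling for where the Python time goes, a micro-benchmark along these lines could be used. It is a sketch, not part of the measurements above: the function names and the `measure` helper are made up for the example, and the numbers will differ from machine to machine. It compares calling `re.search` with a pattern string (which looks the pattern up in the cache of the `re` module on every call), calling `search` on a precompiled pattern, and a plain substring test.

```python
import re
import timeit

LINE = "x" * 48 + "yx\n"    # same shape as the matching line in a.txt
COMPILED = re.compile(r"y")


def module_level(line=LINE):
    # re.search has to look up the compiled pattern in the module cache each time.
    return re.search(r"y", line) is not None


def precompiled(line=LINE):
    # The pattern object is created once, outside of the hot loop.
    return COMPILED.search(line) is not None


def substring(line=LINE):
    # No regex engine involved at all.
    return "y" in line


def measure(number=100_000):
    """Return the seconds taken by `number` calls of each variant."""
    return {
        func.__name__: timeit.timeit(func, number=number)
        for func in (module_level, precompiled, substring)
    }


if __name__ == "__main__":
    for name, seconds in measure().items():
        print(f"{name:>12}: {seconds:.3f}s")
```

If the precompiled variant turns out to be noticeably faster on your machine, moving the `re.compile` call out of the loop in grep_speed.py would be the first thing to try.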
More complex grep
In the previous case we used a very simple regex. Now let's change it to a slightly more complex expression: we are not only looking for a single character, we also want to make sure it is between two identical characters.
examples/grep_speed_oxo.sh
filename=$1
limit=$2

for ((i=1;i<=$limit;i++)); do
    grep '\(.\)y\1' "$filename"
done
More complex Python
examples/grep_speed_oxo.py
import sys
import re

if len(sys.argv) != 3:
    exit(f"{sys.argv[0]} FILENAME LIMIT")

_, filename, limit = sys.argv

with open(filename) as fh:
    for line in fh:
        # Same loop as before, now with a back-reference in the regex.
        for _ in range(int(limit)):
            if re.search(r'(.)y\1', line):
                print(line)
You can try it yourself:
grep '\(.\)y\1' a.txt
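To see what this expression accepts, here is a small sketch in Python. The helper name `has_oxo` is made up for the example. The group `(.)` captures any one character and `\1` requires the very same character again after the "y".

```python
import re

# Python spelling of the grep pattern '\(.\)y\1'
OXO = re.compile(r"(.)y\1")


def has_oxo(text):
    """Return True if a "y" sits between two identical characters."""
    return OXO.search(text) is not None


if __name__ == "__main__":
    for text in ["xyx", "aya", "xyz", "xxxx", "x" * 48 + "yx\n"]:
        print(repr(text[:10]), has_oxo(text))
```

The line generated in a.txt ends in "xyx", so it matches both the simple and the more complex expression.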
Comparing the speed of the more complex examples
$ time bash examples/grep_speed_oxo.sh a.txt 20

real    0m0.196s
user    0m0.035s
sys     0m0.161s

$ time python examples/grep_speed_oxo.py a.txt 20

real    0m25.067s
user    0m24.972s
sys     0m0.016s
The speed of grep did not change, but Python became even slower. This time grep is more than 100 times faster than Python.
Version information
$ python -V
Python 3.8.2

$ grep -V
grep (GNU grep) 3.4
Other cases
The results are consistent with what I saw during my work, but I wonder what the results would be if the file were larger than the available memory of my computer.
Conclusion
grep is so much faster than the regex engine of Python that even reading the whole file several times does not matter.
Or I made a mistake somewhere that impacts the results.
Oh, and one more thing: I also created a Perl version of the code, and Perl is much faster than Python, even though it is also slower than the grep code.
Published on 2020-07-01