# Caching results to speed up processing in Python

If you need to calculate the same thing multiple times, it is often better to use in-memory caching to eliminate some of the processing. Sometimes the processing involves disk or network access that makes it slow. Sometimes it is "just" using the CPU.

This process is usually called memoization, and one can write a generic memoizer if that's needed. In this example we'll do the caching manually. We'll look at two implementations calculating the Levenshtein distance of strings.
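As a side note, such a generic memoizer can be written as a decorator. This is a minimal sketch under the assumption of hashable positional arguments; the standard library's `functools.lru_cache` provides the same idea out of the box:

```python
import functools

def memoize(func):
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        # The positional arguments serve as the cache key.
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper

# A hypothetical function just to demonstrate the decorator:
@memoize
def slow_square(n):
    return n * n

print(slow_square(4))  # computed
print(slow_square(4))  # served from the cache
```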

## Pylev

In the first example we use the pylev module. This is our script:

**examples/python/levenshtein_pylev.py**

```python
import pylev
import timeit
import sys

def non_cached(a, b):
    return pylev.levenshtein(a, b)

def cached(a, b):
    if 'data' not in cached.__dict__:
        cached.data = {}
    k = (a, b)
    if k not in cached.data:
        cached.data[k] = pylev.levenshtein(a, b)
    return cached.data[k]

if len(sys.argv) != 2:
    exit("Need 1 argument, number of iterations")
n = int(sys.argv[1])

a = 'fsffvfdsbbdfvvdavavavavavava'
b = 'fvdaabavvvvvadvdvavavadfsfsdafvvav'

print(timeit.timeit(lambda: non_cached(a, b), number=n))
print(timeit.timeit(lambda: cached(a, b), number=n))
```

The non-cached version is just a function wrapping the call to `pylev.levenshtein`.

The cached version uses the dictionary representing the wrapper function `cached` to store the cached results.

This snippet checks if we already have a key called 'data' in that dictionary, and creates one if there is no data yet. This only happens the first time we call the `cached` function.

```python
if 'data' not in cached.__dict__:
    cached.data = {}
```

Then we create a key: a tuple of the input parameters. In the case of the Levenshtein distance the order of the parameters does not matter, so we could theoretically save some memory by sorting the parameters first, but in the more generic case we might need to stick to the original order. So I'll do that here as well.

```python
k = (a, b)
```
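For completeness, if we did want to exploit the symmetry of the Levenshtein distance, the key could be normalized by sorting the pair first. This is a hypothetical variant (`make_key` is not part of the script above):

```python
def make_key(a, b):
    # Levenshtein distance is symmetric, so (a, b) and (b, a)
    # can share a single cache entry if we sort the pair.
    return tuple(sorted((a, b)))

print(make_key('world', 'hello'))  # ('hello', 'world')
print(make_key('hello', 'world'))  # ('hello', 'world') - same key
```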

Then, if the key is not in the caching dictionary yet, we call the real Levenshtein function and store the result in the cache.

```python
if k not in cached.data:
    cached.data[k] = pylev.levenshtein(a, b)
```

Finally we return the value from the cache, regardless of whether it was calculated now or in one of the earlier calls.

```python
return cached.data[k]
```

I picked two rather random strings to compare, and we use `timeit` to run the function several times and measure the elapsed time.

The actual number of repetitions is received from the command line. This allows us to run the file with various numbers like this: `python levenshtein_pylev.py 42`.

Here are the results for some numbers:

| N     | non_cached            | cached                 |
|-------|-----------------------|------------------------|
| 1     | 0.0005512761417776346 | 0.000554962083697319   |
| 10    | 0.009959910064935684  | 0.000820280984044075   |
| 100   | 0.06429028487764299   | 0.000595971941947937   |
| 1000  | 0.5230045560747385    | 0.0012550328392535448  |
| 10000 | 5.692702914122492     | 0.009364967932924628   |

## Editdistance

In the second example we use the `editdistance` module.

**examples/python/levenshtein_editdistance.py**

```python
import editdistance
import timeit
import sys

def non_cached(a, b):
    return editdistance.eval(a, b)

def cached(a, b):
    if 'data' not in cached.__dict__:
        cached.data = {}
    if (a, b) not in cached.data:
        cached.data[(a, b)] = editdistance.eval(a, b)
    return cached.data[(a, b)]

if len(sys.argv) != 2:
    exit("Need 1 argument, number of iterations")
n = int(sys.argv[1])

a = 'fsffvfdsbbdfvvdavavavavavava'
b = 'fvdaabavvvvvadvdvavavadfsfsdafvvav'

print(timeit.timeit(lambda: non_cached(a, b), number=n))
print(timeit.timeit(lambda: cached(a, b), number=n))
```
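As a footnote, since Python 3.2 the standard library offers `functools.lru_cache`, which implements exactly this caching pattern as a decorator. Here is a sketch with a trivial stand-in for the expensive computation (so the example does not depend on `editdistance` being installed); the counter shows that the function body runs only once per distinct input:

```python
import functools

call_count = 0

@functools.lru_cache(maxsize=None)
def distance(a, b):
    # Stand-in for an expensive call such as editdistance.eval(a, b).
    global call_count
    call_count += 1
    return abs(len(a) - len(b))

distance('hello', 'hi')
distance('hello', 'hi')
distance('hello', 'hi')
print(call_count)  # 1 - the other two calls were served from the cache
```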