Fix Python – Speed up millions of regex replacements in Python 3

Question

Asked By – pdanese

I have two lists:

  • a list of about 750K “sentences” (long strings)
  • a list of about 20K “words” that I would like to delete from my 750K sentences

So, I have to loop through 750K sentences and perform about 20K replacements, but ONLY if my words are actually “words” and are not part of a larger string of characters.

I am doing this by pre-compiling my words so that they are flanked by the \b word-boundary metacharacter:

compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]

Then I loop through my “sentences”:

import re

cleaned_sentences = []
for sentence in sentences:
    for word in compiled_words:
        sentence = re.sub(word, "", sentence)
    # put the cleaned sentence into a growing list
    cleaned_sentences.append(sentence)

This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.

  • Is there a way to use the str.replace method (which I believe is faster), while still requiring that replacements only happen at word boundaries?

  • Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping re.sub when my word is longer than the sentence, but it’s not much of an improvement.

I’m using Python 3.5.2



Answer

One thing you can try is to compile one single pattern like "\b(word1|word2|word3)\b".
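A minimal sketch of that combined-pattern idea, assuming my20000words and sentences are the lists from the question and the words are plain strings (hence re.escape):

import re

# Build one alternation pattern from all ~20K words; re.escape guards against
# any regex metacharacters inside the words.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, my20000words)) + r')\b')

# One substitution pass per sentence instead of ~20K separate re.sub calls.
cleaned_sentences = [pattern.sub("", sentence) for sentence in sentences]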

Because re relies on C code to do the actual matching, the savings can be dramatic.

As @pvg pointed out in the comments, it also benefits from single-pass matching.

If your words are plain strings rather than regex patterns, Eric’s answer is faster.
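A set-based sketch of that plain-string idea (an assumption about the approach, not a reproduction of Eric’s answer): set membership tests are O(1), so each sentence is handled in a single pass over its tokens.

banned_words = set(my20000words)

def strip_banned(sentence):
    # Note: splitting on whitespace normalizes spacing, and a token with
    # attached punctuation (e.g. "word,") will not match, unlike the
    # \b-based regex version.
    return " ".join(token for token in sentence.split() if token not in banned_words)

cleaned_sentences = [strip_banned(sentence) for sentence in sentences]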

Answered By – Liteye

This answer is collected from Stack Overflow and reviewed by FixPython community admins, and is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.