Asked By – pdanese
I have two lists:
- a list of about 750K “sentences” (long strings)
- a list of about 20K “words” that I would like to delete from my 750K sentences
So, I have to loop through 750K sentences and perform about 20K replacements, but ONLY if my words are actually “words” and are not part of a larger string of characters.
I am doing this by pre-compiling my words so that they are flanked by the `\b` word-boundary metacharacter:

```python
compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]
```
Then I loop through my “sentences”:
```python
import re

for sentence in sentences:
    for word in compiled_words:
        sentence = re.sub(word, "", sentence)
    # put sentence into a growing list
```
This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.
Is there a way to use the `str.replace` method (which I believe is faster), while still requiring that replacements happen only at word boundaries?
Alternatively, is there a way to speed up the `re.sub` method? I have already improved the speed marginally by skipping `re.sub` when the length of my word is greater than the length of my sentence, but it's not much of an improvement.
I’m using Python 3.5.2
Now we will see a solution for the issue: Speed up millions of regex replacements in Python 3
One thing you can try is to compile one single pattern like `r"\b(word1|word2|word3)\b"`. Because `re` relies on C code to do the actual matching, the savings can be dramatic.
As @pvg pointed out in the comments, it also benefits from single pass matching.
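A minimal sketch of the single-pattern approach, using a small hypothetical word list in place of the real 20K words. Each word is passed through `re.escape` (in case any contain regex metacharacters) and the words are sorted longest-first so that longer alternatives win when one word is a prefix of another:

```python
import re

# Hypothetical stand-in for the real 20K-word list.
words_to_remove = ["foo", "bar", "baz"]

# Escape each word and join them into a single alternation,
# sorted longest-first so longer words are preferred.
escaped = sorted((re.escape(w) for w in words_to_remove), key=len, reverse=True)
pattern = re.compile(r'\b(?:' + '|'.join(escaped) + r')\b')

# One compiled pattern, one pass per sentence.
sentences = ["the foo jumped", "barricade is not bar", "baz and foobar"]
cleaned = [pattern.sub("", s) for s in sentences]
```

Note that `\b` still prevents matches inside larger words, so "barricade" and "foobar" above are left untouched; only standalone occurrences of the listed words are removed.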
If your words are not regex, Eric’s answer is faster.