Question
Asked By – pdanese
I have two lists:
- a list of about 750K “sentences” (long strings)
- a list of about 20K “words” that I would like to delete from my 750K sentences
So, I have to loop through 750K sentences and perform about 20K replacements, but ONLY if my words are actually “words” and are not part of a larger string of characters.
I am doing this by pre-compiling my words so that they are flanked by the \b
word-boundary metacharacter:
compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]
Then I loop through my “sentences”:
import re

for sentence in sentences:
    for word in compiled_words:
        sentence = re.sub(word, "", sentence)
    # put sentence into a growing list
This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.
- Is there a way to use the str.replace method (which I believe is faster), but still require that replacements only happen at word boundaries?
- Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping re.sub when the length of my word is greater than the length of my sentence, but it's not much of an improvement.
I’m using Python 3.5.2
Now we will see a solution for the issue: Speed up millions of regex replacements in Python 3
Answer
One thing you can try is to compile one single pattern like "\b(word1|word2|word3)\b".

Because re relies on C code to do the actual matching, the savings can be dramatic.
As @pvg pointed out in the comments, it also benefits from single pass matching.
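For instance, here is a minimal sketch of that single-pattern approach, assuming the words are plain strings (so they are escaped with re.escape) and using small placeholder lists in place of the asker's 20K words and 750K sentences:

import re

# Placeholder data standing in for the real word and sentence lists.
words = ["hello", "world", "foo"]
sentences = ["hello there world", "foobar stays intact"]

# Escape each word in case it contains regex metacharacters, then join them
# into one alternation wrapped in \b word boundaries.
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b')

# One substitution pass per sentence instead of one per word.
cleaned = [pattern.sub("", s) for s in sentences]
print(cleaned)  # [' there ', 'foobar stays intact']

Note that "foobar" is left untouched because \bfoo\b only matches "foo" as a whole word.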
If your words are not regexes, Eric’s answer is faster.
This question is answered By – Liteye
This answer is collected from Stack Overflow and reviewed by FixPython community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.