Speed up millions of regex replacements in Python 3
I'm using Python 3.5.2
I have two lists
- a list of about 750,000 "sentences" (long strings)
- a list of about 20,000 "words" that I would like to delete from my 750,000 sentences
So, I have to loop through 750,000 sentences and perform about 20,000 replacements, but ONLY if my words are actually "words" and are not part of a larger string of characters.
I am doing this by pre-compiling my words so that they are flanked by the \b word-boundary metacharacter:

    compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]
Then I loop through my "sentences":

    import re

    for sentence in sentences:
        for word in compiled_words:
            sentence = re.sub(word, "", sentence)
        # put sentence into a growing list
This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.
Is there a way to use the str.replace method (which I believe is faster), while still requiring that replacements only happen at word boundaries?
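For reference, a tiny sketch (with made-up strings) of why plain str.replace is unsafe on its own: it matches substrings, not whole words.

```python
# str.replace ignores word boundaries, so it mangles larger words
# that happen to contain the target word.
sentence = "a barrier is not bar"
print(sentence.replace("bar", ""))  # -> "a rier is not " ("barrier" is mangled)
```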
Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping re.sub when the length of my word is greater than the length of my sentence, but it's not much of an improvement.
Thank you for any suggestions.
One thing you can try is to compile one single pattern like "\b(word1|word2|word3)\b". Because re relies on C code to do the actual matching, the savings can be dramatic.
As @pvg pointed out in the comments, it also benefits from single pass matching.
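A minimal sketch of that single-pattern approach, using placeholder word and sentence lists; re.escape is added here on the assumption that the words may contain regex metacharacters:

```python
import re

# Placeholder data standing in for the 20,000 words and 750,000 sentences.
words = ["foo", "bar", "baz"]
sentences = ["foo is here", "a barrier is not bar", "nothing to remove"]

# Join all words into ONE alternation, wrapped in word boundaries.
# re.escape protects against words containing regex metacharacters.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b')

# One pass over each sentence instead of 20,000 separate substitutions.
cleaned = [pattern.sub("", s) for s in sentences]
print(cleaned)  # "barrier" is left intact; only whole-word "bar" is removed
```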
If your words are not regexes, Eric's answer is faster.