Speed up millions of regex replacements in Python 3

I'm using Python 3.5.2

I have two lists

  • a list of about 750,000 "sentences" (long strings)
  • a list of about 20,000 "words" that I would like to delete from my 750,000 sentences

So, I have to loop through 750,000 sentences and perform about 20,000 replacements, but ONLY if my words are actually "words" and are not part of a larger string of characters.

I am doing this by pre-compiling my words so that they are flanked by the \b metacharacter

    import re

    compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]

Then I loop through my "sentences"

    import re

    cleaned_sentences = []
    for sentence in sentences:
        for word in compiled_words:
            # word is already compiled, so call its .sub method directly
            sentence = word.sub("", sentence)
        cleaned_sentences.append(sentence)

This nested loop processes about 50 sentences per second, which is workable, but it still takes several hours to get through all of my sentences.

  • Is there a way to use the str.replace method (which I believe is faster) while still requiring that replacements happen only at word boundaries?

  • Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping re.sub when my word is longer than my sentence, but it's not much of an improvement.

Thank you for any suggestions.

One thing you can try is to compile one single pattern like "\b(word1|word2|word3)\b".

Because re relies on C code to do the actual matching, the savings can be dramatic.

As @pvg pointed out in the comments, it also benefits from single-pass matching.
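A minimal sketch of that single-pattern approach (the sample word and sentence lists are placeholders; `re.escape` is included on the assumption that some words may contain regex metacharacters):

```python
import re

words = ["hello", "there", "world"]
sentences = ["hello there, world!", "say hello to the whole world"]

# Build one alternation pattern, escaping each word so metacharacters
# are matched literally, and anchoring the whole group at word boundaries.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b')

# One pass over each sentence instead of 20,000 separate re.sub calls.
cleaned = [pattern.sub("", s) for s in sentences]
```

Note that `\b` keeps "world" from matching inside a longer token like "worlds", which preserves the whole-word requirement from the question.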

If your words are plain strings rather than regexes, Eric's answer is faster.
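The non-regex idea can be sketched roughly like this: tokenize each sentence once and drop any token found in a set, which makes the per-word cost an O(1) hash lookup instead of a regex alternation. (The `\w+` tokenization and the function name are illustrative assumptions, not Eric's exact code.)

```python
import re

# Words to delete, stored in a set for O(1) membership tests.
banned = {"hello", "world"}

# Match runs of word characters; everything between them is left untouched.
token_re = re.compile(r'\w+')

def remove_words(sentence):
    # Replace each whole-word token with "" only if it is in the banned set.
    return token_re.sub(lambda m: "" if m.group(0) in banned else m.group(0),
                        sentence)
```

Because the boundary logic lives in the tokenizer, partial matches inside larger words (e.g. "worlds") are never candidates for deletion.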

From: stackoverflow.com/q/42742810
