Replace non-ASCII characters with a single space
I need to replace all non-ASCII (\x00-\x7F) characters with a space. I'm surprised that this is not dead-easy in Python, unless I'm missing something. The following function simply removes all non-ASCII characters:
def remove_non_ascii_1(text): return ''.join(i for i in text if ord(i)<128)
And this one replaces non-ASCII characters with the amount of spaces as per the amount of bytes in the character code point (i.e. the
– character is replaced with 3 spaces):
def remove_non_ascii_2(text): return re.sub(r'[^\x00-\x7F]',' ', text)
How can I replace all non-ASCII characters with a single space?
''.join() expression is filtering , removing anything non-ASCII; you could use a conditional expression instead:
return ''.join([i if ord(i) < 128 else ' ' for i in text])
This handles characters one by one and would still use one space per character replaced.
Your regular expression should just replace consecutive non-ASCII characters with a space:
re.sub(r'[^\x00-\x7F]+',' ', text)