UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)

I'm using NLTK to perform k-means clustering on a text file in which each line is treated as a document. For example, my text file looks something like this:

    belong finger death punch
    hasty
    mike hasty walls jericho
    jägermeister rules
    rules bands follow performing jägermeister stage
    approach

Now the demo code I'm trying to run is this: https://gist.github.com/xim/1279283

The error I receive is this:

    Traceback (most recent call last):
      File "cluster_example.py", line 40, in <module>
        words = get_words(job_titles)
      File "cluster_example.py", line 20, in get_words
        words.add(normalize_word(word))
      File "<string>", line 1, in <lambda>
      File "/usr/local/lib/python2.7/dist-packages/nltk/decorators.py", line 183, in memoize
        result = func(*args)
      File "cluster_example.py", line 14, in normalize_word
        return stemmer_func(word.lower())
      File "/usr/local/lib/python2.7/dist-packages/nltk/stem/snowball.py", line 694, in stem
        word = (word.replace(u"\u2019", u"\x27")
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)

What is happening here?

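The file is being read as byte strings (str), but the stemmer expects unicode. When the stemmer calls word.replace with unicode arguments on a str, Python 2 tries to decode the str implicitly with the ASCII codec, and that fails on the first non-ASCII byte (0xe2 here). Change: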

    job_titles = [line.strip() for line in title_file.readlines()]

to explicitly decode the strs to unicode (here assuming UTF-8):

    job_titles = [line.decode('utf-8').strip() for line in title_file.readlines()]
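To see why the implicit conversion fails, here is a minimal sketch that performs the same ASCII decode by hand. The sample bytes are an assumption made up for illustration (a UTF-8 encoded right single quote, U+2019, which starts with 0xe2); the character that actually trips up your file may be a different one.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: UTF-8 bytes for "it’s" (U+2019 encodes as e2 80 99).
raw = b"it\xe2\x80\x99s"

# Python 2 does this implicitly when a str is combined with unicode.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc  # the 0xe2 byte is outside the ASCII range

# Decoding explicitly with the file's real encoding succeeds.
text = raw.decode("utf-8")

# The same replacement the Snowball stemmer performs in the traceback.
fixed = text.replace(u"\u2019", u"\x27")
```

Once the line is unicode, the replace call that raised in the traceback runs without any implicit decoding.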

Alternatively, import the codecs module and open the file with codecs.open, passing the encoding, rather than the built-in open. The lines are then decoded to unicode as they are read.
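A rough sketch of the codecs.open variant, again assuming the file is UTF-8. The helper name read_titles is hypothetical and is not part of the gist:

```python
import codecs


def read_titles(path, encoding="utf-8"):
    """Return the stripped lines of the file at `path` as unicode strings."""
    # codecs.open decodes while reading, so no per-line .decode() is needed.
    with codecs.open(path, "r", encoding=encoding) as title_file:
        return [line.strip() for line in title_file]
```

You would then build the list with job_titles = read_titles(filename), where filename is whatever path the script currently passes to open. If the file is not actually UTF-8, pass its real encoding instead.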

From: stackoverflow.com/q/18649512