Saving UTF-8 text with json.dumps as UTF-8, not as \u escape sequences

sample code:

    >>> import json
    >>> json_string = json.dumps("ברי צקלה")
    >>> print json_string
    "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"

The problem: it's not human-readable. My (smart) users want to verify or even edit text files with JSON dumps. (And I'd rather not use XML.)

Is there a way to serialize objects into a UTF-8 JSON string (instead of \uXXXX escapes)?

This doesn't help:

    >>> print json_string.decode('string-escape')
    "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"

This works, but if any sub-object is a Python unicode string rather than UTF-8-encoded bytes, it'll dump garbage:

    >>> #### ok:
    >>> s = json.dumps("ברי צקלה", ensure_ascii=False)
    >>> print json.loads(s)
    ברי צקלה

    >>> #### NOT ok:
    >>> d = {1: "ברי צקלה", 2: u"ברי צקלה"}
    >>> print d
    {1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94', 
     2: u'\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94'}
    >>> s = json.dumps(d, ensure_ascii=False, encoding='utf8')
    >>> print json.loads(s)['1']
    ברי צקלה
    >>> print json.loads(s)['2']
    ××¨× ×¦×§××

Use the ensure_ascii=False switch to json.dumps(), then encode the value to UTF-8 manually:

    >>> json_string = json.dumps(u"ברי צקלה", ensure_ascii=False).encode('utf8')
    >>> json_string
    '"\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94"'
    >>> print json_string
    "ברי צקלה"

If you are writing this to a file, you can use io.open() instead of open() to produce a file object that encodes Unicode values for you as you write, then use json.dump() instead to write to that file:

    with io.open('filename', 'w', encoding='utf8') as json_file:
        json.dump(u"ברי צקלה", json_file, ensure_ascii=False)
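
Reading the file back works the same way; a minimal sketch, assuming the same placeholder filename:

    with io.open('filename', 'r', encoding='utf8') as json_file:
        data = json.load(json_file)  # file is decoded to unicode as it is read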

In Python 3, the built-in open() is an alias for io.open(). Do note that in Python 2 there is a bug in the json module where the ensure_ascii=False flag can produce a mix of unicode and str objects. The workaround for Python 2 then is:

    with io.open('filename', 'w', encoding='utf8') as json_file:
        data = json.dumps(u"ברי צקלה", ensure_ascii=False)
        # unicode(data) auto-decodes data to unicode if str
        json_file.write(unicode(data))
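
In Python 3, where this bug does not apply, the same write is simply (a sketch using the same placeholder filename):

    with open('filename', 'w', encoding='utf8') as json_file:
        json.dump("ברי צקלה", json_file, ensure_ascii=False)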

If you are passing in byte strings (type str in Python 2, bytes in Python 3) encoded to UTF-8, then in Python 2 make sure to also set the encoding keyword:

    >>> d = {1: "ברי צקלה", 2: u"ברי צקלה"}
    >>> d
    {1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94', 2: u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'}

    >>> s = json.dumps(d, ensure_ascii=False, encoding='utf8')
    >>> s
    u'{"1": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4", "2": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"}'
    >>> json.loads(s)['1']
    u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
    >>> json.loads(s)['2']
    u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
    >>> print json.loads(s)['1']
    ברי צקלה
    >>> print json.loads(s)['2']
    ברי צקלה
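
Note that the encoding keyword is Python 2 only; in Python 3, json.dumps() rejects that keyword and refuses byte strings outright, so decode any UTF-8 bytes to str yourself first. A minimal sketch:

    >>> # Python 3: decode UTF-8 bytes to str before dumping
    >>> raw = b'\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94'
    >>> print(json.dumps({1: raw.decode('utf8')}, ensure_ascii=False))
    {"1": "ברי צקלה"}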

Note that your second sample is not valid Unicode; you gave it UTF-8 bytes as a unicode literal, and that would never work:

    >>> s = u'\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94'
    >>> print s
    ××¨× ×¦×§××
    >>> print s.encode('latin1').decode('utf8')
    ברי צקלה

Only when I encode that string to Latin-1 (whose Unicode codepoints map one-to-one to bytes) and then decode the result as UTF-8 do you see the expected output. That has nothing to do with JSON and everything to do with the fact that you used the wrong input. The result is called a Mojibake.

If you got that Unicode value from a string literal, it was decoded using the wrong codec. It could be that your terminal is misconfigured, or that your text editor saved your source code using a different codec than the one you told Python to read the file with. Or you sourced it from a library that applied the wrong codec. None of this has anything to do with the JSON library.
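
To see how such a Mojibake arises, decode the same UTF-8 bytes once with the right codec and once with the wrong one (a Python 2 sketch; raw holds the UTF-8 encoding of the sample text):

    >>> raw = '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94'
    >>> print raw.decode('utf8')      # right codec
    ברי צקלה
    >>> print raw.decode('latin1')    # wrong codec, same bytes
    ××¨× ×¦×§××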

From: stackoverflow.com/q/18337407
