How do I compare Unicode strings that have different bytes but the same value?

I'm comparing Unicode strings between JSON objects.

They have the same value:

    a = '人口じんこうに膾炙かいしゃする'
    b = '人口じんこうに膾炙かいしゃする'

But they have different Unicode representations:

    String a : u'\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\u7099\u304b\u3044\u3057\u3083\u3059\u308b'
    String b : u'\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\uf9fb\u304b\u3044\u3057\u3083\u3059\u308b'

How can I compare two Unicode strings by their value?

Unicode normalization will get you there for this one:

    >>> import unicodedata
    >>> unicodedata.normalize("NFC", "\uf9fb") == "\u7099"
    True

Use unicodedata.normalize on both of your strings before comparing them with == to check for canonical Unicode equivalence.

Character U+F9FB is a "CJK Compatibility Ideograph". These characters have a canonical decomposition to a regular CJK ideograph (here U+7099), so normalization replaces them with that regular character.
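As a minimal sketch of applying this to the two strings from the question: the helper name canonically_equal is made up for illustration, and NFC is just one reasonable default.

```python
import unicodedata

def canonically_equal(s1, s2, form="NFC"):
    """Return True if s1 and s2 are equal after Unicode normalization.

    Hypothetical helper; `form` is passed straight to unicodedata.normalize.
    """
    return unicodedata.normalize(form, s1) == unicodedata.normalize(form, s2)

# The two strings from the question (Python 3 string literals).
a = "\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\u7099\u304b\u3044\u3057\u3083\u3059\u308b"
b = "\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\uf9fb\u304b\u3044\u3057\u3083\u3059\u308b"

print(a == b)                   # plain comparison sees different code points
print(canonically_equal(a, b))  # comparison after normalizing both sides
```

The plain == comparison fails because the ninth code point differs (U+7099 vs. U+F9FB); after normalizing both sides, the strings compare equal. If you also want to treat compatibility variants such as full-width letters as equal, passing form="NFKC" is an option, but that is a looser notion of equality than canonical equivalence.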

From: stackoverflow.com/q/49662585
