How do I compare a Unicode string that has different bytes, but the same value?
I'm comparing Unicode strings between JSON objects.
They have the same value:
a = '人口じんこうに膾炙かいしゃする'
b = '人口じんこうに膾炙かいしゃする'
But they have different Unicode representations:
String a : u'\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\u7099\u304b\u3044\u3057\u3083\u3059\u308b'
String b : u'\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\uf9fb\u304b\u3044\u3057\u3083\u3059\u308b'
How can I compare two Unicode strings by their value?
Unicode normalization will get you there for this one:
>>> import unicodedata
>>> unicodedata.normalize("NFC", "\uf9fb") == "\u7099"
True
Call unicodedata.normalize on both of your strings before comparing them with
== to check for canonical Unicode equivalence.
U+F9FB is a CJK compatibility ideograph. These characters decompose into regular CJK characters when normalized; U+F9FB maps to U+7099, which is the only character that differs between your two strings.
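Applied to the two strings from the question, the comparison might look like the sketch below. The helper names canonically_equal and normalize_json are made up for illustration and are not part of unicodedata. The second helper assumes the JSON has already been parsed into Python dicts, lists, and strings with json.loads.

```python
import json
import unicodedata


def canonically_equal(left, right, form="NFC"):
    """Return True if both strings are equal after Unicode normalization.

    Hypothetical helper: normalizes both sides to the same form, then uses ==.
    """
    return unicodedata.normalize(form, left) == unicodedata.normalize(form, right)


def normalize_json(value, form="NFC"):
    """Recursively normalize every string (keys and values) in parsed JSON.

    Hypothetical helper: non-string scalars (numbers, booleans, None) are
    returned unchanged.
    """
    if isinstance(value, str):
        return unicodedata.normalize(form, value)
    if isinstance(value, list):
        return [normalize_json(item, form) for item in value]
    if isinstance(value, dict):
        return {
            normalize_json(key, form): normalize_json(item, form)
            for key, item in value.items()
        }
    return value


# The two strings from the question, written with escapes so the
# difference (U+7099 vs. U+F9FB) stays visible.
a = "\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\u7099\u304b\u3044\u3057\u3083\u3059\u308b"
b = "\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\uf9fb\u304b\u3044\u3057\u3083\u3059\u308b"

print(a == b)                   # False: the code points differ
print(canonically_equal(a, b))  # True: equal after NFC normalization

# Comparing whole JSON documents after normalizing them.
doc_a = json.loads('{"word": "\\u7099"}')
doc_b = json.loads('{"word": "\\uf9fb"}')
print(normalize_json(doc_a) == normalize_json(doc_b))  # True
```

Any of the four normalization forms would work for this particular character, as long as the same form is used on both sides. NFC is used here only because it matches the example above.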