Saving UTF-8 texts with json.dumps as UTF-8, not as a \u escape sequence

Question

Sample code (in a REPL):

import json
json_string = json.dumps("ברי צקלה")
print(json_string)

Output:

"\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"

The problem: it's not human readable. My (smart) users want to verify or even edit text files with JSON dumps (and I’d rather not use XML).

Is there a way to serialize objects into UTF-8 JSON strings (instead of \uXXXX)?

This is often used as a duplicate for questions where the proper answer is really "JSON is a format for information interchange, not for human consumption; forcing it to use UTF-8 will not work in all cases, and you should probably not try". — tripleee, Mar 10 '23 at 06:23
You've got Hebrew there. Specifically for RTL languages like Hebrew, this is a bad idea, because of how Unicode handles bidirectional text. What do you think are the keys and associated values in `{"a": "ב", "בב": "b"}`? It's not what it looks like - `"a"` is a key with value `"ב"`, and `"בב"` is a key with value `"b"`. How about ``{"a ב": "בב"}``? There, the key is `"a ב"` and the value is `"בב"`. People are *not* going to reliably get this right. — user2357112, Aug 04 '23 at 21:06
@user2357112 this is a matter of text viewer / editor. some editors work well with it, some not. The software doesn't care. So it's irrelevant; There are even worse cases of usability e.g. empty space characters, or control characters. All valid JSON dict keys, and ... well, that's not the issue. — Berry Tsakala, Aug 05 '23 at 13:30
"There are even worse cases of usability e.g. empty space characters, or control characters." - normal ASCII spaces are fine. If someone's got control characters and weird non-ASCII spaces, those are just more reasons for them not to do what you're looking for, but you posted Hebrew, so I focused on the problems that specifically arise with Hebrew. — user2357112, Aug 05 '23 at 17:26

Martijn Pieters · Accepted Answer · 2019-09-08T09:01:38.663

Use the ensure_ascii=False switch to json.dumps(), then encode the value to UTF-8 manually:

>>> json_string = json.dumps("ברי צקלה", ensure_ascii=False).encode('utf8')
>>> json_string
b'"\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94"'
>>> print(json_string.decode())
"ברי צקלה"
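To see what the switch changes, here is a minimal Python 3 sketch (the variable names are only for illustration) that compares the default output with the `ensure_ascii=False` output. Both are ordinary strings and both parse back to the same value; only the spelling of the non-ASCII characters differs.

```python
import json

text = "ברי צקלה"

escaped = json.dumps(text)                       # default, ensure_ascii=True
readable = json.dumps(text, ensure_ascii=False)  # non-ASCII characters kept as-is

# Both results are valid JSON for the same value.
same_value = json.loads(escaped) == json.loads(readable) == text

# Only the escaped form is restricted to ASCII characters.
escaped_is_ascii = all(ord(ch) < 128 for ch in escaped)
readable_is_ascii = all(ord(ch) < 128 for ch in readable)
```

So the choice only affects how the text looks to a human reader, not what a JSON parser gets out of it.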

If you are writing to a file, just use json.dump() and leave it to the file object to encode:

with open('filename', 'w', encoding='utf8') as json_file:
    json.dump("ברי צקלה", json_file, ensure_ascii=False)
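If you want to confirm what actually lands on disk, something like the following sketch should do. It assumes Python 3 and writes to a throwaway temporary directory; the file name `sample.json` is made up for the check.

```python
import json
import os
import tempfile

text = "ברי צקלה"

# Write to a throwaway file (hypothetical path, created only for this check).
path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, "w", encoding="utf8") as json_file:
    json.dump(text, json_file, ensure_ascii=False)

# Read the raw bytes back: the Hebrew letters are stored as UTF-8 bytes.
with open(path, "rb") as raw_file:
    raw = raw_file.read()

# Parsing the file again returns the original string.
with open(path, encoding="utf8") as json_file:
    round_tripped = json.load(json_file)
```

The raw bytes contain no `\u` sequences, and loading the file gives back the original text.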

Caveats for Python 2

For Python 2, there are some more caveats to take into account. If you are writing this to a file, you can use io.open() instead of open() to produce a file object that encodes Unicode values for you as you write, then use json.dump() instead to write to that file:

with io.open('filename', 'w', encoding='utf8') as json_file:
    json.dump(u"ברי צקלה", json_file, ensure_ascii=False)

Do note that there is a bug in the json module where the ensure_ascii=False flag can produce a mix of unicode and str objects. The workaround for Python 2 then is:

with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(u"ברי צקלה", ensure_ascii=False)
    # unicode(data) auto-decodes data to unicode if str
    json_file.write(unicode(data))

In Python 2, when using byte strings (type str), encoded to UTF-8, make sure to also set the encoding keyword:

>>> d={ 1: "ברי צקלה", 2: u"ברי צקלה" }
>>> d
{1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94', 2: u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'}

>>> s=json.dumps(d, ensure_ascii=False, encoding='utf8')
>>> s
u'{"1": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4", "2": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"}'
>>> json.loads(s)['1']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> json.loads(s)['2']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> print json.loads(s)['1']
ברי צקלה
>>> print json.loads(s)['2']
ברי צקלה

The roundtrip `encode`/`decode` doesn't seem to be necessary. Just setting `ensure_ascii=False` (as per [this answer](https://stackoverflow.com/a/40585572/2025495)) seems to be enough. — AdamAL, Jan 11 '21 at 19:49
@AdamAL please read my answer more thoroughly: there is no round trip in this answer, apart from a decode call that’s only there to demonstrate that the bytes value indeed contains UTF-8 encoded data. The second code snippet in my answer writes directly to a file, only setting `ensure_ascii=False`. Note: I strongly recommend against using the `codecs.open()` function; the library predates `io` and the stream implementations have a lot of unresolved issues. — Martijn Pieters, Jan 12 '21 at 23:59
@AdamAL In 3.x, setting `ensure_ascii=False` is sufficient because *the result from `.dump` or `.dumps` is already a string*. Any encoding task is handled already by the encoding setting for the destination file (if applicable). — Karl Knechtel, Aug 04 '22 at 21:19

score 144 · Answer 2 · edited May 30 '20 at 23:28

To write to a file

import codecs
import json

with codecs.open('your_file.txt', 'w', encoding='utf-8') as f:
    json.dump({"message":"xin chào việt nam"}, f, ensure_ascii=False)

To print to stdout

import json
print(json.dumps({"message":"xin chào việt nam"}, ensure_ascii=False))

answered Nov 14 '16 at 09:35 by Hiep Tran · edited May 30 '20 at 23:28 by Tansc

SyntaxError: Non-ASCII character '\xc3' in file json-utf8.py on line 5, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details – Alex May 17 '17 at 07:08
Thank you! I didn't realize it was that simple. You only need to be careful if the data you are converting to json is untrusted user input. – Karim Sonbol Jun 29 '18 at 09:48
@Alex That's https://stackoverflow.com/questions/10589620/syntaxerror-non-ascii-character-xa3-in-file-when-function-returns-%c2%a3 – tripleee May 16 '21 at 07:55

score 31 · Answer 3 · edited Aug 05 '22 at 17:44

This is the wrong answer, but it's still useful to understand why it's wrong. See comments.

Use unicode-escape:

>>> d = {1: "ברי צקלה", 2: u"ברי צקלה"}
>>> json_str = json.dumps(d).decode('unicode-escape').encode('utf8')
>>> print json_str
{"1": "ברי צקלה", "2": "ברי צקלה"}

answered Sep 27 '14 at 19:41 by monitorius · edited Aug 05 '22 at 17:44 by Peter Mortensen

`unicode-escape` is not necessary: you could use `json.dumps(d, ensure_ascii=False).encode('utf8')` instead. And it is not guaranteed that json uses *exactly the same* rules as `unicode-escape` codec in Python in *all* cases i.e., the result might or might not be the same in some corner case. The downvote is for an unnecessary and possibly wrong conversion. Unrelated: `print json_str` works only for utf8 locales or if `PYTHONIOENCODING` envvar specifies utf8 here (print Unicode instead). – jfs May 11 '15 at 08:09
Another issue: any double quotes in string values will lose their escaping, so this'll result in *broken JSON output*. – Martijn Pieters Jun 06 '15 at 23:55
error in Python3 :AttributeError: 'str' object has no attribute 'decode' – Gank Apr 18 '16 at 13:59
unicode-escape works fine! I would accept this answer as correct one. – Worker May 11 '16 at 11:33
@jfs No, `json.dumps(d, ensure_ascii=False).encode('utf8')` is not working, for me at least. I'm getting `UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position ...`-error. The `unicode-escape` variant works fine however. – turingtested Nov 27 '18 at 10:09
@turingtested the error is likely in your other code. It is hard to say without a minimal complete code example that reproduces the issue. – jfs Nov 27 '18 at 11:42
Thanks for your answer, even though it's wrong in OP's case, it definitely pointed me in the right direction for serializing JSON for consumption by Postgres' COPY FROM STDIN command (this was driving me nuts !!) – bluu Jul 01 '21 at 12:41
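The broken-output problem that Martijn Pieters mentions in the comments can be reproduced with a short Python 3 sketch. The sample string is made up; it just needs to contain a double quote.

```python
import json

value = 'say "שלום"'  # hypothetical sample containing double quotes

dumped = json.dumps(value)  # quotes are escaped as \" and Hebrew as \uXXXX
broken = dumped.encode("ascii").decode("unicode-escape")

# unicode-escape also turns \" into a bare ", so the string ends early
# and the result is no longer valid JSON.
try:
    json.loads(broken)
    parses = True
except json.JSONDecodeError:
    parses = False

# ensure_ascii=False leaves the quote escaping alone.
safe = json.dumps(value, ensure_ascii=False)
```

Here `broken` fails to parse, while `safe` loads back to the original value.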

score 30 · Answer 4 · edited Aug 05 '22 at 17:47

Pieters' Python 2 workaround fails on an edge case:

d = {u'keyword': u'bad credit  \xe7redit cards'}
with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(d, ensure_ascii=False).decode('utf8')
    try:
        json_file.write(data)
    except TypeError:
        # Decode data to Unicode first
        json_file.write(data.decode('utf8'))

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 25: ordinal not in range(128)

It was crashing on the .decode('utf8') part of line 3. I fixed the problem by making the program much simpler by avoiding that step as well as the special casing of ASCII:

with io.open('filename', 'w', encoding='utf8') as json_file:
  data = json.dumps(d, ensure_ascii=False, encoding='utf8')
  json_file.write(unicode(data))

cat filename
{"keyword": "bad credit  çredit cards"}

answered Jan 19 '15 at 20:14 by Jonathan Ray · edited Aug 05 '22 at 17:47 by Peter Mortensen

The 'edge case' was simply a dumb untested error on my part. Your `unicode(data)` approach is the better option rather than using exception handling. Note that the `encoding='utf8'` keyword argument has nothing to do with the output that `json.dumps()` produces; it is used for decoding *`str` input* the function receives. – Martijn Pieters Jan 27 '15 at 07:42
@MartijnPieters: or simpler: `open('filename', 'wb').write(json.dumps(d, ensure_ascii=False).encode('utf8'))` It works whether `dumps` returns (ascii-only) str or unicode object. – jfs Feb 07 '15 at 17:43
@J.F.Sebastian: right, because `str.encode('utf8')` *decodes* implicitly first. But so does `unicode(data)`, if given a `str` object. :-) Using `io.open()` gives you more options though, including using a codec that writes a BOM and you are following the JSON data with something else. – Martijn Pieters Feb 07 '15 at 17:46
@MartijnPieters: `.encode('utf8')`-based variant works on both Python 2 and 3 (the same code). There is no `unicode` on Python 3. Unrelated: json files should not use BOM (though a conforming json parser may ignore BOM, see [errata 3983](http://www.rfc-editor.org/errata_search.php?rfc=7159)). – jfs May 11 '15 at 07:55
adding `encoding='utf8'` to `json.dumps` solves the problem. P.S. I have a cyrillic text to dump – Max L Feb 07 '16 at 18:30

score 30 · Answer 5 · answered Jan 20 '19 at 13:56

As of Python 3.7 the following code works fine:

from json import dumps
result = {"symbol": "ƒ"}
json_string = dumps(result, sort_keys=True, indent=2, ensure_ascii=False)
print(json_string)

Output:

{
  "symbol": "ƒ"
}

answered Jan 20 '19 at 13:56 by Nik

also in python 3.6 (just verified). – Berry Tsakala Feb 13 '19 at 17:20
worked well without further complexity.. – Zaman Jan 18 '23 at 10:06

score 22 · Answer 6 · edited Aug 08 '22 at 19:36

Thanks for the original answer here. With Python 3, the following line of code worked for me:

print(json.dumps(result_dict, ensure_ascii=False))

Try not to write more text in the code than is necessary.

This might be good enough for the Python console. However, to satisfy a server, you might need to set the locale as explained in "Setting LANG and LC_ALL when using mod_wsgi" (if it is on Apache 2).

Basically, install he_IL or whatever language locale you need on Ubuntu. First check whether it is already installed:

locale -a

Install it, where XX is your language:

sudo apt-get install language-pack-XX

For example:

sudo apt-get install language-pack-he

Add the following text to /etc/apache2/envvars:

export LANG='he_IL.UTF-8'
export LC_ALL='he_IL.UTF-8'

Then you would hopefully not get Python errors from Apache like:

print (js) UnicodeEncodeError: 'ascii' codec can't encode characters in position 41-45: ordinal not in range(128)

Also in Apache, try to make UTF-8 the default encoding, as explained in "How to change the default encoding to UTF-8 for Apache".

Do it early, because Apache errors can be a pain to debug, and you can mistakenly think an error comes from Python when it does not.
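If changing the server's locale is not an option, one possible workaround in Python 3 is to bypass the locale-dependent text layer and write UTF-8 bytes yourself. This is only a sketch: `print_json_utf8` is a hypothetical helper name, not part of the `json` module, and it assumes the target stream accepts bytes.

```python
import json
import sys


def print_json_utf8(obj, stream=None):
    """Write obj as UTF-8 encoded JSON bytes followed by a newline.

    Hypothetical helper: when no stream is given it falls back to
    sys.stdout.buffer, the binary layer underneath sys.stdout.
    """
    if stream is None:
        stream = sys.stdout.buffer
    payload = json.dumps(obj, ensure_ascii=False).encode("utf-8") + b"\n"
    stream.write(payload)
    return payload
```

Calling `print_json_utf8(result_dict)` then sends the same bytes regardless of what `LANG` or `LC_ALL` say, because no implicit encoding step is involved.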

score 9 · Answer 7 · edited Aug 08 '22 at 20:10

Use unicode-escape to solve the problem

>>> import json
>>> json_string = json.dumps("ברי צקלה")
>>> json_string.encode('ascii').decode('unicode-escape')
'"ברי צקלה"'

Explanation

>>> s = '漢  χαν  хан'
>>> print('Unicode: ' + s.encode('unicode-escape').decode('utf-8'))
Unicode: \u6f22  \u03c7\u03b1\u03bd  \u0445\u0430\u043d
>>> u = s.encode('unicode-escape').decode('utf-8')
>>> print('Original: ' + u.encode("utf-8").decode('unicode-escape'))
Original: 漢  χαν  хан

Original source: "Python 3: using unicode-escape to handle encoding and decoding of Unicode hexadecimal strings"
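One case where this decode step and `ensure_ascii=False` give different results is text outside the Basic Multilingual Plane. The Python 3 sketch below uses an emoji purely as an illustration; JSON escapes it as a surrogate pair, which `unicode-escape` does not put back together.

```python
import json

emoji = "\U0001F600"  # a single code point outside the BMP

escaped = json.dumps(emoji)  # JSON spells it as a surrogate pair
decoded = escaped.encode("ascii").decode("unicode-escape")

# unicode-escape keeps the two surrogates as separate code points,
# so the result is not the original text and cannot be encoded as UTF-8.
try:
    decoded.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# ensure_ascii=False keeps the character as it is.
readable = json.dumps(emoji, ensure_ascii=False)
```

With `ensure_ascii=False` the output encodes to UTF-8 without trouble and loads back to the original character.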

answered Feb 23 '20 at 09:58 by ChrisXiao · edited Aug 08 '22 at 20:10 by Peter Mortensen

That's silly; use the library's built-in feature `ensure_ascii=False` instead of rolling your own. (But understand that saving JSON as bare UTF-8 can introduce interoperability problems, especially on Windows.) – tripleee May 16 '21 at 07:59
@tripleee it's not silly - it's the only solution that gives the exact result which is a similar encoding to file write with utf-8. – Mitzi Mar 24 '22 at 11:29
thanks :) it helped me to convert a response i received with text to convert it to german readable text – Thomas Krickl May 29 '23 at 21:01

Cheney · Answer 8 · 2017-01-09T07:06:06.093

The following is my understanding after reading the answers above and searching Google.

# coding:utf-8
r"""
@update: 2017-01-09 14:44:39
@explain: str, unicode, bytes in python2to3
    #python2 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 7: ordinal not in range(128)
    #1.reload
    #importlib,sys
    #importlib.reload(sys)
    #sys.setdefaultencoding('utf-8') #python3 don't have this attribute.
    #not suggest even in python2 #see:http://stackoverflow.com/questions/3828723/why-should-we-not-use-sys-setdefaultencodingutf-8-in-a-py-script
    #2.overwrite /usr/lib/python2.7/sitecustomize.py or (sitecustomize.py and PYTHONPATH=".:$PYTHONPATH" python)
    #too complex
    #3.control by your own (best)
    #==> all string must be unicode like python3 (u'xx'|b'xx'.encode('utf-8')) (unicode 's disappeared in python3)
    #see: http://blog.ernest.me/post/python-setdefaultencoding-unicode-bytes

    #how to Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence
    #http://stackoverflow.com/questions/18337407/saving-utf-8-texts-in-json-dumps-as-utf8-not-as-u-escape-sequence
"""

from __future__ import print_function
import json

a = {"b": u"中文"}  # add u for python2 compatibility
print('%r' % a)
print('%r' % json.dumps(a))
print('%r' % (json.dumps(a).encode('utf8')))
a = {"b": u"中文"}
print('%r' % json.dumps(a, ensure_ascii=False))
print('%r' % (json.dumps(a, ensure_ascii=False).encode('utf8')))
# print(a.encode('utf8')) #AttributeError: 'dict' object has no attribute 'encode'
print('')

# python2:bytes=str; python3:bytes
b = a['b'].encode('utf-8')
print('%r' % b)
print('%r' % b.decode("utf-8"))
print('')

# python2:unicode; python3:str=unicode
c = b.decode('utf-8')
print('%r' % c)
print('%r' % c.encode('utf-8'))
"""
#python2
{'b': u'\u4e2d\u6587'}
'{"b": "\\u4e2d\\u6587"}'
'{"b": "\\u4e2d\\u6587"}'
u'{"b": "\u4e2d\u6587"}'
'{"b": "\xe4\xb8\xad\xe6\x96\x87"}'

'\xe4\xb8\xad\xe6\x96\x87'
u'\u4e2d\u6587'

u'\u4e2d\u6587'
'\xe4\xb8\xad\xe6\x96\x87'

#python3
{'b': '中文'}
'{"b": "\\u4e2d\\u6587"}'
b'{"b": "\\u4e2d\\u6587"}'
'{"b": "中文"}'
b'{"b": "\xe4\xb8\xad\xe6\x96\x87"}'

b'\xe4\xb8\xad\xe6\x96\x87'
'中文'

'中文'
b'\xe4\xb8\xad\xe6\x96\x87'
"""

The first sentence is incomprehensible. Can you [fix it](https://stackoverflow.com/posts/41521794/edit)? — Peter Mortensen, Aug 08 '22 at 19:22

score 7 · Answer 9 · edited Aug 08 '22 at 19:28

If you are loading a JSON string from a file and the file contains Arabic text, then this will work.

Assume a file like arabic.json

{
  "key1": "لمستخدمين",
  "key2": "إضافة مستخدم"
}

Get the Arabic contents from the arabic.json file

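import json

with open('arabic.json', encoding='utf-8') as f:
    # Deserialise the file content into Python objects
    json_data = json.load(f)
    f.close()  # redundant: the with block already closes the file

# JSON formatted string that keeps the Arabic characters unescaped
json_data2 = json.dumps(json_data, ensure_ascii=False)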

To use the JSON data in a Django template, follow the steps below:

# To access the JSON data by key in a Django template, decode the JSON string back into Python objects.

json.JSONDecoder().decode(json_data2)

Done! Now the values can be looked up by key, with the Arabic text intact.

answered Jul 30 '19 at 11:07 by Chandan Sharma · edited Aug 08 '22 at 19:28 by Peter Mortensen

`fh.close()` `fh` is undefined. – AMC Feb 21 '20 at 20:12
It's corrected now. It should be `f.close()`. – Chandan Sharma Feb 26 '20 at 09:49
You are using a context manager (`with open....`) so the `close` is unnecessary anyway. Python silently accepts it for now, but it's really an error, which could be exposed in a future version. – tripleee Mar 10 '23 at 06:21

score 6 · Answer 10 · answered Aug 26 '16 at 09:56

Here's my solution using json.dump(). It is Python 2 code: json.dump() in Python 3 does not accept an encoding keyword.

import codecs
import json

def jsonWrite(p, pyobj, ensure_ascii=False, encoding=SYSTEM_ENCODING, **kwargs):
    with codecs.open(p, 'wb', 'utf_8') as fileobj:
        json.dump(pyobj, fileobj, ensure_ascii=ensure_ascii, encoding=encoding, **kwargs)

where SYSTEM_ENCODING is set to:

import locale

locale.setlocale(locale.LC_ALL, '')
SYSTEM_ENCODING = locale.getlocale()[1]

score 5 · Answer 11 · answered Aug 14 '18 at 06:58

Use codecs if possible:

import codecs
import json

with codecs.open('file_path', 'a+', 'utf-8') as fp:
    fp.write(json.dumps(res, ensure_ascii=False))
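Note that appending several JSON documents back to back gives a file that is not itself a single valid JSON document. If the intent of the append mode is to accumulate records, one common convention is one JSON document per line. The Python 3 sketch below assumes that convention and uses the built-in `open()` rather than `codecs.open()`; the helper names and the file name are hypothetical.

```python
import json
import os
import tempfile


def append_json_line(path, record):
    """Append one record as a single line of UTF-8 JSON (hypothetical helper)."""
    with open(path, "a", encoding="utf-8") as fp:
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_json_lines(path):
    """Parse every non-empty line of the file as its own JSON document."""
    with open(path, encoding="utf-8") as fp:
        return [json.loads(line) for line in fp if line.strip()]


# Throwaway file used only to demonstrate the round trip.
log_path = os.path.join(tempfile.mkdtemp(), "records.jsonl")
append_json_line(log_path, {"message": "xin chào việt nam"})
append_json_line(log_path, {"message": "ברי צקלה"})
records = read_json_lines(log_path)
```

Because `json.dumps()` without `indent` never emits a raw newline, each record stays on its own line and stays human readable.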

answered Aug 14 '18 at 06:58 by Yulin GUO

What is "codecs"? A Python package? What import statement is required for it? Something else? What is the advantage of using it? Why did you choose it? For instance, you could link to documentation for it. Please respond by [editing (changing) your answer](https://stackoverflow.com/posts/51835456/edit), not here in comments (***without*** "Edit:", "Update:", or similar - the answer should appear as if it was written today). – Peter Mortensen Aug 08 '22 at 19:23

score -3 · Answer 12 · edited Aug 05 '22 at 17:43

Using ensure_ascii=False in json.dumps is the right direction to solve this problem, as pointed out by Martijn. However, this may raise an exception:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 1: ordinal not in range(128)

You need extra settings in either site.py or sitecustomize.py to set your sys.getdefaultencoding() correctly. site.py is under lib/python2.7/ and sitecustomize.py is under lib/python2.7/site-packages.

If you want to use site.py, under def setencoding(): change the first if 0: to if 1: so that Python will use your operating system's locale.

If you prefer to use sitecustomize.py, which may not exist if you haven't created it, simply add these lines:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Then you can do some Chinese JSON output in UTF-8 format, such as:

name = {"last_name": u"王"}
json.dumps(name, ensure_ascii=False)

You will get a UTF-8 encoded string, rather than a \u escaped JSON string.

To verify your default encoding:

print sys.getdefaultencoding()

You should get "utf-8" or "UTF-8", which confirms your site.py or sitecustomize.py settings.

Please note that you cannot call sys.setdefaultencoding("utf-8") at an interactive Python console.

no. Don't do it. Modifying default character encoding has nothing to do with `json`'s `ensure_ascii=False`. Provide a minimal complete code example if you think otherwise. — jfs, Jan 05 '14 at 02:49
You only get this exception if you either feed in non-ASCII *byte strings* (e.g. not Unicode values) or try to combine the resulting JSON value (a Unicode string) with a non-ASCII byte string. Setting the default encoding to UTF-8 is essentially masking an underlying problem were you are not managing your string data properly. — Martijn Pieters, May 15 '14 at 00:09
