Questions tagged [character-encoding]

Character encoding refers to the way characters are represented as a series of bytes. Character encoding for the Web is defined in the Encoding Standard.

Character encoding is the act or result of representing characters (human-readable text/symbols such as a or 汉) as a series of bytes (computer-readable zeroes and ones).

Briefly, just as changing the font from Arial to Wingdings changes what your text looks like, changing the encoding changes how a sequence of bytes is interpreted. For example, depending on the encoding, the bytes¹ 0xE2 0x89 0xA0 could represent the text â‰ (followed by a non-breaking space) in Windows code page 1252, or Б┴═ in KOI8-R, or the character ≠ in UTF-8.
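The effect is easy to reproduce. The following is a minimal Python sketch that decodes the same three bytes under each of the encodings mentioned; the codec names (utf-8, cp1252, koi8_r) are the ones Python's standard library uses, and the variable names are only illustrative.

```python
# The same three bytes, decoded under three different encodings.
data = bytes([0xE2, 0x89, 0xA0])

decoded = {
    "utf-8": data.decode("utf-8"),
    "cp1252": data.decode("cp1252"),
    "koi8_r": data.decode("koi8_r"),
}

for name, text in decoded.items():
    # repr() makes invisible characters (such as U+00A0) visible
    print(f"{name:8} {text!r}")
```

The bytes never change; only the table used to map them to characters does.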

A useful reference is Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

The Encoding Standard at https://encoding.spec.whatwg.org/ defines character encoding for the Web. It mandates the use of UTF-8 on the Web, and defines other encodings as legacy/obsolete.

If the file you are looking at does not contain text, it does not encode any characters, so character encoding is not meaningful or well-defined for it. A common beginner problem is trying to read a binary file as text and being surprised by a character encoding error. The fix in this situation is to read the file in binary mode instead. Many office document, audio, video, and image formats, as well as many proprietary file formats, are binary.
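As a rough illustration in Python, the sketch below writes the 8-byte PNG signature to a temporary file (a stand-in for a real binary file, chosen only for this example), shows that reading it as UTF-8 text fails, and then reads it successfully in binary mode.

```python
import os
import tempfile

# Hypothetical sample: the 8-byte PNG signature standing in for a real binary file.
png_signature = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

with tempfile.NamedTemporaryFile(delete=False, suffix=".png") as tmp:
    tmp.write(png_signature)
    path = tmp.name

# Text mode tries to decode the bytes and fails on the very first one.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    text_mode_failed = False
except UnicodeDecodeError:
    text_mode_failed = True

# Binary mode ("rb") returns the raw bytes without any decoding.
with open(path, "rb") as f:
    raw = f.read()

os.remove(path)
print(text_mode_failed, raw)
```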

How Can I Fix the Encoding?

If you are a beginner who just needs to fix an acute problem with a text file, see if your text editor provides an option to save the file in a different encoding. Alternatively, if you know the current encoding and the one you want to convert to, try a tool like iconv or GNU recode. Understand that not all encodings can accommodate all characters; for example, Windows code page 1252 cannot store text which contains Chinese or Russian characters, emoji, etc.
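If you would rather script the conversion, the same idea can be sketched in Python. The helper name transcode and the file names below are hypothetical, and the sketch assumes you already know the source encoding (here ISO-8859-1); it also shows what happens when the target encoding cannot represent a character.

```python
import os
import shutil
import tempfile

def transcode(src_path, dst_path, src_encoding, dst_encoding):
    """Rewrite a text file in another encoding (hypothetical helper)."""
    # newline="" keeps line endings exactly as they are in the source
    with open(src_path, encoding=src_encoding, newline="") as src:
        text = src.read()
    # Raises UnicodeEncodeError if dst_encoding cannot represent the text
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "latin1.txt")
dst = os.path.join(workdir, "utf8.txt")

# Create a small ISO-8859-1 sample file to convert.
with open(src, "wb") as f:
    f.write("café\n".encode("iso-8859-1"))

transcode(src, dst, "iso-8859-1", "utf-8")

with open(dst, "rb") as f:
    converted = f.read()

# Code page 1252 has no slot for Chinese characters, so encoding fails.
try:
    "汉".encode("cp1252")
    cp1252_can_store_chinese = True
except UnicodeEncodeError:
    cp1252_can_store_chinese = False

shutil.rmtree(workdir)
print(converted, cp1252_can_store_chinese)
```

On the command line, the equivalent conversion with iconv would typically look like iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt.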

Which Character Encoding is This?

Questions asking for help identifying or manipulating text in a particular encoding are frequent, but often problematic. Please include enough information to help us help you.

Bad: "I look at the text and I see óòÒöô, what is this?"

Good: "I have text in an unknown encoding in a file. I cannot view this text in UTF-8, but when I set my system to use ISO-8859-1, I see óòÒöô. I know this isn't right; the text is supposed to be <text> in <language>. Here is a hex dump of the beginning of the file:"

    000000 9e 9f 9a a0 af b4 be f0  9e af b3 f2 20 b7 5f 20

Bad: Anything which tries to use the term "ANSI" in this context²

Legacy Microsoft Windows documentation misleadingly uses "ANSI" to refer to whichever character set is the default for the current locale. But this is a moving target; now, we have to guess your current locale, too.

Better: Specify the precise code page

Commonly on Western Windows installations, you will be using CP-1252; but of course, if you have to guess, you need to say so, too.
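If you are unsure which encoding your system treats as the default, you can ask your programming language instead of guessing. This is a minimal Python sketch, assuming a standard CPython install; the result reflects Python's view of the current locale and may simply be UTF-8 on many systems or when Python's UTF-8 mode is enabled.

```python
import codecs
import locale

# Encoding Python would pick by default for text in the current locale.
# On Western Windows installations this is often cp1252.
preferred = locale.getpreferredencoding(False)

# Normalize to Python's canonical codec name (e.g. "UTF-8" -> "utf-8")
canonical = codecs.lookup(preferred).name

print(preferred, canonical)
```

Quoting this value in a question is more useful than saying "ANSI".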

Notice:

We cannot guess which encoding you are using to look at the mystery data. Please include this information if you are genuinely trying to tell us what you see.
A copy/paste is rarely sufficient, because this introduces several additional variables (we will need to correctly guess about your web browser's handling of the text, too, and the web server's, and the tool you used to obtain a copy of the text, and so forth).
If you know what the text is supposed to represent (even vaguely) this can help narrow down the problem significantly.
A hex dump is the only unambiguous representation, but please don't overdo it -- a few lines of sample data should usually suffice.
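If you do not have a hex dump tool at hand, a few lines of Python can produce one. The function name hex_dump and its parameters are hypothetical; the sample bytes are the ones from the example dump above.

```python
def hex_dump(data, width=16, max_lines=4):
    """Format the first few lines of data as offset + hex bytes (hypothetical helper)."""
    lines = []
    for offset in range(0, min(len(data), width * max_lines), width):
        chunk = data[offset:offset + width]
        lines.append(f"{offset:06x} " + " ".join(f"{b:02x}" for b in chunk))
    return "\n".join(lines)

# Sample bytes; for a real file, read them with open(path, "rb").read(64)
sample = bytes.fromhex("9e9f9aa0afb4bef09eafb3f220b75f20")
dump = hex_dump(sample)
print(dump)
```

The max_lines limit keeps the output to the few lines of sample data that usually suffice.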

Common Questions

¹ When talking about encoding, hex representations are often used since they are more concise -- 0xE2 is the hex representation of the byte 11100010.

² The American National Standards Institute has standardized some character sets (notably ASCII; ANSI standard ANSI X3.4-1986) and text display formatting codes, but certainly not the Microsoft Windows code pages or the mechanism for how one of them is selected.

How do I get a consistent byte representation of strings in C# without manually specifying an encoding?

How do I convert a string to a byte[] in .NET (C#) without manually specifying a specific encoding? I'm going to encrypt the string. I can encrypt it without converting, but I'd still like to know why encoding comes to play here. Also, why should…

c# .net string character-encoding

asked Jan 23 '09 at 13:39

Agnel Kurian

1434

votes

5 answers

Best way to convert string to bytes in Python 3?

TypeError: 'str' does not support the buffer interface suggests two possible methods to convert a string to bytes: b = bytes(mystring, 'utf-8') b = mystring.encode('utf-8') Which method is more Pythonic? See Convert bytes to a string for the…

python string character-encoding python-3.x

asked Sep 28 '11 at 15:14

Mark Ransom

1059

votes

22 answers

What's the difference between UTF-8 and UTF-8 with BOM?

What's different between UTF-8 and UTF-8 with BOM? Which is better?

unicode utf-8 character-encoding byte-order-mark

asked Feb 08 '10 at 18:26

simple

786

votes

17 answers

MySQL: Get character-set of database or table or column?

What is the (default) charset for: MySQL database MySQL table MySQL column

sql mysql unicode character-encoding collation

asked Jun 26 '09 at 15:22

Amandasaurus

724

votes

18 answers

What is the difference between UTF-8 and Unicode?

I have heard conflicting opinions from people - according to the Wikipedia UTF-8 page. They are the same thing, aren't they? Can someone clarify?

unicode encoding utf-8 character-encoding terminology

asked Mar 13 '09 at 17:06

sarsnake

532

votes

20 answers

How to convert an entire MySQL database characterset and collation to UTF-8?

How can I convert entire MySQL database character-set to UTF-8 and collation to UTF-8?

mysql character-encoding

asked May 24 '11 at 19:12

Dean

506

votes

5 answers

What is the difference between utf8mb4 and utf8 charsets in MySQL?

What is the difference between utf8mb4 and utf8 charsets in MySQL? I already know about ASCII, UTF-8, UTF-16 and UTF-32 encodings; but I'm curious to know what's the difference of utf8mb4 group of encodings with other encoding types defined in MySQL…

mysql encoding utf-8 character-encoding utf8mb4

asked May 06 '15 at 10:45

Mojtaba Rezaeian

497

votes

8 answers

What is the difference between UTF-8 and ISO-8859-1?

utf-8 character-encoding iso-8859-1

asked Aug 13 '11 at 05:21

Jagadesh

460

votes

2 answers

Working with UTF-8 encoding in Python source

Consider: $ cat bla.py u = unicode('d…') s = u.encode('utf-8') print s $ python bla.py File "bla.py", line 1 SyntaxError: Non-ASCII character '\xe2' in file bla.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html…

python encoding utf-8 character-encoding

asked Jun 09 '11 at 07:29

Nullpoet

431

votes

13 answers

Why do we use Base64?

Wikipedia says Base64 encoding schemes are commonly used when there is a need to encode binary data that needs be stored and transferred over media that are designed to deal with textual data. This is to ensure that the data remains intact without…

algorithm character-encoding binary ascii base64

asked Aug 21 '10 at 15:21

Lazer

421

votes

7 answers

No line-break after a hyphen

I'm looking to prevent a line break after a hyphen - on a case-by-case basis that is compatible with all browsers. Example: I have this text: 3-3/8" which in HTML is this: 3-3/8” The problem is that near the end of a line, because of the…

html css character-encoding line-breaks hyphenation

asked Oct 07 '11 at 18:46

Sparky

408

votes

18 answers

Setting the default Java character encoding

How do I properly set the default character encoding used by the JVM (1.5.x) programmatically? I have read that -Dfile.encoding=whatever used to be the way to go for older JVMs. I don't have that luxury for reasons I won't get into. I have…

java utf-8 character-encoding

asked Dec 12 '08 at 05:31

Scott T

394

votes

2 answers

Unicode, UTF, ASCII, ANSI format differences

What is the difference between the Unicode, UTF8, UTF7, UTF16, UTF32, ASCII, and ANSI encodings? In what way are these helpful for programmers?

unicode character-encoding ascii ansi utf

asked Mar 31 '09 at 06:02

web dunia

378

votes

7 answers

What does "Content-type: application/json; charset=utf-8" really mean?

When I make a POST request with a JSON body to my REST service I include Content-type: application/json; charset=utf-8 in the message header. Without this header, I get an error from the service. I can also successfully use Content-type:…

character-encoding mime-types

asked Feb 13 '12 at 02:37

DenaliHardtail

365

votes

20 answers

"for line in..." results in UnicodeDecodeError: 'utf-8' codec can't decode byte

Here is my code, for line in open('u.item'): # Read each line Whenever I run this code it gives the following error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2892: invalid continuation byte I tried to solve this and…

python python-3.x character-encoding

asked Oct 31 '13 at 05:55

SujitS
