Questions tagged [unicode]

Unicode is a standard for encoding, representing and handling text. It is intended to support the characters of all writing systems, along with technical symbols and punctuation.


Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
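As a minimal sketch of the code point idea, Python's built-in `ord` and `chr` convert between a character and its code point number. The helper name `code_point_label` is made up for this illustration and is not part of any library.

```python
def code_point_label(ch):
    """Return the conventional U+XXXX label for a single character."""
    # ord() gives the code point as an integer; Unicode labels use
    # at least four uppercase hexadecimal digits.
    return f"U+{ord(ch):04X}"

# The characters from the list above, written as escapes so the
# source file's own encoding does not matter.
for ch in ("A", "B", "C", "\u039b", "\u039c"):
    print(code_point_label(ch), ch)

# chr() goes the other way, from code point to character.
capital_mu = chr(0x039C)
```

Note that U+039C (Greek capital mu) looks like the Latin letter M (U+004D) in most fonts, but the two are different code points.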

Unicode Transformation Formats

Unicode Transformation Formats (UTFs) describe how code points are encoded as bytes. The most common are UTF-8, which encodes each code point as one, two, three or four bytes, and UTF-16, which encodes each code point as two or four bytes.

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
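The rows above can be reproduced with Python's built-in codecs. This is only a sketch: the helper names `hex_bytes` and `encode_row` are hypothetical, and U+1F600 is added here as an assumption-free extra case to show the four-byte forms that the table does not cover.

```python
def hex_bytes(data):
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

def encode_row(code_point):
    """Return (label, UTF-8 bytes, UTF-16 big-endian bytes) for one code point."""
    ch = chr(code_point)
    return (
        f"U+{code_point:04X}",
        hex_bytes(ch.encode("utf-8")),
        # "utf-16-be" writes big-endian code units without a byte order mark.
        hex_bytes(ch.encode("utf-16-be")),
    )

for cp in (0x0041, 0x0042, 0x0043, 0x039B, 0x039C, 0x1F600):
    label, utf8, utf16 = encode_row(cp)
    print(f"{label:<10}{utf8:<16}{utf16}")
```

Code points above U+FFFF, such as U+1F600, take four bytes in both formats; in UTF-16 they are stored as a surrogate pair of two 16-bit units.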


The Unicode Consortium also defines algorithms for sorting (collation), rules for capitalization (case mapping), character normalization and other locale-sensitive character operations.
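Two of these operations, normalization and case handling, are available in Python's standard library, so a small sketch is possible. The example strings are chosen for illustration only. Locale-aware sorting is not shown because the standard library does not implement the Unicode Collation Algorithm.

```python
import unicodedata

# The same visible character "é" can be stored in two ways.
composed = "\u00e9"      # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points: "e" + COMBINING ACUTE ACCENT

# Without normalization the two strings compare as different.
raw_equal = composed == decomposed

# NFC composes where possible; NFD decomposes where possible.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# Case folding is meant for caseless comparison, and case mappings
# can change a string's length: German sharp s becomes "ss".
folded = "Stra\u00dfe".casefold()
uppered = "\u00df".upper()

print(raw_equal, nfc == composed, nfd == decomposed, folded, uppered)
```

Normalizing both sides to the same form before comparing strings avoids mismatches between visually identical text.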

For more general information, see the Unicode article on Wikipedia.

24916 questions
34 answers

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

I'm having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup. The problem is that the error is not always reproducible; it sometimes works with some pages, and…
Homunculus Reticulli
8 answers

Why is executing Java code in comments with certain Unicode characters allowed?

The following code produces the output "Hello World!" (no really, try it). public static void main(String... args) { // The comment below is not a typo. // \u000d System.out.println("Hello World!"); } The reason for this is that the Java…
20 answers

What characters can be used for up/down triangle (arrow without stem) for display in HTML?

I'm looking for a HTML or ASCII character which is a triangle pointing up or down so that I can use it as a toggle switch. I found ↑ (&uarr;), and ↓ (&darr;) - but those have a narrow stem. I'm looking just for the HTML arrow "head".
12 answers

What does the 'b' character do in front of a string literal?

Apparently, the following is the valid syntax: b'The string' I would like to know: What does this b character in front of the string mean? What are the effects of using it? What are appropriate situations to use it? I found a related question…
Jesse Webb
9 answers

What's the difference between utf8_general_ci and utf8_unicode_ci?

Between utf8_general_ci and utf8_unicode_ci, are there any differences in terms of performance?
KahWee Teng
22 answers

What's the difference between UTF-8 and UTF-8 with BOM?

What's different between UTF-8 and UTF-8 with BOM? Which is better?
14 answers

UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>

I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error: Traceback (most recent call last): File "SCRIPT LOCATION", line NUMBER, in…
Eden Crow
13 answers

std::wstring VS std::string

I am not able to understand the differences between std::string and std::wstring. I know wstring supports wide characters such as Unicode characters. I have got the following questions: When should I use std::wstring over std::string? Can…
12 answers

Saving UTF-8 texts with json.dumps as UTF-8, not as a \u escape sequence

Sample code (in a REPL): import json json_string = json.dumps("ברי צקלה") print(json_string) Output: "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4" The problem: it's not human readable. My (smart) users want to verify or even edit text files with…
Berry Tsakala
7 answers

What exactly do "u" and "r" string prefixes do, and what are raw string literals?

While asking this question, I realized I didn't know much about raw strings. For somebody claiming to be a Django trainer, this sucks. I know what an encoding is, and I know what u'' alone does since I get what is Unicode. But what does r'' do…
Bite code
17 answers

MySQL: Get character-set of database or table or column?

What is the (default) charset for: MySQL database MySQL table MySQL column
14 answers

What is the best way to remove accents (normalize) in a Python unicode string?

I have a Unicode string in Python, and I would like to remove all the accents (diacritics). I found on the web an elegant way to do this (in Java): convert the Unicode string to its long normalized form (with a separate character for letters and…
2 answers

How does Zalgo text work?

I've seen weirdly formatted text called Zalgo like below written on various forums. It's kind of annoying to look at, but it really bothers me because it undermines my notion of what a character is supposed to be. My understanding is that a…
18 answers

What is the difference between UTF-8 and Unicode?

I have heard conflicting opinions from people - according to the Wikipedia UTF-8 page. They are the same thing, aren't they? Can someone clarify?
25 answers

UnicodeDecodeError when reading CSV file in Pandas

I'm running a program which is processing 30,000 similar files. A random number of them are stopping and producing this error... File "C:\Importer\src\dfman\", line 26, in import_chr data = pd.read_csv(filepath, names=fields) File…