Querying non-ASCII rows from Postgres

Question

Does [:ascii:] class work in Postgres at all? It is not listed in their help, however I see examples in the web which utilize it.

I have a UTF-8 database, where collation and c_type are en_US.UTF-8, and Postgres version is 9.6.2. When I search for non-ASCII rows like this:

select title from wallabag_entry where title ~ '[^[:ascii:]]';

I get both Unicode and non-Unicode symbols (full output is here):

Сталинская правозащитница: мать Меленкова бабушка Настя
Дневник НКВДиста Шабалина: Знает ли Москва положение на фронте?
Бег по городу и поездка на осле: как в средневековье наказывали прелюбодеев
Как комиссар Крекшин в 1740-е чуть не отменил историю России
Have you heard of Saint Death? Don’t pray to her.
Архаїчна українська мова: перевага чи недолік?
Гренада не их
China’s marriage rate is plummeting because women are choosing autonomy over

What is wrong with this query?

Is it possible that you're getting sentences with Unicode non-breakable spaces? (or any other character that hides in plain view, for that matter) — joanolo, Mar 19 '17 at 18:09
You can use a `regexp_replace()` to mark your non-ASCII chars. See my answer. — joanolo, Mar 19 '17 at 18:31
You should always paste the *exact* result in dba.se. We can't test a graphic for non-ascii characters. we can test the actual result set. This is a poster child for *shouldn't be a graphic* — Evan Carroll, Mar 19 '17 at 18:34
Just out of curiosity why do you care if the characters are ASCII either way? — Evan Carroll, Mar 19 '17 at 20:49
Just to add my two cents: while the answer of joanolo is spectacular, it didn't help me to solve this concrete problem. Except right quotes my dataset have a bunch of other confusing characters (alike spaces, ", «) which makes it impossible to use `[:ascii:]` class anyway. What really helped me in this problem is a concept of unicode blocks, which I learned from [this fabulous regex tutorial](http://www.regular-expressions.info/unicode.html). — Suncatcher, Mar 19 '17 at 21:04
`[\u0400-\u04FF]` is a pattern which perfectly separates Cyrillic from non-Cyrillic characters — Suncatcher, Mar 19 '17 at 21:05

joanolo · Accepted Answer · 2019-09-10T20:39:57.477

To answer your question: [:ascii:] works. You may have some characters in your text that you do not recognize as non-ASCII, yet they're there. They can be something like a non-breakable space, for instance, or any other Unicode space character.

It is not strange to have non-breakable spaces ( ) in texts that you copy-and-paste from a web page, yet you don't notice they're there.

Here is an example to show:

WITH t(t) AS
(
    VALUES 
      ( 'Сталинская правозащитница: мать Меленкова бабушка Настя' ),
      ( 'Дневник НКВДиста Шабалина: Знает ли Москва положение на фронте?' ),
      ( 'Бег по городу и поездка на осле: как в средневековье наказывали прелюбодеев' ),
      ( 'Как комиссар Крекшин в 1740-е чуть не отменил историю России' ),
      ( 'Have you heard of Saint Death? Don’t pray to her.' ),
      ( 'Архаїчна українська мова: перевага чи недолік?' ),
      ( 'Гренада не их' ),
      ( 'China’s marriage rate is plummeting because women are choosing autonomy over ' )

)
SELECT 
    t,  regexp_replace(t, '([^[:ascii:]])', '[\1]', 'g') AS t_marked
FROM 
    t 
WHERE 
    t ~ '[^[:ascii:]]' ;

That's what you get:

                                       t                                       |                                                                                                 t_marked                                                                                                  
-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Сталинская правозащитница: мать Меленкова бабушка Настя                       | [С][т][а][л][и][н][с][к][а][я] [п][р][а][в][о][з][а][щ][и][т][н][и][ц][а]: [м][а][т][ь] [М][е][л][е][н][к][о][в][а] [б][а][б][у][ш][к][а] [Н][а][с][т][я]
 Дневник НКВДиста Шабалина: Знает ли Москва положение на фронте?               | [Д][н][е][в][н][и][к] [Н][К][В][Д][и][с][т][а] [Ш][а][б][а][л][и][н][а]: [З][н][а][е][т] [л][и] [М][о][с][к][в][а] [п][о][л][о][ж][е][н][и][е] [н][а] [ф][р][о][н][т][е]?
 Бег по городу и поездка на осле: как в средневековье наказывали прелюбодеев   | [Б][е][г] [п][о] [г][о][р][о][д][у] [и] [п][о][е][з][д][к][а] [н][а] [о][с][л][е]: [к][а][к] [в] [с][р][е][д][н][е][в][е][к][о][в][ь][е] [н][а][к][а][з][ы][в][а][л][и] [п][р][е][л][ю][б][о][д][е][е][в]
 Как комиссар Крекшин в 1740-е чуть не отменил историю России                  | [К][а][к] [к][о][м][и][с][с][а][р] [К][р][е][к][ш][и][н] [в] 1740-[е] [ч][у][т][ь] [н][е] [о][т][м][е][н][и][л] [и][с][т][о][р][и][ю] [Р][о][с][с][и][и]
 Have you heard of Saint Death? Don’t pray to her.                             | Have you heard of Saint Death? Don[’]t pray to her.
 Архаїчна українська мова: перевага чи недолік?                                | [А][р][х][а][ї][ч][н][а] [у][к][р][а][ї][н][с][ь][к][а] [м][о][в][а]: [п][е][р][е][в][а][г][а] [ч][и] [н][е][д][о][л][і][к]?
 Гренада не их                                                                 | [Г][р][е][н][а][д][а] [н][е] [и][х]
 China’s marriage rate is plummeting because women are choosing autonomy over  | China[’]s marriage rate is plummeting because women are choosing autonomy over

You can see from this, that your problem is the right-apostrophe character. ASCII only supports the apostrophe. Left-apostrophe and right-apostrophe are typographically correct Unicode extensions.

dbfiddle here

You can check it also with previous versions at http://rextester.com/UKIQ48014 (PostgreSQL 9.5) and http://sqlfiddle.com/#!15/4c563/1/0 (PostgreSQL 9.3)

The texts that I guess you think are pure ASCII, and are not:

 WITH t(t) AS
 (
     VALUES 
       ('A fully ASCII text!'),
       ('Have you heard of Saint Death? Don’t pray to her.'),
       ('China’s marriage rate is plummeting because women are choosing autonomy over ')
 )
 SELECT 
    regexp_replace(t, '([^[:ascii:]])', '[\1]', 'g') AS t_marked
 FROM 
    t 
 WHERE 
    t ~ '[^[:ascii:]]' ;

 | t_marked                                                                        |
 | :------------------------------------------------------------------------------ |
 | Have you heard of Saint Death? Don[’]t pray to her.                             |
 | China[’]s marriage rate is plummeting because women are choosing autonomy over  |

dbfiddle here

These texts are using ’ instead of ' to mark apostrophes.

Check Punctuation: Why is the right single quote (U+2019), and not the semantically distinct apostrophe (U+0027), the preferred apostrophe character in Unicode? ... to see that you're not the first person encountering this problem.

This is a really fantastic answer because it shows you the non-ascii characters. This is how I would have answered this question. — Evan Carroll, Mar 19 '17 at 18:33

Querying non-ASCII rows from Postgres

1 Answers1