On 1/18/20 3:21 AM, Rocky Ji wrote:
Hi,

I am asked to highlight rows containing strange characters. All data were
ingested by a proprietary crawler.

By strange, I mean, question marks, boxes, little Christmas Trees,  solid
arrows, etc. kind of symbols; these appear suddenly in flow of normal ASCII
English letters.

How do I approach this?

Thanks.

The first thing you will need to determine is how the data has actually been stored. The strange characters you describe sound like data that was obtained in one character encoding but then interpreted as another. This is a very old problem, with no general solution except to always know the encoding of your data. It goes back to the development of the ASCII character set, which is the base for most character sets in use today (another major branch is EBCDIC, but that has its own issues, and you tend to know when you need to deal with it).

The A in ASCII stands for American: the ASCII character set was designed for American (i.e. English) data, and when it was developed, memory and bandwidth were limited and expensive, so you didn't waste them. ASCII was a compact set, using only 7 bits per character, and was fine for English text using the basic alphabet. It became the base standard because America was a core part of early computer development and had a lot of influence on the initial standards.

While it worked well for American data, it didn't work so well for many other countries, so most other countries adopted their own character sets, normally using ASCII as a base but adding codes to extend it to an 8-bit encoding, keeping (at least most of) ASCII as the first 128 values.

To exchange data between machines, you needed to indicate which character set the data was in (or you would see some funny characters due to the mismatch).

Operating systems adopted the concept of code pages, which basically defined which of the many standard character sets was to be used, and some transfer formats actually included in their header information which character set the data that followed was in. One example is web pages on the Internet.

Later, to try to get out of this mess, a new character encoding called Unicode was invented. Unicode was initially intended to provide a single universal encoding to which any of the many standard encodings could be converted. It was first thought that this could be done with a 16-bit character set, but it later had to be enlarged as they found out how many different characters there really were. While memory isn't quite as precious as it was when ASCII was designed, it isn't so abundant that we can just multiply the size of our data, so a compact encoding called UTF-8 was developed, which represents the ASCII characters exactly as the ASCII character set does, and all the extra characters with multiple bytes.
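A small sketch in plain Python (no third-party libraries) shows this property: ASCII characters come out of UTF-8 as their familiar single bytes, while characters outside ASCII take multiple bytes (Latin-1 stands in here for a one-byte legacy encoding):

```python
# ASCII characters are unchanged under UTF-8.
print("A".encode("utf-8"))     # b'A' -- the same single byte as ASCII
# Characters outside ASCII become multi-byte sequences in UTF-8...
print("é".encode("utf-8"))     # b'\xc3\xa9' -- two bytes
# ...but were a single byte in legacy 8-bit encodings like Latin-1.
print("é".encode("latin-1"))   # b'\xe9' -- one byte
```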

Because it made files with the extra characters longer, was somewhat complicated to work with, and most people only worked with documents that could all be encoded in a single character set, its adoption was slow. It is now becoming the norm, but many things are still in the old legacy encodings.

If you try to interpret a file that is in one of the legacy encodings as Unicode UTF-8, then (if it uses extended characters) it will almost certainly produce decoding errors (UTF-8 was intentionally designed with redundancy in the encoding to ease processing, so many 'random' byte sequences are invalid). If you interpret a file that is in UTF-8 as a legacy encoding, you will tend to get some strange out-of-place extended characters.
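Both failure modes are easy to reproduce in Python, using Latin-1 as a stand-in for "some legacy encoding":

```python
# Legacy bytes decoded as UTF-8: the built-in redundancy catches it.
latin1_bytes = "café".encode("latin-1")       # b'caf\xe9'
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid UTF-8:", err)

# UTF-8 bytes decoded as a legacy encoding: no error, just mojibake.
utf8_bytes = "café".encode("utf-8")           # b'caf\xc3\xa9'
print(utf8_bytes.decode("latin-1"))           # 'cafÃ©' -- strange extended characters
```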

My first guess is that your proprietary crawler either didn't properly detect and handle the page encoding (ideally it would have converted the data to UTF-8, but it might also have saved the data in the original encoding and recorded what that encoding was), or your read-out program isn't detecting the character set the data was stored in and processing it correctly. I believe SQLite assumes that TEXT data is UTF-8 encoded, but other encodings can be declared, or data can be stored as BLOBs.
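You can check both points from Python's sqlite3 module; this sketch assumes nothing about your schema, just a throwaway in-memory database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A new SQLite database uses UTF-8 as its text encoding by default.
print(conn.execute("PRAGMA encoding").fetchone()[0])   # 'UTF-8'

# Python's sqlite3 stores bytes as BLOB and str as (UTF-8) TEXT.
conn.execute("CREATE TABLE t (v)")
conn.execute("INSERT INTO t VALUES (?)", (b"raw page bytes",))
conn.execute("INSERT INTO t VALUES (?)", ("decoded text",))
for (typ,) in conn.execute("SELECT typeof(v) FROM t ORDER BY rowid"):
    print(typ)   # 'blob', then 'text'
```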

What likely should happen is that someone (maybe you) reads out samples of the funny data as a BLOB and figures out how the data is actually encoded, ideally comparing it to the original page crawled. Once you know what the problem was, you can perhaps work on fixing it and on detecting the records with problems.
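One way to pull the raw bytes back out is SQLite's CAST to BLOB, which (per SQLite's CAST rules) hands you the stored bytes untouched so you can try candidate encodings against them. The table and column names here are made up for illustration, and the bad row is simulated:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (body TEXT)")   # hypothetical table
# Simulate a mis-stored row: Latin-1 bytes went in where UTF-8 text was
# expected (CAST keeps the raw bytes, merely retagged as TEXT).
conn.execute("INSERT INTO pages VALUES (CAST(? AS TEXT))",
             ("café".encode("latin-1"),))

# Reading the column back as a BLOB exposes the stored bytes so their
# real encoding can be inspected or guessed.
raw = conn.execute("SELECT CAST(body AS BLOB) FROM pages").fetchone()[0]
print(raw)                    # b'caf\xe9' -- not valid UTF-8
print(raw.decode("latin-1"))  # 'café' once the right codec is known
```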

One possible issue is that some conversion routines take characters they don't know how to handle and replace them with the ASCII question mark, and if that is what has been stored in the database, it may be very hard to distinguish from an actual question mark in the data.
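The contrast is visible with Python's codec error handlers: the 'replace' handler on decoding marks damage with the distinctive U+FFFD replacement character, but a converter that substitutes a plain '?' (as encoding to ASCII with 'replace' does) leaves nothing to tell the damage apart from a literal question mark:

```python
# Decoding with errors='replace' marks bad bytes with U+FFFD, which is
# still distinguishable from real text afterwards.
bad = b"caf\xe9".decode("utf-8", errors="replace")
print(bad)      # 'caf\ufffd' -- the damage is still visible

# Encoding with errors='replace' substitutes an ASCII '?', which is not.
lossy = "café?".encode("ascii", errors="replace")
print(lossy)    # b'caf??' -- which '?' was real?
```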

--
Richard Damon

_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users