Re: Having strange result on processing UTF-8 file
On Sat, 2021-12-18 at 20:40 -0500, Tom Horsley wrote:
> Just a (possibly) relevant note: I've seen many html pages with
> headers claiming they are UTF-8, but text that only displays
> correctly if you treat them as one of the windows code pages.
>
> Worse yet, some browsers have heuristics to detect this and display
> them "correctly", so the creators usually never notice.

Supplementary: Some webpages use a mixture of different encodings, because the author has just dragged crap in from various sources without even considering character encoding.

-- 
uname -rsvp
Linux 3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 30 15:51:32 UTC 2021 x86_64

Boilerplate: All unexpected mail to my mailbox is automatically deleted. I will only get to see the messages that are posted to the mailing list.

___
users mailing list -- users@lists.fedoraproject.org
To unsubscribe send an email to users-le...@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org
Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure
Re: Having strange result on processing UTF-8 file
On 19/12/2021 09:50, Michael D. Setzer II wrote:
> %10.10s and %20.20s both would cause the problem.

I believe those are both printf format specifiers. Which is why I was wondering if converting to plain text would be better, because those would be removed (dealt with) during the conversion.

-- 
Did 황준호 die?
Re: Having strange result on processing UTF-8 file
On 19 Dec 2021 at 9:14, Ed Greshko wrote:

From: Ed Greshko
Date sent: Sun, 19 Dec 2021 09:14:37 +0800
Subject: Re: Having strange result on processing UTF-8 file
To: "Michael D. Setzer II", Community support for Fedora users
Send reply to: Community support for Fedora users

> On 19/12/2021 08:31, Michael D. Setzer II wrote:
> > But could change if they add more or remove some
> > currently 633 records. Some lines in the file are over
> > 25000 characters?? Total download is about 13M.
> > The actual lines I need for the data are just 256K, so it
> > has lots of junk (stuff I don't need for what I'm doing).
>
> That 13M file. Does it contain html? If so, would it be easier
> to work with if it was converted to plain text?

Yes, they are all html pages, but some of the UTF-8 characters don't map to a plain text character, and they are in the name field.

Did figure out the issue. %10.10s and %20.20s both would cause the problem. I used the head command to pull various numbers of lines until I found where the file went to Non-ISO extended-ASCII. Only a few lines caused the issue, and in each one the last byte of the printed substring was above 127, so the substring ended in the middle of a UTF-8 sequence. So I added these commands to copy 30 bytes from that point, then work back from the end and change the last byte to null while it is above 127:

    strncpy(linex, &line[i], 30);   /* bounded copy; linex must hold at least 31 bytes */
    linex[30] = 0;
    /* cast to unsigned char so the test also works where char is signed */
    while (linex[0] != 0 && (unsigned char)linex[strlen(linex) - 1] > 127)
        linex[strlen(linex) - 1] = 0;

Then used %s and just printed linex.

218544 lines in allraw.uog
1898 lines in allraw.uog.out (lines with utf-8)

The uog.csv has 633 lines but only these 3 have utf-8:

13127 c3b1 [ña, Ph.D.;Crisostomo-Muña;Do]
13151 c3b1 [ña;Doreen;Professor of Accoun]
27614 c3a5 [åni" Isidro;Isidro;Jaevani;Ju]
34418 c381 [Álvarez-Piñer, Ph.D.;Madrid ]
34429 c3b1 [ñer, Ph.D.;Madrid Álvarez-Pi]
34448 c381 [Álvarez-Piñer;Carlos;Directo]
34459 c3b1 [ñer;Carlos;Director / Associa]

Whole web page has a lot of other utf-8 characters.

Thanks again.

> --
> Did 황준호 die?
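The workaround in this message strips every trailing byte above 127, so it also drops a complete character that happens to sit at the end of the 30 bytes. A minimal sketch of an alternative, assuming the input is valid UTF-8: cut at the byte limit, but step back while the byte at the cut point is a continuation byte (bit pattern 10xxxxxx). The name utf8_truncate is made up for this example and is not part of the original program.

```c
#include <stddef.h>
#include <string.h>

/* Shorten s in place to at most max_bytes bytes without splitting a
 * multi-byte UTF-8 sequence. Assumes s holds valid UTF-8.
 * Returns the new length in bytes. (Hypothetical helper.) */
size_t utf8_truncate(char *s, size_t max_bytes)
{
    size_t len = strlen(s);
    size_t cut = max_bytes;

    if (len <= max_bytes)
        return len;                 /* already short enough */

    /* If the first byte to be dropped is a continuation byte, the cut
     * would land inside a sequence, so move back to its lead byte. */
    while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80)
        cut--;

    s[cut] = '\0';
    return cut;
}
```

With this, the copy could become something like utf8_truncate(linex, 30) followed by printing linex with a plain %s. A precision such as %10.10s counts bytes, not characters, which is why it can end in the middle of a sequence.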
Re: Having strange result on processing UTF-8 file
On Sun, 19 Dec 2021 09:14:37 +0800 Ed Greshko wrote:
> Does it contain html?

Just a (possibly) relevant note: I've seen many html pages with headers claiming they are UTF-8, but text that only displays correctly if you treat them as one of the windows code pages.

Worse yet, some browsers have heuristics to detect this and display them "correctly", so the creators usually never notice.
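One way to spot a page that claims UTF-8 but is really in a Windows code page is to check whether its bytes form valid UTF-8 at all. The following is only a sketch; the function name is_valid_utf8 is an assumption for illustration. Passing the check does not prove the author meant UTF-8, but accented text in a Windows code page will usually fail it.

```c
#include <stddef.h>

/* Return 1 if the n bytes at s are well-formed UTF-8, else 0.
 * Rejects stray continuation bytes, truncated sequences, overlong
 * forms, surrogates and values above U+10FFFF. (Hypothetical helper.) */
int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char c = s[i];
        unsigned long cp, min;
        size_t need, k;

        if (c < 0x80) { i++; continue; }          /* ASCII */
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;                            /* not a lead byte */

        if (n - i - 1 < need)
            return 0;                             /* sequence cut short */
        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;                         /* missing continuation */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;                             /* overlong or out of range */
        i += need + 1;
    }
    return 1;
}
```

For example, the single byte 0xF1 (ñ in the Western Windows code page) followed by ASCII fails the check, while the two bytes c3 b1 (ñ in UTF-8) pass.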
Re: Having strange result on processing UTF-8 file
On 19/12/2021 08:31, Michael D. Setzer II wrote:
> But could change if they add more or remove some currently 633 records. Some lines in the file are over 25000 characters?? Total download is about 13M. The actual lines I need for the data are just 256K, so it has lots of junk (stuff I don't need for what I'm doing).

That 13M file. Does it contain html? If so, would it be easier to work with if it was converted to plain text?

-- 
Did 황준호 die?
Re: Having strange result on processing UTF-8 file
On 19 Dec 2021 at 7:54, Ed Greshko wrote:

From: Ed Greshko
Date sent: Sun, 19 Dec 2021 07:54:31 +0800
Subject: Re: Having strange result on processing UTF-8 file
To: users@lists.fedoraproject.org
Send reply to: Community support for Fedora users

> On 19/12/2021 02:15, Michael D. Setzer II via users wrote:
> > Download 64 web pages into a single file using wget2. That is fine.
>
> One more thing.
>
> The single file you get is an html formatted file, yes? For the results that
> you want, and how you want to use it, do you really want html? If not, why
> don't you convert to plain text?
>
> Can we assume the 64 pages are always the same pages?

Yes. Figured out a workaround, but I'm not exactly sure what it is that changes the file from UTF-8 to the strange type.

    system("wget2 --max-threads=70 --secure-protocol=PFS -q --base=\"https://www.uog.edu/directory/\" -i testlistuog");

testlist.uog has lines

    ?page=01
    ?page=02
    ---
    ?page=64

But could change if they add more or remove some; currently 633 records. Some lines in the file are over 25000 characters?? Total download is about 13M. The actual lines I need for the data are just 256K, so it has lots of junk (stuff I don't need for what I'm doing).

Originally had it find where the UTF-8 characters were on the line, and printed out the hex for the 2 or 3 byte strings. Then would print from that point in the line using %10.10s, since I didn't need to see the whole line. But that causes the problem, and I'm not sure why. Modified the program to print out just the 2 or 3 byte UTF-8 character, and the output stays UTF-8 like the original file. Then tried just using %s, and it also stays a UTF-8 file?? But as I mentioned, some lines are over 25000 characters. Some lines have multiple UTF-8 characters, so perhaps the %10.10s was cutting in the middle of some UTF-8 code?

Contents of the main function. Not pretty, but works.

    /* Body of main(int argc, char *argv[]); needs <stdio.h>, <stdlib.h>, <string.h>. */
    FILE *fp1, *fp2;
    char line[32000], fileout[4096];
    unsigned char c1, c2, c3;
    size_t i, len;
    int j = 0;

    if (argc < 2) { printf("Need File name??\n"); exit(1); }
    fp1 = fopen(argv[1], "r");
    if (fp1 == NULL) { perror(argv[1]); exit(1); }
    /* build "<input>.out"; the original 20-byte buffer could overflow */
    snprintf(fileout, sizeof fileout, "%s.out", argv[1]);
    fp2 = fopen(fileout, "wb");
    if (fp2 == NULL) { perror(fileout); exit(1); }

    while (fgets(line, sizeof line, fp1) != NULL) {   /* test fgets, not feof */
        line[strcspn(line, "\n")] = 0;                /* strip the newline only if present */
        j++;
        len = strlen(line);
        if (len < 3) continue;
        for (i = 0; i < len - 2; i++) {
            c1 = (unsigned char)line[i];
            if (c1 < 0x80) continue;                  /* plain ASCII */
            c2 = (unsigned char)line[i + 1];
            c3 = (unsigned char)line[i + 2];
            /* lead bytes 194, 195, 196 and 200 are treated as 2-byte
               sequences, everything else as 3-byte */
            if (c1 != 194 && c1 != 195 && c1 != 196 && c1 != 200) {
                fprintf(fp2, "%5d %5ld %2.2x%2.2x%2.2x [%s]\n", j, (long)i, c1, c2, c3, &line[i]);
                i++;
            } else {
                fprintf(fp2, "%5d %5ld %2.2x%2.2x [%s]\n", j, (long)i, c1, c2, &line[i]);
            }
            i++;                                      /* skip the continuation byte */
        }
    }
    fclose(fp1);
    fclose(fp2);
    return 0;

Thanks again. Will try and figure out what causes it to go from UTF-8?? Like I said, the pages have lots of weird lines. But I get the data I need, and make a mariadb with the 633 records that can be sorted via php. There are actually only 3 lines I use that have UTF-8 characters, while the main file has 2000 lines with UTF-8 code. Guess at least one of those lines caused the issue??

13127 c3b1 [ña, Ph.D.;Crisostomo-Muña;Doreen;Professor of Accounting;School of Business & Public Administration;735-2501/20;doree...@triton.uog.edu]
13151 c3b1 [ña;Doreen;Professor of Accounting;School of Business & Public Administration;735-2501/20;doree...@triton.uog.edu]
27614 c3a5 [åni" Isidro;Isidro;Jaevani;Junior Web Developer;Office of Information Technology;735-2631;jisi...@triton.uog.edu]
34418 c381 [Álvarez-Piñer, Ph.D.;Madrid Álvarez-Piñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madr...@triton.uog.edu]
34429 c3b1 [ñer, Ph.D.;Madrid Álvarez-Piñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madr...@triton.uog.edu]
34448 c381 [Álvarez-Piñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madr...@triton.uog.edu]
34459 c3b1 [ñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madr...@triton.uog.edu]

Tried a number of things with iconv, but still ended with the problem format.

Again, thanks for the time.

> --
> Did 황준호 die?
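The program in this message tells 2-byte from 3-byte sequences by comparing the lead byte with a short list of values (194, 195, 196, 200). As a sketch only, the length can instead be read from the lead byte itself, which also covers 4-byte sequences. The names utf8_seq_len and utf8_hex are invented for this example and are not in the original program.

```c
#include <stddef.h>
#include <stdio.h>

/* Number of bytes in the UTF-8 sequence that starts with this lead
 * byte: 1 to 4, or 0 if the byte cannot start a sequence (a
 * continuation byte or a value that never appears in UTF-8). */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;                    /* ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

/* Write the hex digits of the sequence starting at p into out
 * (NUL-terminated) and return the number of bytes dumped.
 * Returns 0 if out is too small. */
size_t utf8_hex(const char *p, char *out, size_t outsz)
{
    size_t len = utf8_seq_len((unsigned char)p[0]);
    size_t k;

    if (len == 0)
        len = 1;                                  /* stray byte: dump it alone */
    if (outsz < 2 * len + 1)
        return 0;
    for (k = 0; k < len; k++) {
        /* stop early if the sequence is cut short (also stops at the NUL) */
        if (k > 0 && ((unsigned char)p[k] & 0xC0) != 0x80)
            break;
        snprintf(out + 2 * k, 3, "%02x", (unsigned char)p[k]);
    }
    return k;
}
```

In the scanning loop, the return value of utf8_hex could be used both to print the hex column and to decide how many bytes to skip, instead of the two separate fprintf branches.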
Re: Having strange result on processing UTF-8 file
On 19/12/2021 02:15, Michael D. Setzer II via users wrote:
> Download 64 web pages into a single file using wget2. That is fine.

One more thing.

The single file you get is an html formatted file, yes? For the results that you want, and how you want to use it, do you really want html? If not, why don't you convert to plain text?

Can we assume the 64 pages are always the same pages?

-- 
Did 황준호 die?
Re: Having strange result on processing UTF-8 file
On 19/12/2021 02:15, Michael D. Setzer II via users wrote:
> $ ./findnoascii2 allraw.uog
>
> Think this is the issue, but no idea how to fix it.
>
> $ file allraw.uog.out
> allraw.uog.out: Non-ISO extended-ASCII text

I assume findnoascii2 is written by you? Without knowing what it does (source), I think it would be hard for someone to diagnose.

And you said you changed the encoding afterward, but you don't say how.

-- 
Did 황준호 die?
Having strange result on processing UTF-8 file
I've spent a number of hours trying all kinds of things I've found on the web, but not getting anywhere. Probably something simple.

Download 64 web pages into a single file using wget2. That is fine.

    $ file allraw.uog
    allraw.uog: HTML document, UTF-8 Unicode text, with very long lines

File is about 13M (have no control of the source file).

Have a simple CPP program that finds lines that have special utf-8 characters. It would extract that code and print the output directly to the screen, which shows the correct utf characters. But if I redirect the output to a file and open it, many of the utf-8 characters show up as the wrong extended ascii character for the first byte and then weird code? Both in gedit and geany??

Modified the program to write output directly to a file, and if I use cat the output displays the correct utf-8, but again if I open the file in gedit or geany it shows a corrupted mix of extended ascii??

    $ ./findnoascii2 allraw.uog

Think this is the issue, but no idea how to fix it.

    $ file allraw.uog.out
    allraw.uog.out: Non-ISO extended-ASCII text

The file actually contains the correct utf-8 data, and looking at it with hexedit shows it, but both geany and gedit open the file as extended ASCII instead of UTF-8. Changing the encoding afterward to UTF-8 does nothing. Don't see options? Again, probably something simple.. Thanks.

Using cat to display the output is fine. Each line has the line number, position in line, hex code of the first character, then the character and a few more characters.

    1881 110 c2bb »
    313483 c3a5 åhan

Same lines from geany?

    1881 110 c2bb »
    313483 c3a5 Ã¥han

Thanks...
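The geany output above is what you get when each byte of a UTF-8 sequence is read as one character of a single-byte encoding. The sketch below assumes the editor fell back to something like ISO-8859-1 (that is a guess, the thread does not say which encoding was picked) and re-encodes each byte as its own character; the function name latin1_to_utf8 is made up for this example.

```c
#include <stddef.h>

/* Treat every input byte as an ISO-8859-1 character and write it to
 * out as UTF-8, NUL-terminated. Stops early if out would overflow.
 * Returns the number of bytes written, not counting the NUL. */
size_t latin1_to_utf8(const unsigned char *in, size_t n, char *out, size_t outsz)
{
    size_t i, o = 0;

    for (i = 0; i < n; i++) {
        unsigned char b = in[i];

        if (b < 0x80) {
            if (o + 1 >= outsz) break;            /* keep room for the NUL */
            out[o++] = (char)b;
        } else {
            if (o + 2 >= outsz) break;
            out[o++] = (char)(0xC0 | (b >> 6));   /* lead byte */
            out[o++] = (char)(0x80 | (b & 0x3F)); /* continuation byte */
        }
    }
    if (outsz > 0)
        out[o] = '\0';
    return o;
}
```

Feeding it the bytes c3 a5 (å in UTF-8) followed by "han" gives c3 83 c2 a5 and then "han", which displays as Ã¥han. So the bytes in the file can be correct while the editor shows them under the wrong encoding.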