Perhaps there is a utility or program that does this. I have been working with web pages that include some UTF-8 characters. I came up with a program that processes a file and creates a report file listing every line and position that contains a UTF-8 character, then a second file that summarizes each character with a count, and finally a third that does the same and includes the UTF-8 description of each character.
Found a list of all UTF-8 2-byte, 3-byte, and 4-byte codes. It turns out what I found was 122357 characters. Unfortunately, they were on pages that only listed around 1024 per page, so I had to merge it all into a file that turns out to be 4.4M in size.

Example of the process (line counts from wc -l):

  218544 allraw.uog           (combination of 64 web pages)
    2000 allraw.uog.out       (contains a total of 2000 UTF-8 characters)
      28 allraw.uog.out-sum   (the 2000 characters are 28 unique ones)
      28 allraw.uog.out-sum2  (list with names)
     633 uog.csv              (I extracted 633 lines of contact data)
       7 uog.csv.out          (only 7 lines with UTF-8 characters)
       3 uog.csv.out-sum      (only 3 unique UTF-8 characters)
       3 uog.csv.out-sum2     (list with names)
  122357 utf-8codeslook.csv   (4.4M file that has hex codes and descriptions)

Example: uog.csv.out

   131     27 c3b1 [ñ]
   131     51 c3b1 [ñ]
   276     14 c3a5 [å]
   344     18 c381 [Á]
   344     29 c3b1 [ñ]
   344     48 c381 [Á]
   344     59 c3b1 [ñ]

uog.csv.out-sum

      2 c381 [Á]
      1 c3a5 [å]
      4 c3b1 [ñ]

uog.csv.out-sum2

      2 c381 [Á] LATIN CAPITAL LETTER A WITH ACUTE (U+00C1)
      1 c3a5 [å] LATIN SMALL LETTER A WITH RING ABOVE (U+00E5)
      4 c3b1 [ñ] LATIN SMALL LETTER N WITH TILDE (U+00F1)

Those are really simple. The "all" file has 28 unique characters, including some strange ones:

      5 e2808b []  ZERO WIDTH SPACE (U+200B)
      1 e28092 [‒] FIGURE DASH (U+2012)
     44 e28093 [–] EN DASH (U+2013)
      2 e28094 [—] EM DASH (U+2014)

Not clear what the ZERO WIDTH SPACE is for? The other three all look the same to me; guess someone just needed to do something. Interesting result.

The program takes the filename as input and creates the other three files. If utf-8codeslook.csv is in the directory it creates the -sum2 file, otherwise it skips that step. Nice to have the long description on some. The lookup file is 4.4M but compresses to 510K as .xz.
Program findnoascii4.cpp (cleaned up slightly: fopen checked, cast instead of 256+char, scanf field widths added, and an unused prototype dropped):

#include <cstdio>
#include <cstring>
#include <cctype>
#include <cstdlib>
using namespace std;

int main(int argc, char *argv[])
{
    FILE *fp1, *fp2, *fp3;
    char line[32000], fileout[80], summary[120];
    char code[8], codedes[500], *p1, utf8[8], utf8xchar[8];
    char filename[80], filename2[80];
    int count, x;
    unsigned char c1, c2, c3, c4;
    size_t i;
    int j = 0;

    if (argc < 2) {
        printf("Need file name??\n");
        exit(1);
    }
    if (!(fp1 = fopen(argv[1], "r"))) {
        printf("Cannot open %s\n", argv[1]);
        exit(1);
    }
    strcpy(fileout, argv[1]);
    strcat(fileout, ".out");
    fp2 = fopen(fileout, "w");

    /* Pass 1: for every UTF-8 sequence, report line number, position,
       hex bytes, and the character itself.  Lead bytes 0xC2-0xDF start
       2-byte sequences, 0xE0-0xEF 3-byte, 0xF0-0xF4 4-byte.
       (case ranges are a GCC extension) */
    while (!feof(fp1)) {
        fgets(line, sizeof(line), fp1);
        j++;
        if (feof(fp1)) break;
        if (strlen(line) < 4) continue;
        for (i = 0; i < strlen(line) - 3; i++) {
            if (line[i] <= 0) {                 /* high bit set: non-ASCII */
                c1 = (unsigned char)line[i];
                c2 = (unsigned char)line[i+1];
                c3 = (unsigned char)line[i+2];
                c4 = (unsigned char)line[i+3];
                switch (c1) {
                case 194 ... 223:               /* 2-byte sequence */
                    fprintf(fp2, "%6d %6ld %2.2x%2.2x [%c%c]\n",
                            j, (long)i, c1, c2, c1, c2);
                    i += 1;
                    break;
                case 224 ... 239:               /* 3-byte sequence */
                    fprintf(fp2, "%6d %6ld %2.2x%2.2x%2.2x [%c%c%c]\n",
                            j, (long)i, c1, c2, c3, c1, c2, c3);
                    i += 2;
                    break;
                case 240 ... 244:               /* 4-byte sequence */
                    fprintf(fp2, "%6d %6ld %2.2x%2.2x%2.2x%2.2x [%c%c%c%c]\n",
                            j, (long)i, c1, c2, c3, c4, c1, c2, c3, c4);
                    i += 3;
                    break;
                }
            }
        }
    }
    fclose(fp1);
    fclose(fp2);

    /* Pass 2: count the unique codes with cut | sort | uniq -c. */
    sprintf(summary, "cut -b 15-30 <%s | sort | uniq -c >%s-sum",
            fileout, fileout);
    system(summary);

    /* Pass 3: only if the lookup file is present, add descriptions. */
    if (!(fp1 = fopen("utf-8codeslook.csv", "r")))
        return 0;
    fclose(fp1);
    sprintf(summary, "%s-sum %s-sum2", fileout, fileout);
    sscanf(summary, "%s %s", filename, filename2);
    fp2 = fopen(filename, "r");
    fp3 = fopen(filename2, "w");
    while (!feof(fp2)) {
        x = fscanf(fp2, "%d %7s %7s", &count, utf8, utf8xchar);
        if (x < 0) break;
        /* Rescan the lookup file from the top for each code. */
        fp1 = fopen("utf-8codeslook.csv", "r");
        while (1) {
            fscanf(fp1, "%7[^;];%499[^\n] ", code, codedes);
            p1 = strstr(code, utf8);
            if (p1 != NULL) break;
            if (feof(fp1)) break;
        }
        fprintf(fp3, "%7d %-10s %3s\t%s\n", count, code, utf8xchar, codedes);
        fclose(fp1);
    }
    fclose(fp2);
    fclose(fp3);
    return 0;
}

Perhaps someone else would find it useful, or perhaps something exists that does something similar that I wasn't able to find. Some have mentioned running across weird files with UTF-8.
Seems to work for what I want, and it was fun figuring it out. Thanks for your time. I would be happy to make the utf-8codeslook.xz file available, since it was a pain to add all the data from over 100 pages. Could anyone find a single page with the data?

First 5 lines:

c280;<control> (U+0080)
c281;<control> (U+0081)
c282;BREAK PERMITTED HERE (U+0082)
c283;NO BREAK HERE (U+0083)
c284;<control> (U+0084)

Some descriptions are almost 500 characters.
_______________________________________________ users mailing list -- users@lists.fedoraproject.org To unsubscribe send an email to users-le...@lists.fedoraproject.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure