Perhaps there is a utility or program that does this. I have been working with web pages that include some UTF-8 characters. I came up with a program that processes a file and creates a report file listing every line and position that contains a UTF-8 character, then a second file that summarizes each character with a count, and finally a third that does the same and includes the UTF-8 description of each character.
Found a list of all UTF-8 2-byte, 3-byte, and 4-byte codes. It turns out what I found was 122357 characters. Unfortunately, they were on pages that only listed around 1024 per page, so I had to merge it all into a file that turns out to be 4.4M in size.

Example of the process (line counts from wc -l):

  218544 allraw.uog           (combination of 64 web pages)
    2000 allraw.uog.out       (contains a total of 2000 UTF-8 characters)
      28 allraw.uog.out-sum   (the 2000 characters are 28 unique ones)
      28 allraw.uog.out-sum2  (list with names)
     633 uog.csv              (I extracted 633 lines of contact data)
       7 uog.csv.out          (only 7 lines with UTF-8 characters)
       3 uog.csv.out-sum      (only 3 unique UTF-8 characters)
       3 uog.csv.out-sum2     (list with names)
  122357 utf-8codeslook.csv   (4.4M file that has hex codes and descriptions)

Example: uog.csv.out

   131     27 c3b1 [ñ]
   131     51 c3b1 [ñ]
   276     14 c3a5 [å]
   344     18 c381 [Á]
   344     29 c3b1 [ñ]
   344     48 c381 [Á]
   344     59 c3b1 [ñ]

uog.csv.out-sum

      2 c381 [Á]
      1 c3a5 [å]
      4 c3b1 [ñ]

uog.csv.out-sum2

      2 c381 [Á] LATIN CAPITAL LETTER A WITH ACUTE (U+00C1)
      1 c3a5 [å] LATIN SMALL LETTER A WITH RING ABOVE (U+00E5)
      4 c3b1 [ñ] LATIN SMALL LETTER N WITH TILDE (U+00F1)

Those are really simple. The "all" file has 28 unique characters, including some strange ones:

      5 e2808b []  ZERO WIDTH SPACE (U+200B)
      1 e28092 [‒] FIGURE DASH (U+2012)
     44 e28093 [–] EN DASH (U+2013)
      2 e28094 [—] EM DASH (U+2014)

Not clear what the ZERO WIDTH SPACE is for? The other three all look the same to me; guess someone just needed to do something. Interesting result.

The program takes the filename as input and creates the other three files. If utf-8codeslook.csv is in the directory it creates the -sum2 file, otherwise it skips that step. Nice to have the long description on some. The lookup file is 4.4M but compresses to 510K as .xz.
Program findnoascii4.cpp (cleaned up slightly: fopen checked, cast instead of 256+char, scanf field widths added, and an unused prototype dropped):

#include <cstdio>
#include <cstring>
#include <cctype>
#include <cstdlib>
using namespace std;

int main(int argc, char *argv[])
{
    FILE *fp1, *fp2, *fp3;
    char line[32000], fileout[80], summary[120];
    char code[8], codedes[500], *p1, utf8[8], utf8xchar[8];
    char filename[80], filename2[80];
    int count, x;
    unsigned char c1, c2, c3, c4;
    size_t i;
    int j = 0;

    if (argc < 2) {
        printf("Need file name??\n");
        exit(1);
    }
    if (!(fp1 = fopen(argv[1], "r"))) {
        printf("Cannot open %s\n", argv[1]);
        exit(1);
    }
    strcpy(fileout, argv[1]);
    strcat(fileout, ".out");
    fp2 = fopen(fileout, "w");

    /* Pass 1: for every UTF-8 sequence, report line number, position,
       hex bytes, and the character itself.  Lead bytes 0xC2-0xDF start
       2-byte sequences, 0xE0-0xEF 3-byte, 0xF0-0xF4 4-byte.
       (case ranges are a GCC extension) */
    while (!feof(fp1)) {
        fgets(line, sizeof(line), fp1);
        j++;
        if (feof(fp1)) break;
        if (strlen(line) < 4) continue;
        for (i = 0; i < strlen(line) - 3; i++) {
            if (line[i] <= 0) {                 /* high bit set: non-ASCII */
                c1 = (unsigned char)line[i];
                c2 = (unsigned char)line[i+1];
                c3 = (unsigned char)line[i+2];
                c4 = (unsigned char)line[i+3];
                switch (c1) {
                case 194 ... 223:               /* 2-byte sequence */
                    fprintf(fp2, "%6d %6ld %2.2x%2.2x [%c%c]\n",
                            j, (long)i, c1, c2, c1, c2);
                    i += 1;
                    break;
                case 224 ... 239:               /* 3-byte sequence */
                    fprintf(fp2, "%6d %6ld %2.2x%2.2x%2.2x [%c%c%c]\n",
                            j, (long)i, c1, c2, c3, c1, c2, c3);
                    i += 2;
                    break;
                case 240 ... 244:               /* 4-byte sequence */
                    fprintf(fp2, "%6d %6ld %2.2x%2.2x%2.2x%2.2x [%c%c%c%c]\n",
                            j, (long)i, c1, c2, c3, c4, c1, c2, c3, c4);
                    i += 3;
                    break;
                }
            }
        }
    }
    fclose(fp1);
    fclose(fp2);

    /* Pass 2: count the unique codes with cut | sort | uniq -c. */
    sprintf(summary, "cut -b 15-30 <%s | sort | uniq -c >%s-sum",
            fileout, fileout);
    system(summary);

    /* Pass 3: only if the lookup file is present, add descriptions. */
    if (!(fp1 = fopen("utf-8codeslook.csv", "r")))
        return 0;
    fclose(fp1);
    sprintf(summary, "%s-sum %s-sum2", fileout, fileout);
    sscanf(summary, "%s %s", filename, filename2);
    fp2 = fopen(filename, "r");
    fp3 = fopen(filename2, "w");
    while (!feof(fp2)) {
        x = fscanf(fp2, "%d %7s %7s", &count, utf8, utf8xchar);
        if (x < 0) break;
        /* Rescan the lookup file from the top for each code. */
        fp1 = fopen("utf-8codeslook.csv", "r");
        while (1) {
            fscanf(fp1, "%7[^;];%499[^\n] ", code, codedes);
            p1 = strstr(code, utf8);
            if (p1 != NULL) break;
            if (feof(fp1)) break;
        }
        fprintf(fp3, "%7d %-10s %3s\t%s\n", count, code, utf8xchar, codedes);
        fclose(fp1);
    }
    fclose(fp2);
    fclose(fp3);
    return 0;
}

Perhaps someone else would find it useful, or perhaps something exists that does something similar that I wasn't able to find. Some have mentioned running across weird files with UTF-8.
Seems to work for what I want, and it was fun figuring it out. Thanks for your time. I would be happy to make the utf-8codeslook.xz file available, since it was a pain to add all the data from over 100 pages. Could anyone find a single page with the data?

First 5 lines:

c280;<control> (U+0080)
c281;<control> (U+0081)
c282;BREAK PERMITTED HERE (U+0082)
c283;NO BREAK HERE (U+0083)
c284;<control> (U+0084)

Some descriptions are almost 500 characters.
_______________________________________________ users mailing list -- users@lists.fedoraproject.org To unsubscribe send an email to users-le...@lists.fedoraproject.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure