Re: Having strange result on processing UTF-8 file

2021-12-18 Thread Tim via users

On Sat, 2021-12-18 at 20:40 -0500, Tom Horsley wrote:
> Just a (possibly) relevant note: I've seen many html pages with
> headers claiming they are UTF-8, but text that only displays
> correctly if you treat them as one of the windows code pages.
> 
> Worse yet, some browsers have heuristics to detect this and display
> them "correctly", so the creators usually never notice.

Supplementary:  Some webpages use a mixture of different encodings,
because the author has just dragged crap in from various sources
without even considering character encoding.
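
One way to test whether a page that claims to be UTF-8 really is: check that every byte above 127 belongs to a well-formed sequence. The following is only a minimal sketch; the function name `is_valid_utf8` is made up for illustration, and the check is structural (lead byte plus the right number of continuation bytes), so it does not reject every overlong or surrogate encoding.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf[0..len) has well-formed UTF-8
   structure, 0 otherwise. Simplified: some overlong and surrogate
   encodings (for example E0 80 80) are not rejected. */
static int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t need;                               /* continuation bytes expected */
        if (b < 0x80) { i++; continue; }           /* plain ASCII */
        else if (b >= 0xC2 && b <= 0xDF) need = 1; /* 2-byte sequence */
        else if ((b & 0xF0) == 0xE0) need = 2;     /* 3-byte sequence */
        else if (b >= 0xF0 && b <= 0xF4) need = 3; /* 4-byte sequence */
        else return 0;                             /* stray continuation or bad lead */
        if (len - i <= need) return 0;             /* sequence cut off at end of buffer */
        for (size_t k = 1; k <= need; k++)
            if ((buf[i + k] & 0xC0) != 0x80) return 0;
        i += need + 1;
    }
    return 1;
}
```

A Windows-1252 "ñ" is the single byte 0xF1, which this check rejects, while the UTF-8 form (0xC3 0xB1) passes.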
 
-- 
 
uname -rsvp
Linux 3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 30 15:51:32 UTC 2021 x86_64
 
Boilerplate:  All unexpected mail to my mailbox is automatically deleted.
I will only get to see the messages that are posted to the mailing list.
 
___
users mailing list -- users@lists.fedoraproject.org
To unsubscribe send an email to users-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org
Do not reply to spam on the list, report it: 
https://pagure.io/fedora-infrastructure

Re: Having strange result on processing UTF-8 file

2021-12-18 Thread Ed Greshko


On 19/12/2021 09:50, Michael D. Setzer II wrote:

%10.10s and
%20.20s both would cause the problem.


I believe those are both printf format indicators. Which is why I was wondering 
if converting to plain text would be better
because those would be removed (dealt with) during the convert.

--
Did 황준호 die?

Re: Having strange result on processing UTF-8 file

2021-12-18 Thread Michael D. Setzer II via users

On 19 Dec 2021 at 9:14, Ed Greshko wrote:

From:   Ed Greshko 
Date sent:  Sun, 19 Dec 2021 09:14:37 +0800
Subject:Re: Having strange result on processing UTF-8 file
To: "Michael D. Setzer II" ,
Community support for Fedora users 
Send reply to:  Community support for Fedora users 

> On 19/12/2021 08:31, Michael D. Setzer II wrote:
> 
> But could change if they add more or remove some 
> currently 633 records. Some lines in the file are over 
> 25000 characters?? Total download is about 13M.
> The actual lines I need for the data are just 256K, so it 
> has lots of junk (stuff I don't need for what I'm doing).
> 
> That 13M file. Does it contain html? If so, would it be easier 
> to work with if it was converted to plain text?

Yes, they are all html pages, but some of the UTF-8 
characters don't map to a plain-text character, and they are in the 
name field. Did figure out the issue. %10.10s and 
%20.20s both would cause the problem. So I used the 
head command to pull various numbers of lines until I 
found where the file went Non-ISO extended-ASCII.
Only a few lines caused the issue, and in each it was the last 
character in the substring being a character above 127.

So added these commands to copy 30 characters from the 
point, but would then go from end and if last character 
was >127 change it to null
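strcpy(linex,&line[i]);
linex[30]=0;
/* cast: where plain char is signed, bytes above 127 compare as negative;
   the linex[0] test stops strlen(linex)-1 from wrapping on an empty string */
while(linex[0] && (unsigned char)linex[strlen(linex)-1]>127) linex[strlen(linex)-1]=0;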

Then used %s and just printed linex. 
218544 lines in allraw.uog
1898 lines in allraw.uog.out (lines with utf-8)
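
The trimming loop above removes every trailing byte above 127, so it also drops a complete accented character that happens to end the substring. An alternative, sketched below, is to move the cut point back only as far as the start of the character that straddles it. The helper name `utf8_safe_cut` is hypothetical, and the sketch assumes the input is already valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: largest length <= max_bytes at which `s` can be
   cut without splitting a UTF-8 sequence. Assumes `s` is valid UTF-8. */
static size_t utf8_safe_cut(const char *s, size_t max_bytes)
{
    size_t len = strlen(s);
    if (len <= max_bytes)
        return len;                 /* nothing to cut */
    size_t cut = max_bytes;
    /* Continuation bytes look like 10xxxxxx. If the first excluded byte is
       one, the cut is inside a character, so step back to its lead byte. */
    while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80)
        cut--;
    return cut;
}
```

It could then be used as `printf("%.*s", (int)utf8_safe_cut(&line[i], 30), &line[i]);`, which prints at most 30 bytes and never ends on half a character.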

The uog.csv has 633 lines, but only these 3 have UTF-8 characters:
  13127 c3b1 [ña, Ph.D.;Crisostomo-Muña;Do]
  13151 c3b1 [ña;Doreen;Professor of Accoun]
  27614 c3a5 [åni" Isidro;Isidro;Jaevani;Ju]
  34418 c381 [Álvarez-Piñer, Ph.D.;Madrid ]
  34429 c3b1 [ñer, Ph.D.;Madrid Álvarez-Pi]
  34448 c381 [Álvarez-Piñer;Carlos;Directo]
  34459 c3b1 [ñer;Carlos;Director / Associa]

Whole web page has a lot of other utf-8 characters.

Thanks again.

> --
> Did 황준호 die?


Re: Having strange result on processing UTF-8 file

2021-12-18 Thread Tom Horsley

On Sun, 19 Dec 2021 09:14:37 +0800
Ed Greshko wrote:

> Does it contain html?

Just a (possibly) relevant note: I've seen many html pages with headers
claiming they are UTF-8, but text that only displays correctly if you
treat them as one of the windows code pages.

Worse yet, some browsers have heuristics to detect this and display them
"correctly", so the creators usually never notice.

Re: Having strange result on processing UTF-8 file

2021-12-18 Thread Ed Greshko


On 19/12/2021 08:31, Michael D. Setzer II wrote:


But could change if they add more or remove some currently 633 records. Some 
lines in the file are over 25000 characters?? Total download is about 13M.
The actual lines I need for the data are just 256K, so it has lots of junk 
(stuff I don't need for what I'm doing).


That 13M file.  Does it contain html?  If so, would it be easier to work with 
if it was converted to plain text?

--
Did 황준호 die?

Re: Having strange result on processing UTF-8 file

2021-12-18 Thread Michael D. Setzer II via users

On 19 Dec 2021 at 7:54, Ed Greshko wrote:

From:   Ed Greshko 
Date sent:  Sun, 19 Dec 2021 07:54:31 +0800
Subject:Re: Having strange result on processing
UTF-8 file
To: users@lists.fedoraproject.org
Send reply to:  Community support for Fedora
users 

> On 19/12/2021 02:15, Michael D. Setzer II via users wrote:
> > Download 64 web pages into a single file using wget2. That is fine.
>
> One more thing.
>
> The single file you get is an html formatted file, yes?  For the results that 
> you want, and how you want to
> use it, do you really want html?  If not, why don't you convert to plain text?
>
> Can we assume the 64 pages are always the same pages?
>
Yes. Figured out a workaround, but not exactly sure what the
issue is that changes the file from UTF-8 to a strange type.
system("wget2 --max-threads=70 --secure-protocol=PFS -q "
"--base=\"https://www.uog.edu/directory/\" "
"-i testlistuog");
testlist.uog has lines
?page=01
?page=02
---
?page=64

But could change if they add more or remove some
currently 633 records. Some lines in the file are over
25000 characters?? Total download is about 13M.
The actual lines I need for the data are just 256K, so it
has lots of junk (stuff I don't need for what I'm doing).

Originally had it find where the UTF-8 characters were
on the line, and print out the hex for the 2 or 3 byte
strings. Then it would print from that point in the line using
%10.10s, since I didn't need to see whole lines. But that
causes the problem? Not sure why.

Modified program to just print out the 2 or 3 byte UTF-8
character and file stays the same as original file. Then
tried just using %s and it also stays a UTF-8 file?? But as
I mentioned some lines are over 25000 characters? Some
lines have multiple UTF-8 characters, so perhaps the
%10.10s was hitting in the middle of some UTF-8 sequence?
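
That guess can be checked in isolation. For a plain `%s` conversion, the precision in C counts bytes, not characters, so a limit that lands between the two bytes of "ñ" leaves a dangling lead byte in the output. The sketch below uses `snprintf` so the result can be inspected; the helper name `clip_bytes` and the sample string are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format at most `n` bytes of `src` into `dst`, the
   same way printf("%.*s", n, src) would print them. Returns the number
   of bytes now stored in `dst`. */
static size_t clip_bytes(char *dst, size_t dstsize, const char *src, int n)
{
    if (dstsize == 0)
        return 0;
    if (snprintf(dst, dstsize, "%.*s", n, src) < 0)
        dst[0] = '\0';              /* formatting failed: leave an empty string */
    return strlen(dst);
}
```

With "Muña" stored as the bytes 4d 75 c3 b1 61, clipping to 3 bytes keeps 4d 75 c3. That lone c3 is not valid UTF-8, which would be enough for `file` and the editors to stop treating the output as UTF-8.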

Contents of the main function. Not  pretty, but works.

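/* needs <stdio.h>, <stdlib.h> and <string.h> */
FILE *fp1,*fp2;
char line[32000],fileout[4096];
unsigned char c1,c2,c3;
size_t i,len;
int j=0;
if (argc<2)
{
printf("Need File name??\n");
exit(1);
}
fp1=fopen(argv[1],"r");
if(fp1==NULL) { perror(argv[1]); exit(1); }
/* snprintf truncates instead of overflowing, unlike strcpy/strcat into a 20-byte buffer */
snprintf(fileout,sizeof fileout,"%s.out",argv[1]);
fp2=fopen(fileout,"wb");
if(fp2==NULL) { perror(fileout); exit(1); }
/* test fgets itself: feof() only becomes true after a read has already failed */
while(fgets(line,sizeof line,fp1)!=NULL)
{
len=strlen(line);
if(len>0 && line[len-1]=='\n') line[--len]=0; /* strip the newline only if there is one */
j++;
if(len<3) continue;
for(i=0;i<len-2;i++)
{
if((unsigned char)line[i]>127) /* non-ASCII byte, whether plain char is signed or not */
{
c1=(unsigned char)line[i];
c2=(unsigned char)line[i+1];
c3=(unsigned char)line[i+2];
/* lead bytes 194,195,196,200 are treated as 2-byte sequences, anything else as 3-byte */
if(c1!=194 && c1!=195 && c1!=196 && c1!=200)
fprintf(fp2,"%5d %5ld %2.2x%2.2x%2.2x   [%s]\n",j,(long)i,
c1,c2,c3,&line[i]);
else
fprintf(fp2,"%5d %5ld %2.2x%2.2x [%s]\n",j,(long)i,
c1,c2,&line[i]);
if(c1!=194 && c1!=195 && c1!=196 && c1!=200) i++; /* skip the extra byte of a 3-byte sequence */
i++;
}
}
}
fclose(fp1); fclose(fp2);
return 0;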
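
The program decides between 2-byte and 3-byte sequences by comparing the lead byte against a fixed list (194, 195, 196, 200). The lead byte itself encodes the sequence length, so a more general test is possible. A minimal sketch, with a hypothetical helper name `utf8_seq_len`:

```c
/* Hypothetical helper: number of bytes in the UTF-8 sequence that starts
   with `lead`, or 0 if `lead` cannot start a sequence. */
static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;                   /* 0xxxxxxx: ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;         /* 1110xxxx */
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  /* 11110xxx */
    return 0;  /* continuation byte (10xxxxxx) or invalid lead (C0, C1, F5..FF) */
}
```

With this, any lead byte from 0xC2 to 0xDF counts as a 2-byte sequence rather than only the four listed values, and 4-byte sequences are not mistaken for 3-byte ones.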

Thanks again. Will try and figure what causes it to go
from UTF-8?? Like I said, the pages have lots of weird
lines. But get the data I need, and make a mariadb with
the 633 records that can be sorted via php..
There are actually only 3 lines I use that have UTF-8
characters, while the main file has 2000 lines with UTF-8
codes. Guess at least one of those lines caused the issue??

  13127 c3b1 [ña, Ph.D.;Crisostomo-Muña;Doreen;Professor of Accounting;School of Business & Public Administration;735-2501/20;doree...@triton.uog.edu]
  13151 c3b1 [ña;Doreen;Professor of Accounting;School of Business & Public Administration;735-2501/20;doree...@triton.uog.edu]
  27614 c3a5 [åni" Isidro;Isidro;Jaevani;Junior Web Developer;Office of Information Technology;735-2631;jisi...@triton.uog.edu]
  34418 c381 [Álvarez-Piñer, Ph.D.;Madrid Álvarez-Piñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madr...@triton.uog.edu]
  34429 c3b1 [ñer, Ph.D.;Madrid Álvarez-Piñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madr...@triton.uog.edu]
  34448 c381 [Álvarez-Piñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madr...@triton.uog.edu]
  34459 c3b1 [ñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madr...@triton.uog.edu]

tried a number of things with iconv, but still ended with
the problem format.

Again, thanks for the time.

> --
> Did 황준호 die?

Re: Having strange result on processing UTF-8 file

2021-12-18 Thread Ed Greshko


On 19/12/2021 02:15, Michael D. Setzer II via users wrote:

Download 64 web pages into a single file using wget2. That is fine.


One more thing.

The single file you get is an html formatted file, yes?  For the results that 
you want, and how you want to
use it, do you really want html?  If not, why don't you convert to plain text?

Can we assume the 64 pages are always the same pages?

--
Did 황준호 die?

Re: Having strange result on processing UTF-8 file

2021-12-18 Thread Ed Greshko


On 19/12/2021 02:15, Michael D. Setzer II via users wrote:

$ ./findnoascii2 allraw.uog
Think this is the issue, but no idea how to fix it.
$ file allraw.uog.out
allraw.uog.out: Non-ISO extended-ASCII text


I assume findnoascii2 was written by you?  Without knowing what it does (source), 
I think it would
be hard for someone to diagnose.

And you said you changed the encoding afterward, but you don't say how.

--
Did 황준호 die?

Having strange result on processing UTF-8 file

2021-12-18 Thread Michael D. Setzer II via users

I've spent a number of hours trying all kinds of things I've
found on the web, but not getting anywhere. Probably
something simple.

Download 64 web pages into a single file using wget2.
That is fine.

file allraw.uog
allraw.uog: HTML document, UTF-8 Unicode text, with
very long lines
File is about 13M (have no control of the source file)
Have a simple CPP program that finds lines that have
special utf-8 characters. It would extract that code and
print output to the screen directly, which shows the correct utf
characters. But if I redirect the output to a file and open it,
many of the utf-8 characters show up as the wrong extended
ascii character for the first byte and then weird code? Both in
gedit and geany??
Modified the program to write output directly to a file, and if I
use cat the output displays the correct utf-8, but again if I
open the file in gedit or geany it shows a corrupted mix of
extended ascii??

$ ./findnoascii2 allraw.uog
Think this is the issue, but no idea how to fix it.
$ file allraw.uog.out
allraw.uog.out: Non-ISO extended-ASCII text

The file actually contains the correct utf-8 data, and
looking at it with hexedit shows it, but both geany and
gedit open the file as extended ASCII instead of UTF-8.
Changing the encoding afterward to UTF-8 does nothing.
Don't see other options? Again, probably something simple..

Thanks.
Using cat to display the output is fine.
Columns: line number, position in line, hex code of first character, then
the character and a few more characters.
 1881   110 c2bb   » 
 313483 c3a5   åhan

Same lines from geany?
 1881   110 c2bb   Â» 
 313483 c3a5   Ã¥han
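
The geany lines are what appears when the same bytes are read as a single-byte encoding such as ISO-8859-1: each byte of the UTF-8 sequence becomes its own character. The sketch below reproduces that reading by re-encoding each byte as if it were a Latin-1 character. The helper name `latin1_to_utf8` is hypothetical, and that the editors fall back to Latin-1 specifically is an assumption; the displayed characters are consistent with it.

```c
#include <stddef.h>

/* Hypothetical helper: treat each of the `len` input bytes as one
   ISO-8859-1 character and write it out as UTF-8. `out` must have room
   for 2*len+1 bytes. Returns the number of bytes written, not counting
   the terminating NUL. */
static size_t latin1_to_utf8(const unsigned char *in, size_t len,
                             unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[o++] = in[i];                                   /* ASCII stays one byte */
        } else {
            out[o++] = (unsigned char)(0xC0 | (in[i] >> 6));    /* lead byte */
            out[o++] = (unsigned char)(0x80 | (in[i] & 0x3F));  /* continuation byte */
        }
    }
    out[o] = '\0';
    return o;
}
```

Feeding it c2 bb (the UTF-8 for "»") gives c3 82 c2 bb, which displays as "Â»", and c3 a5 ("å") gives c3 83 c2 a5, which displays as "Ã¥", matching the geany output.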

Thanks...