Re: readseg dump and non-ASCII characters

Michael Coffey Wed, 15 Nov 2017 17:29:26 -0800

Thanks for the note, Sebastian. Yes, it is the fetched HTML that I parse using 
python-based tools after getting it from readseg. This is an alternative I 
decided to use after having struggled with raw-binary-content and solr.
I figured the problem was readseg either decoding or encoding improperly, but 
I didn't know which. Your point #3 seems to say it's the decode that goes wrong 
because it doesn't consider the encoding of the fetched page.
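
To check my understanding, here is a small Python sketch of how I think the 
'???' comes about. It is not Nutch code: dump_with_charset is a hypothetical 
stand-in that decodes bytes with one charset and writes them back out with the 
same charset, which is roughly what I assume happens when the platform default 
is not UTF-8.

```python
# A curly apostrophe (U+2019) as it appears in a UTF-8 page: three bytes.
raw = "\u2019".encode("utf-8")


def dump_with_charset(data: bytes, charset: str) -> bytes:
    """Hypothetical stand-in for a dump step (not Nutch's implementation).

    Decodes the fetched bytes with `charset`, then writes the resulting
    string back out with the same charset, replacing anything that cannot
    be represented.
    """
    text = data.decode(charset, errors="replace")
    return text.encode(charset, errors="replace")


# With an ASCII default, each of the three bytes is undecodable and ends up
# as a separate question mark.
print(dump_with_charset(raw, "ascii"))   # b'???'

# With a UTF-8 default, the bytes survive the round trip unchanged.
print(dump_with_charset(raw, "utf-8"))   # b'\xe2\x80\x99'
```

That would explain why one quote or dash shows up as three question marks 
rather than one.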

A follow-up: I don't quite understand how the "LC_ALL=en_US.utf8" would apply 
to a Hadoop job. Does it somehow propagate to all nodes in the cluster? Would 
it work just as well, or better, to use "-Dfile.encoding=UTF8" in the bin/nutch 
command?

      From: Sebastian Nagel <[email protected]>
 To: [email protected] 
 Sent: Wednesday, November 15, 2017 5:18 AM
 Subject: Re: readseg dump and non-ASCII characters

Hi Michael,

from the arguments I guess you're interested in the raw/binary HTML content, 
right?
After a closer look I have no simple answer:

 1. HTML has no fixed encoding - it could be anything; pageA may have a
    different encoding than pageB.

 2. That's different for parsed text: it's a Java String internally

 3. "readseg dump" converts all data to a Java String using the default platform
    encoding. On Linux, with these locales installed, you may get different
    results for:
      LC_ALL=en_US.utf8  ./bin/nutch readseg -dump
      LC_ALL=en_US       ./bin/nutch readseg -dump
      LC_ALL=ru_RU       ./bin/nutch readseg -dump
    When in doubt, try setting your platform encoding to UTF-8. Most pages
    nowadays are UTF-8.
    Btw., this behavior isn't ideal; it should be fixed as part of NUTCH-1807.

 4. A more reliable solution would require detecting the HTML encoding (the
    code is available in Nutch) and then converting the byte[] content using
    the right encoding.
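
Since you post-process with Python, point 4 could be approximated on your side 
if you can get at the undecoded bytes. The following is only a rough sketch 
under that assumption: sniff_charset and decode_html are hypothetical helper 
names, the detection looks only at a BOM and a <meta> charset declaration, and 
it does not reproduce Nutch's own encoding detection.

```python
import codecs
import re

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_charset(html: bytes, default: str = "utf-8") -> str:
    """Guess the charset of raw HTML bytes (hypothetical helper).

    Checks for a UTF-8 byte order mark, then for a <meta> charset
    declaration near the top of the page, and otherwise returns `default`.
    """
    if html.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    match = _META_CHARSET.search(html[:4096])
    if match:
        name = match.group(1).decode("ascii")
        try:
            codecs.lookup(name)  # reject names Python has no codec for
            return name
        except LookupError:
            pass
    return default


def decode_html(html: bytes) -> str:
    """Decode raw HTML bytes with the sniffed charset (hypothetical helper)."""
    return html.decode(sniff_charset(html), errors="replace")


# Usage: a windows-1252 page containing a curly apostrophe.
page = (
    '<html><head><meta charset="windows-1252"></head>'
    "<body>It\u2019s fine</body></html>"
).encode("cp1252")
print(sniff_charset(page))
print(decode_html(page))
```

Pages that declare their charset only in the HTTP Content-Type header, or 
declare it wrongly, would still need extra handling.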

Best,
Sebastian

On 11/15/2017 02:20 AM, Michael Coffey wrote:
> Greetings Nutchlings,
> I have been using readseg-dump successfully to retrieve content crawled by 
> nutch, but I have one significant problem: many non-ASCII characters appear 
> as '???' in the dumped text file. This happens fairly frequently in the 
> headlines of news sites that I crawl, for things like quotes, apostrophes, 
> and dashes.
> Am I doing something wrong, or is this a known bug? I use a python utf8 
> decoder, so it would be nice if everything were UTF8.
> Here is the command that I use to dump each segment (using nutch 1.12):
>   bin/nutch readseg -dump segPath destPath -noparse -noparsedata 
>   -noparsetext -nogenerate
> It is so close to working perfectly!
>
