Hi Michael,
from the arguments I guess you're interested in the raw/binary HTML content,
right?
After a closer look I have no simple answer:
1. HTML has no fixed encoding - it could be anything: pageA may use a different
encoding than pageB.
2. That's different for parsed text: internally it's a Java String.
3. "readseg -dump" converts all data to a Java String using the default platform
encoding. On Linux, if you have these locales installed, you may get different
results for:
LC_ALL=en_US.utf8 ./bin/nutch readseg -dump
LC_ALL=en_US ./bin/nutch readseg -dump
LC_ALL=ru_RU ./bin/nutch readseg -dump
If in doubt, set your platform encoding to UTF-8 - most pages nowadays are UTF-8
anyway (see the note below this list for a way to force it).
Btw., this behavior isn't ideal; it should be fixed as part of NUTCH-1807.
4. A more reliable solution would be to detect the HTML encoding (the code for
this is available in Nutch) and then convert the byte[] content using the
detected encoding - a short sketch follows below.
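
Regarding the remark in 3.: if changing the locale is inconvenient, you can
usually force the JVM's default encoding directly. This assumes your bin/nutch
script passes NUTCH_OPTS through to the java call (as far as I remember the
stock script does, but please check yours) and that your JVM honors
file.encoding at startup:

NUTCH_OPTS="-Dfile.encoding=UTF-8" ./bin/nutch readseg -dump segPath destPath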
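
Regarding 4., here is a minimal sketch of what I mean, assuming you have the raw
content as a byte[] and that ICU4J is on the classpath (Nutch bundles it). The
class name is just a placeholder; Nutch's own
org.apache.nutch.util.EncodingDetector does essentially this, with extra clues
taken from HTTP headers and meta tags:

import java.nio.charset.Charset;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class RawContentDecoder {
  // Guess the charset of the raw HTML bytes and decode them to a String,
  // falling back to UTF-8 when detection gives no usable answer.
  public static String decode(byte[] rawHtml) {
    CharsetDetector detector = new CharsetDetector();
    detector.setText(rawHtml);
    CharsetMatch match = detector.detect();  // best guess, may be null
    String charsetName = (match != null) ? match.getName() : "UTF-8";
    return new String(rawHtml, Charset.forName(charsetName));
  }
}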
Best,
Sebastian
On 11/15/2017 02:20 AM, Michael Coffey wrote:
> Greetings Nutchlings,
> I have been using readseg-dump successfully to retrieve content crawled by
> nutch, but I have one significant problem: many non-ASCII characters appear
> as '???' in the dumped text file. This happens fairly frequently in the
> headlines of news sites that I crawl, for things like quotes, apostrophes,
> and dashes.
> Am I doing something wrong, or is this a known bug? I use a python utf8
> decoder, so it would be nice if everything were UTF8.
> Here is the command that I use to dump each segment (using nutch 1.12):
> bin/nutch readseg -dump segPath destPath -noparse -noparsedata
> -noparsetext -nogenerate
> It is so close to working perfectly!
>