Re: readseg dump and non-ASCII characters

Michael Coffey Thu, 14 Dec 2017 10:31:05 -0800

Not sure it's practical to go around to all the hadoop machines and change 
their default encoding settings. Not sure it wouldn't break something else!


I'm wondering if there's a simple fix I could make to the source code to make 
nutch.segment.SegmentReader use utf-8 as a default when reading the segment 
data.



In SegmentReader.java, the only obvious file-reading code I see is in this 
append function:

  private int append(FileSystem fs, Configuration conf, Path src,
      PrintWriter writer, int currentRecordNumber) throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(
        fs.open(src)));
    try {
      String line = reader.readLine();
      while (line != null) {
        if (line.startsWith("Recno:: ")) {
          line = "Recno:: " + currentRecordNumber++;
        }
        writer.println(line);
        line = reader.readLine();
      }
      return currentRecordNumber;
    } finally {
      reader.close();
    }
  }


SegmentReader has three different lines that create an OutputStreamWriter. Two 
of those explicitly use "UTF-8", but the one that creates a PrintWriter 
implicitly uses default encoding.

If I insert a "UTF-8" arg into the InputStreamReader and OutputStreamWriter 
constructors, should that work? Is it likely to break something else?
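
For concreteness, here is a minimal sketch of the change being asked about. It is not the actual SegmentReader code: plain java.io streams stand in for fs.open(src), and the class name Utf8AppendSketch and the helper utf8Writer are hypothetical names used only for illustration. The one idea it shows is passing an explicit charset to both the InputStreamReader and the OutputStreamWriter so that neither side falls back to the platform default.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the relevant part of SegmentReader.
class Utf8AppendSketch {

  // Same loop as append() above, but the InputStream parameter replaces
  // fs.open(src) and the reader decodes with an explicit charset.
  static int append(InputStream in, PrintWriter writer, int currentRecordNumber)
      throws IOException {
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(in, StandardCharsets.UTF_8));
    try {
      String line = reader.readLine();
      while (line != null) {
        if (line.startsWith("Recno:: ")) {
          line = "Recno:: " + currentRecordNumber++;
        }
        writer.println(line);
        line = reader.readLine();
      }
      return currentRecordNumber;
    } finally {
      reader.close();
    }
  }

  // The writer side needs the same treatment; otherwise the text is
  // re-encoded with the platform default on the way out.
  static PrintWriter utf8Writer(OutputStream out) {
    return new PrintWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
  }
}
```

This only helps if the bytes being read really are UTF-8. If the intermediate files were already written with a different default encoding, the reader and every writer that produces them would need to change together.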








________________________________
From: Sebastian Nagel <[email protected]>
To: [email protected] 
Sent: Wednesday, November 15, 2017 5:18 AM
Subject: Re: readseg dump and non-ASCII characters



Hi Michael,

from the arguments I guess you're interested in the raw/binary HTML content, 
right?
After a closer look I have no simple answer:

1. HTML has no fixed encoding - it could be anything, pageA may have a different
    encoding than pageB.

2. That's different for parsed text: it's a Java String internally.

3. "readseg dump" converts all data to a Java String using the default platform
    encoding. On Linux, with these locales installed, you may get different
    results for:
       LC_ALL=en_US.utf8  ./bin/nutch readseg -dump
       LC_ALL=en_US       ./bin/nutch readseg -dump
       LC_ALL=ru_RU       ./bin/nutch readseg -dump
    If in doubt, try setting your platform encoding to UTF-8. Most pages
    nowadays are UTF-8.
    Btw., this behavior isn't ideal, it should be fixed as part of NUTCH-1807.

4. a more reliable solution would require detecting the HTML encoding (the code
    is available in Nutch) and then converting the byte[] content using the
    right encoding.
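
To illustrate points 3 and 4, here is a small sketch in plain Java (not code taken from Nutch; the class and method names are hypothetical). It assumes the page charset has already been detected by some other means, and shows why the charset used to decode the byte[] content matters, and why characters that a platform encoding cannot represent come out as '?'.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class ContentDecodeSketch {

  // Decode raw page bytes with an explicitly chosen charset instead of
  // the platform default. Detecting the charset is out of scope here.
  static String decode(byte[] content, Charset detected) {
    return new String(content, detected);
  }

  // Encode and decode again with a given charset, mimicking text that
  // passes through a platform default encoding. Characters the charset
  // cannot represent are replaced, typically by '?'.
  static String roundTrip(String text, Charset platformDefault) {
    return new String(text.getBytes(platformDefault), platformDefault);
  }
}
```

Decoding UTF-8 bytes as ISO-8859-1 yields a different (longer) string than decoding them as UTF-8, and round-tripping a typographic apostrophe through US-ASCII yields a single '?'.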

Best,
Sebastian




On 11/15/2017 02:20 AM, Michael Coffey wrote:
> Greetings Nutchlings,
> I have been using readseg-dump successfully to retrieve content crawled by 
> nutch, but I have one significant problem: many non-ASCII characters appear 
> as '???' in the dumped text file. This happens fairly frequently in the 
> headlines of news sites that I crawl, for things like quotes, apostrophes, 
> and dashes.
> Am I doing something wrong, or is this a known bug? I use a python utf8 
> decoder, so it would be nice if everything were UTF8.
> Here is the command that I use to dump each segment (using nutch 1.12):
>   bin/nutch readseg -dump segPath destPath -noparse -noparsedata -noparsetext -nogenerate
> It is so close to working perfectly!
>
