Hi,

I started using Nutch with HBase several days ago, following the
Nutch2Tutorial linked below, and it seemed to work:
http://wiki.apache.org/nutch/Nutch2Tutorial

Today I noticed that page contents are being truncated at 64 KB. The
original pages are actually smaller than 64 KB, but they are UTF-8, and
multi-byte characters seem to be encoded as escapes like "\xE3\x81\x93"
when stored in HBase. A 3-byte character then takes 12 characters, so
the stored size is almost 4 times that of the original content.
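
To make the size arithmetic concrete, here is a minimal Python sketch.
The helper hex_escape is hypothetical (it is not part of Nutch, Gora, or
HBase) and only mimics the "\xNN" notation quoted above. It shows the
size ratio that such escaping would produce; it does not confirm how
HBase actually stores the bytes.

```python
def hex_escape(raw: bytes) -> str:
    """Hypothetical helper: keep printable ASCII, write other bytes as \\xNN."""
    parts = []
    for b in raw:
        if 0x20 <= b < 0x7F and b != 0x5C:  # printable ASCII, except backslash
            parts.append(chr(b))
        else:
            parts.append(f"\\x{b:02X}")
    return "".join(parts)


text = "こ"                      # the character behind \xE3\x81\x93
raw = text.encode("utf-8")       # 3 bytes in UTF-8
escaped = hex_escape(raw)        # 12 ASCII characters once escaped

print(len(raw), len(escaped), len(escaped) / len(raw))
```

Under this assumption, a page made up mostly of 3-byte characters would
grow to roughly 4 times its size, while plain ASCII text would not grow
at all.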

Here are my questions:
1. How can I fix this? I'm guessing that changing the block size in
HBase would fix the problem, but I don't know how. In gora.properties,
perhaps?
2. After fixing the configuration, I need to fetch those incomplete
pages again. Is there an easy way to do this?

Any help would be appreciated.

Thanks,
Kaz
