Hi Kaz

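For the incomplete pages, you can raise the file.content.limit property in
nutch-site.xml (http.content.limit is the equivalent for pages fetched over
HTTP). Both default to 65536 bytes, which matches the 64KB cutoff you see.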
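
Roughly like this, as a sketch only. I am assuming your pages come in over
HTTP, so I show http.content.limit; use file.content.limit in the same way
for file: URLs. The value is in bytes, and 1048576 (1MB) is just an example
figure, not a recommendation. A negative value such as -1 turns truncation
off entirely.

```xml
<!-- nutch-site.xml: overrides the 65536-byte default from nutch-default.xml -->
<property>
  <name>http.content.limit</name>
  <!-- example value (assumption): 1MB; set to -1 for no truncation -->
  <value>1048576</value>
</property>
```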

After that, maybe you can regenerate the URLs and fetch again.




On Sat, Jan 12, 2013 at 5:09 PM, k4200 <[email protected]> wrote:

> Hi,
>
> I started using Nutch with HBase several days ago following the
> Nutch2Tutorial shown below and it seemed to start working.
> http://wiki.apache.org/nutch/Nutch2Tutorial
>
> Today, I noticed that page contents were cut down to 64KB. Actually,
> those pages are less than 64KB, but the contents are UTF-8, and
> multi-byte characters seem to be encoded like "\xE3\x81\x93" when
> stored in HBase, so basically the size becomes almost 4 times larger
> than that of the original content.
>
> Here are the questions:
> 1. How to fix this? I'm guessing changing the block size in HBase
> would fix the problem, but I don't know how. gora.properties, perhaps?
> 2. After fixing up the configurations, I need to fetch those
> incomplete pages again. Any easy way to do this?
>
> Any help would be appreciated.
>
> Thanks,
> Kaz
>



-- 
Don't Grow Old, Grow Up... :-)
