Hi Kaz,

For the incomplete pages, you can raise the file.content.limit property in nutch-site.xml.
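
As a rough sketch, the override in nutch-site.xml could look something like the following. The value -1 is my assumption for illustration: a negative limit disables truncation, and any byte count larger than your biggest page should work as well. I have also included http.content.limit, which is not mentioned above but is the equivalent limit for pages fetched over HTTP, so it is probably the one that matters for a web crawl. Both default to 65536 bytes, which matches the 64KB cutoff you are seeing.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: values here override nutch-default.xml -->
<configuration>
  <!-- Limit for content fetched via the file:// protocol, in bytes.
       Negative means no truncation (illustrative choice). -->
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
  <!-- Assumption: the pages are fetched over HTTP, in which case
       this is the limit that actually applies. -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```

If you already have other properties in nutch-site.xml, add only the two property elements inside the existing configuration element.
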
Maybe you can regenerate the URLs and fetch again.

On Sat, Jan 12, 2013 at 5:09 PM, k4200 <[email protected]> wrote:
> Hi,
>
> I started using Nutch with HBase several days ago following the
> Nutch2Tutorial shown below and it seemed to start working.
> http://wiki.apache.org/nutch/Nutch2Tutorial
>
> Today, I noticed that page contents were cut down to 64KB. Actually,
> those pages are less than 64KB, but the contents are UTF-8, and
> multi-byte characters seem to be encoded like "\xE3\x81\x93" when
> stored in HBase, so basically the size becomes almost 4 times larger
> than that of the original content.
>
> Here are the questions:
> 1. How to fix this? I'm guessing changing the block size in HBase
> would fix the problem, but I don't know how. gora.properties, perhaps?
> 2. After fixing up the configurations, I need to fetch those
> incomplete pages again. Any easy way to do this?
>
> Any help would be appreciated.
>
> Thanks,
> Kaz
>

--
Don't Grow Old, Grow Up... :-)

