Hi,

I started using Nutch with HBase several days ago, following the Nutch2Tutorial (http://wiki.apache.org/nutch/Nutch2Tutorial), and it seemed to be working.
Today I noticed that page contents are being truncated at 64 KB. The original pages are actually smaller than 64 KB, but the content is UTF-8, and multi-byte characters appear to be encoded like "\xE3\x81\x93" when stored in HBase, so the stored size ends up almost 4 times that of the original content.

Here are my questions:

1. How can I fix this? My guess is that changing the block size in HBase would solve the problem, but I don't know how to do that. Would it go in gora.properties, perhaps?
2. After fixing the configuration, I need to fetch the incomplete pages again. Is there an easy way to do this?

Any help would be appreciated.

Thanks,
Kaz
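
For reference, the "almost 4 times" estimate above can be checked with a small sketch. It only compares the raw UTF-8 byte count of one character with the length of its "\xNN"-escaped rendering. The helper name escaped_form and its escaping rule are assumptions made for illustration; the sketch says nothing about how HBase actually stores the bytes or where the 64 KB cut happens.

```python
# Illustrative sketch only: raw UTF-8 size vs. size of a \xNN-escaped rendering.
# escaped_form is a hypothetical helper, not part of Nutch, Gora, or HBase.

def escaped_form(data: bytes) -> str:
    """Keep printable ASCII as-is and render every other byte as \\xNN."""
    return "".join(
        chr(b) if 0x20 <= b < 0x7F else f"\\x{b:02X}"
        for b in data
    )

text = "\u3053"                 # one Japanese character (U+3053)
raw = text.encode("utf-8")      # its UTF-8 bytes: E3 81 93
escaped = escaped_form(raw)     # the escaped text quoted in the post

print(len(raw))                 # number of raw bytes
print(len(escaped))             # number of characters in the escaped form
print(len(escaped) / len(raw))  # growth factor for this character
```

Each non-ASCII byte becomes four ASCII characters in the escaped form, which matches the factor mentioned above for text that is mostly multi-byte; ASCII bytes would not grow under this rule.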

