Re: Extending HTML Parser to create subpage index documents

2009-10-20 Thread Andrzej Bialecki
malcolm smith wrote: I am looking to create a parser for a groupware product that would read pages from a message board type web site (think phpBB). But rather than creating a single Content item which is parsed and indexed to a single Lucene document, I am planning to have the parser create a
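A minimal sketch of that multi-document idea, assuming the Nutch 1.0 parse API, where a ParseResult can hold several Parse entries keyed by URL; the ForumPostParser name, the splitPosts() helper, and the #post-N URL scheme are hypothetical, and the exact ParseResult/ParseData signatures should be checked against your Nutch version:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.metadata.Metadata;
    import org.apache.nutch.parse.*;
    import org.apache.nutch.protocol.Content;

    // Hypothetical parser that emits one Parse entry per forum post
    // instead of one per fetched page. Sketch only, not a drop-in.
    public class ForumPostParser implements Parser {
      private Configuration conf;

      public ParseResult getParse(Content content) {
        String pageUrl = content.getUrl();
        ParseResult result = new ParseResult(pageUrl); // signature may differ by release
        int i = 0;
        for (String post : splitPosts(content.getContent())) {
          ParseData data = new ParseData(ParseStatus.STATUS_SUCCESS,
              "post " + i, new Outlink[0], new Metadata());
          // Each entry gets its own synthetic URL, so each one becomes
          // a separate Lucene document at index time.
          result.put(pageUrl + "#post-" + i, new ParseText(post), data);
          i++;
        }
        return result;
      }

      // Hypothetical splitter; real code would walk the HTML DOM.
      private java.util.List<String> splitPosts(byte[] html) {
        return java.util.Arrays.asList(new String(html).split("<hr>")); // placeholder
      }

      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }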

Nutch crawler charset issues utf-16

2009-10-20 Thread John_C_3
I'm attempting to crawl pages with charset UTF-16 and send the index to Solr where it can be searched. I followed the instructions here (http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) and successfully crawled and searched test content with UTF-8. However, when I attempt to crawl
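If the parser is falling back to a default encoding rather than sniffing the UTF-16 byte-order mark, one knob worth checking is the fallback encoding in nutch-site.xml; a sketch, assuming the stock parser.character.encoding.default property (whose shipped default is windows-1252). Note it only applies when no charset is detected from the HTTP headers or meta tags:

    <property>
      <name>parser.character.encoding.default</name>
      <value>utf-16</value>
    </property>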

crawl always stops at depth=3

2009-10-20 Thread nutchcase
My crawl always stops at depth=3. It gets documents but does not continue any further. Here is my nutch-site.xml:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch-solr-integration</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>1000</value>
  </property>
  <property>
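For context, crawl depth is set on the command line, not in nutch-site.xml; a typical Nutch 1.0 one-shot invocation looks like this (the urls seed directory and crawl output directory are just the conventional names from the tutorial):

    bin/nutch crawl urls -dir crawl -depth 10 -topN 50000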

Re: crawl always stops at depth=3

2009-10-20 Thread reinhard schwab
Try bin/nutch readdb crawl/crawldb -stats. Are there any unfetched pages? nutchcase wrote: My crawl always stops at depth=3. It gets documents but does not continue any further. Here is my nutch-site.xml: <?xml version="1.0"?> <configuration> <property> <name>http.agent.name</name>
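If the stats do show unfetched pages, one way to dig further is to run a single generate/fetch/updatedb cycle by hand instead of the all-in-one crawl command; a sketch using the standard Nutch 1.0 tools (the crawl/crawldb and crawl/segments paths and the timestamp-named segment pattern are assumptions based on the usual layout):

    bin/nutch generate crawl/crawldb crawl/segments -topN 50000
    s=`ls -d crawl/segments/2* | tail -1`   # newest segment
    bin/nutch fetch $s
    bin/nutch updatedb crawl/crawldb $s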

ERROR: current leaseholder is trying to recreate file.

2009-10-20 Thread Eric Osgood
This is the error I keep getting whenever I try to fetch more than 400K files at a time using a 4-node Hadoop cluster running Nutch 1.0. org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file
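One commonly suggested workaround for this class of lease error (not necessarily the fix recommended in the reply below) is to switch off speculative execution, so that two attempts of the same task never race to create the same output file; a hedged hadoop-site.xml sketch using the Hadoop 0.19/0.20-era property names:

    <property>
      <name>mapred.map.tasks.speculative.execution</name>
      <value>false</value>
    </property>
    <property>
      <name>mapred.reduce.tasks.speculative.execution</name>
      <value>false</value>
    </property>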

Re: ERROR: current leaseholder is trying to recreate file.

2009-10-20 Thread Andrzej Bialecki
Eric Osgood wrote: This is the error I keep getting whenever I try to fetch more than 400K files at a time using a 4-node Hadoop cluster running Nutch 1.0. org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file

Re: ERROR: current leaseholder is trying to recreate file.

2009-10-20 Thread Eric Osgood
Andrzej, I just downloaded the most recent trunk from svn as per your recommendations for fixing the generate bug. As soon as I have it all rebuilt with my configs, I will let you know how a crawl of ~1.6 million pages goes. Hopefully no errors! Thanks, Eric On Oct 20, 2009, at 2:13 PM, Andrzej
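For reference, checking out and rebuilding trunk at the time looked roughly like this (Nutch was still a Lucene subproject in 2009, so the repository path is an assumption about that layout; the nutch-trunk directory name is arbitrary):

    svn co http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch-trunk
    cd nutch-trunk
    ant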

Re: Extending HTML Parser to create subpage index documents

2009-10-20 Thread malcolm smith
Thank you very much for the helpful reply, I'm back on track. On Tue, Oct 20, 2009 at 2:01 AM, Andrzej Bialecki a...@getopt.org wrote: malcolm smith wrote: I am looking to create a parser for a groupware product that would read pages from a message board type web site (think phpBB). But rather