malcolm smith wrote:
I am looking to create a parser for a groupware product that would read
pages from a message-board-type web site (think phpBB). But rather than
creating a single Content item which is parsed and indexed into a single
Lucene document, I am planning to have the parser create a
I'm attempting to crawl pages with charset UTF-16 and send the index to Solr,
where it can be searched. I followed the instructions at
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ and
successfully crawled and searched test content with UTF-8. However, when I
attempt to crawl
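The mail is cut off here. For context on the UTF-16 side of the question, here is a minimal, illustrative sketch (not Nutch code; the function name is made up) of why UTF-16 pages need different handling than UTF-8 ones — UTF-16 content usually starts with a byte-order mark (BOM) that determines endianness:

```python
# Illustrative only, not Nutch code. UTF-16 pages usually begin with a
# byte-order mark (BOM) telling the decoder which endianness to use.
UTF16_LE_BOM = b"\xff\xfe"
UTF16_BE_BOM = b"\xfe\xff"

def decode_page(raw: bytes) -> str:
    """Decode raw page bytes, honouring a UTF-16 BOM if present."""
    if raw.startswith(UTF16_LE_BOM) or raw.startswith(UTF16_BE_BOM):
        # The 'utf-16' codec consumes the BOM and picks the right endianness.
        return raw.decode("utf-16")
    # Fall back to UTF-8 for pages without a UTF-16 BOM.
    return raw.decode("utf-8")

print(decode_page("hello".encode("utf-16")))  # encode() prepends a BOM
print(decode_page("hello".encode("utf-8")))
```

A crawler that assumes UTF-8 everywhere will mangle exactly this kind of content, which is consistent with the symptom described above.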
My crawl always stops at depth=3. It gets documents but does not continue any
further.
Here is my nutch-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch-solr-integration</value>
</property>
<property>
<name>generate.max.per.host</name>
<value>1000</value>
</property>
<property>
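(The config is truncated after this point in the original mail.) As a quick sanity check, here is a small sketch — not Nutch code — that parses a nutch-site.xml-style config and lists its property pairs; the XML literal repeats just the two complete properties shown above:

```python
import xml.etree.ElementTree as ET

# The two complete properties from the nutch-site.xml above, as a
# well-formed snippet (the original mail is truncated after them).
CONF = """\
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch-solr-integration</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>1000</value>
  </property>
</configuration>
"""

def properties(xml_text: str) -> dict:
    """Return {name: value} for every <property> in the config."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.iter("property")}

print(properties(CONF))
```

If `fromstring` raises a parse error on your real file, the markup itself is broken, which is worth ruling out before digging into crawl behaviour.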
Try:
bin/nutch readdb crawl/crawldb -stats
Are there any unfetched pages?
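To act on that suggestion programmatically, here is a hedged sketch that pulls the unfetched count out of `readdb -stats` output. It assumes the output contains a status line of the form `status 1 (db_unfetched): <count>`; the exact wording and whitespace vary by Nutch version, so the sample below is an assumption, not captured output:

```python
import re

# Sample output in the general shape readdb -stats prints; the exact
# layout varies by Nutch version, so treat this sample as an assumption.
SAMPLE_STATS = """\
CrawlDb statistics start: crawl/crawldb
TOTAL urls: 1250
status 1 (db_unfetched): 430
status 2 (db_fetched): 790
status 3 (db_gone): 30
CrawlDb statistics: done
"""

def unfetched_count(stats_output: str) -> int:
    """Pull the db_unfetched count out of readdb -stats output."""
    m = re.search(r"db_unfetched\):\s*(\d+)", stats_output)
    return int(m.group(1)) if m else 0

print(unfetched_count(SAMPLE_STATS))  # 430 for the sample above
```

If this reports zero unfetched pages, the crawl stopping at depth=3 simply means the frontier is exhausted rather than anything being wrong.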
nutchcase wrote:
My crawl always stops at depth=3. It gets documents but does not continue any
further.
Eric Osgood wrote:
This is the error I keep getting whenever I try to fetch more than 400K
files at a time using a 4-node Hadoop cluster running Nutch 1.0.
org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to
create file
Andrzej,
I just downloaded the most recent trunk from svn as per your
recommendations for fixing the generate bug. As soon as I have it all
rebuilt with my configs I will let you know how a crawl of ~1.6 million
pages goes. Hopefully no errors!
Thanks,
Eric
On Oct 20, 2009, at 2:13 PM, Andrzej
Thank you very much for the helpful reply, I'm back on track.
On Tue, Oct 20, 2009 at 2:01 AM, Andrzej Bialecki a...@getopt.org wrote: