You are correct.
On Fri, May 2, 2014 at 7:46 PM, chethan chethan.p...@gmail.com wrote:
Hi,
I have a Nutch crawl with 4 segments which are fully indexed using the
bin/nutch solrindex command. Now I'm all out of storage on the box, so can I delete
the 4 segments and retain only the crawldb?
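For reference, a minimal sketch of that workflow; the crawl directory name and Solr URL here are assumptions, not taken from this thread:

# index every segment into Solr first; once indexed, the crawldb
# (and linkdb, if you keep one) is enough to continue crawling
bin/nutch solrindex http://localhost:8983/solr/ TestCrawl/crawldb -linkdb TestCrawl/linkdb TestCrawl/segments/*
# then reclaim the space the segments occupy
rm -r TestCrawl/segments/*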
Thanks for your reply!
Regards,
--
Chethan Prasad
On Sat, May 3, 2014 at 12:22 PM, remi tassing tassingr...@gmail.com wrote:
You are correct.
Hi, playing around with Nutch 1.8 in local mode with Solr 4.7.
When indexing larger crawls, 10k and up, I get:
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
Could you provide the complete stack trace? You could probably add more debug info.
This could be due to a disk space issue...
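A quick way to check both, assuming the stock 1.x local-mode layout where the runtime writes to logs/hadoop.log:

# the full stack trace usually lands in the Hadoop log, not on stdout
tail -n 100 logs/hadoop.log
# and check free disk while you are at it
df -h .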
On Sat, May 3, 2014 at 8:51 PM, BlackIce blackice...@gmail.com wrote:
Bad Request
request: http://localhost:8983/solr/update?wt=javabin&version=2
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
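A 400 Bad Request on /update usually means a document contains a field that schema.xml does not declare; Solr's own log names the offending field. Assuming the stock Solr 4.x example layout:

# check the Solr side of the failure
tail -n 50 example/logs/solr.log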
In what case would you want to keep the segments? I'm
considering automatically deleting them after sending the data to Solr.
Hi,
it looks like the segment is not addressed properly:
hdfs://localhost:54310/user/hduser/TestCrawl/segments/crawl_generate
Segments are named by a timestamp, e.g.
.../TestCrawl/segments/20140502231126/
and crawl_generate is a subdirectory inside each segment.
Can you specify the exact commands you used to run the crawler?
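To illustrate the expected layout (the timestamp is an example, not taken from this thread):

$ ls TestCrawl/segments/
20140502231126
$ ls TestCrawl/segments/20140502231126/
content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text

The segment argument should be one of the timestamped directories, not the crawl_generate subdir.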
Same as for Nutch 2.2.1 in pseudo-distributed mode:
bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 10
from within the deploy dir.
However, I remember reading somewhere that deploy execution for the 1.x
series is different from the 2.x series, in that some more files, aside from the
seed.txt, had to be
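For what it's worth, a sketch of running the 1.x crawl script in deploy mode, on the assumption that the seed list must be readable from HDFS; the paths are illustrative:

# put the seeds where the MapReduce job can see them
hadoop fs -mkdir urls
hadoop fs -put seed.txt urls/
# the script takes the seed directory as its first argument
bin/crawl urls TestCrawl http://localhost:8983/solr/ 10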
I have set up Nutch to crawl on Amazon EMR and I have a plugin that
uses GATE (https://gate.ac.uk/) for
text processing in the indexing filters. GATE requires certain static
resources (some XMLs and text files) to be loaded for it to be initialized.
I tried to bundle these resources in the job jar and
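One sketch of that approach, assuming the standard 1.x job artifact name; the gate-resources directory is hypothetical:

# repack the job file so the GATE resources ride along to every EMR node;
# the plugin must then load them via the classloader, not a filesystem path
jar uf runtime/deploy/apache-nutch-1.8.job gate-resources/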