Re: Nutch 1.7 - deleting segments

2014-05-03 Thread remi tassing
you are correct On Fri, May 2, 2014 at 7:46 PM, chethan chethan.p...@gmail.com wrote: Hi, I have a Nutch crawl with 4 segments which are fully indexed using the bin/nutch solrindex command. Now I'm all out of storage on the box, so can I delete the 4 segments and retain only the crawldb

Re: Nutch 1.7 - deleting segments

2014-05-03 Thread chethan
Thanks for your reply! Regards, -- Chethan Prasad On Sat, May 3, 2014 at 12:22 PM, remi tassing tassingr...@gmail.com wrote: you are correct On Fri, May 2, 2014 at 7:46 PM, chethan chethan.p...@gmail.com wrote: Hi, I have a Nutch crawl with 4 segments which are fully indexed using

Nutch 1.8 Solrindexer failing

2014-05-03 Thread BlackIce
Hi, playing around with Nutch 1.8 in local mode on Solr 4.7. When indexing larger crawls (10k and up) I get: Indexer: java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357) at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)

Re: Nutch 1.8 Solrindexer failing

2014-05-03 Thread remi tassing
Could you provide the complete stack trace? It would probably help to add more debug info too. This could be due to some disk size issue... On Sat, May 3, 2014 at 8:51 PM, BlackIce blackice...@gmail.com wrote: Hi, playing around with Nutch 1.8 in local mode on Solr 4.7. When indexing larger crawls (10k and up)

Re: Nutch 1.8 Solrindexer failing

2014-05-03 Thread BlackIce
Bad Request request: http://localhost:8983/solr/update?wt=javabin&version=2 at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430) at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244) at

Re: Nutch 1.7 - deleting segments

2014-05-03 Thread John Lafitte
What would be the case where you would want to keep the segments? I'm considering automatically deleting them after sending the data to Solr. On May 3, 2014 2:29 AM, chethan chethan.p...@gmail.com wrote: Thanks for your reply! Regards, -- Chethan Prasad On Sat, May 3, 2014 at 12:22 PM,

Re: Nutch 1.8 in pseudo dist error

2014-05-03 Thread Sebastian Nagel
Hi, looks like the segment is not addressed properly: hdfs://localhost:54310/user/hduser/TestCrawl/segments/crawl_generate. Segments are named by a time-stamp, e.g. .../TestCrawl/segments/20140502231126/, and crawl_generate is a subdir of the segment. Can you specify the exact commands to run the crawler?
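
A minimal sketch of the check described above, in Python: it tests whether a path ends in a time-stamp-named segment directory or in a subdir such as crawl_generate. The 14-digit pattern is inferred from the example name 20140502231126, and the helper names is_segment_dir and find_segment are hypothetical, not part of Nutch.

```python
import re
from pathlib import PurePosixPath

# Assumption: segment names are 14-digit time-stamps, as in 20140502231126.
SEGMENT_NAME = re.compile(r"^\d{14}$")


def is_segment_dir(path: str) -> bool:
    """Return True if the last path component looks like a segment name."""
    return bool(SEGMENT_NAME.match(PurePosixPath(path).name))


def find_segment(path: str):
    """Return the first time-stamp-named component of the path, or None."""
    for part in PurePosixPath(path).parts:
        if SEGMENT_NAME.match(part):
            return part
    return None


bad = "hdfs://localhost:54310/user/hduser/TestCrawl/segments/crawl_generate"
good = "TestCrawl/segments/20140502231126/"

print(is_segment_dir(bad))   # the path from the error: no time-stamp component
print(is_segment_dir(good))  # a properly addressed segment
print(find_segment("TestCrawl/segments/20140502231126/crawl_generate"))
```

For the path in the error, find_segment returns None, which matches the diagnosis that crawl_generate is being addressed directly under segments/ with no segment directory in between.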

Re: Nutch 1.8 in pseudo dist error

2014-05-03 Thread BlackIce
Same as for Nutch 2.2.1 in pseudo-distributed mode: bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 10, run from within the deploy dir. However, I remember reading somewhere that deploy execution for the 1.x series is different from the 2.x series, and that some more files, besides the seed.txt, had to be

Nutch + GATE on Amazon EMR

2014-05-03 Thread chethan
I have set up Nutch to crawl on Amazon EMR and I have a plugin that uses GATE (https://gate.ac.uk/) for text processing in the indexing filters. GATE requires certain static resources (some XML and text files) to be loaded for it to be initialized. I tried to bundle these resources in the job jar and