Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.
Hi Fred,

Please ensure that the linkdb command was executed successfully; the output logs do not indicate this. It also looks like you've got a '-' (minus) character in front of the relative linkdb directory.

HTH

On Wed, Oct 26, 2011 at 1:27 AM, Fred Zimmerman <zimzaz@gmail.com> wrote:

I'm still having trouble with this in 1.3. It looks as if there's something dumb with the syntax or file structure, but I can't get it.

$ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
SolrIndexer: starting at 2011-10-25 23:26:02
org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist: file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/crawl_fetch
Input path does not exist: file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/crawl_parse
Input path does not exist: file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/parse_data
Input path does not exist: file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/parse_text
Input path does not exist: file:/home/bitnami/nutch-1.3/runtime/local/-linkdb/current

On Tue, Oct 25, 2011 at 12:49 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:

From the changelog: http://svn.apache.org/viewvc/nutch/trunk/CHANGES.txt?view=markup
111 * NUTCH-1054 LinkDB optional during indexing (jnioche)

With your command, the given linkdb is interpreted as a segment. https://issues.apache.org/jira/browse/NUTCH-1054

This is the new command:
Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] (<segment> ... | -dir <segments>) [-noCommit]

On Tuesday 25 October 2011 18:41:09 Bai Shen wrote:

I'm having a similar issue. I'm using 1.4 and getting these errors with the linkdb. The segments seem fine.

2011-10-25 10:10:20,060 INFO solr.SolrIndexer - SolrIndexer: starting at 2011-10-25 10:10:20
2011-10-25 10:10:20,110 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
2011-10-25 10:10:20,110 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/linkdb
2011-10-25 10:10:20,136 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20111025095216
2011-10-25 10:10:20,138 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/2011102514
2011-10-25 10:10:20,207 ERROR solr.SolrIndexer - org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist: file:/opt/nutch-1.4/runtime/local/crawl/linkdb/crawl_fetch
Input path does not exist: file:/opt/nutch-1.4/runtime/local/crawl/linkdb/crawl_parse
Input path does not exist: file:/opt/nutch-1.4/runtime/local/crawl/linkdb/parse_data
Input path does not exist: file:/opt/nutch-1.4/runtime/local/crawl/linkdb/parse_text

Did something change with 1.4?

On Sun, Oct 9, 2011 at 6:15 AM, lewis john mcgibbney <lewis.mcgibb...@gmail.com> wrote:

Hi Fred,

How many individual directories do you have under /runtime/local/crawl/segments/? Another thing that raises alarms is the nohup.out dirs! Are these intentional? Interestingly, the missing segment data is not the same across these dirs. Does your log output indicate any discrepancies between various command transitions?
bitnami@ip-10-202-202-68:~/nutch-1.3/nutch-1.3/runtime/local$ bin/nutch solrindex http://zimzazsearch3-1.bitnamiapp.com:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
SolrIndexer: starting at 2011-10-09 00:13:24
org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist: file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/crawl_fetch
Input path does not exist: file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/crawl_parse
Input path does not exist: file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/parse_data
Input path does not exist: file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/parse_text
Input path does not exist: file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/crawl_fetch
Input path does not exist: file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/crawl_parse
Input path does not exist: file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/parse_data
Input path does not exist: file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/parse_text
Input path does not exist: file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20111008015309/crawl_parse
Input path does not exist:
Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.
Besides, the -linkdb param is 1.4, not 1.3; that's what's wrong here. Bai explicitly mentioned 1.4.

Hi Fred,

Please ensure that the linkdb command was executed successfully; the output logs do not indicate this. It also looks like you've got a '-' (minus) character in front of the relative linkdb directory.

HTH
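For reference, the two invocations at issue look roughly like this. This is a sketch only: the Solr URL and local paths are placeholders, and the 1.4 form follows the usage string quoted earlier in the thread.

# Nutch 1.3: the linkdb is a plain positional argument, no -linkdb flag
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

# Nutch 1.4 (NUTCH-1054): the linkdb is optional and passed with -linkdb
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments

Passing "-linkdb crawl/linkdb" to the 1.3 tool, or "crawl/linkdb" without the flag to the 1.4 tool, makes the indexer treat the linkdb as a segment, which is why it then looks for crawl_fetch, crawl_parse, parse_data and parse_text underneath it.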
Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.
OK, I've fixed the problem with the parameters giving incorrect paths to the files. Now I get this:

$ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl/crawldb crawl/linkdb crawl/segments/*
SolrIndexer: starting at 2011-10-26 12:57:57
java.io.IOException: Job failed!
Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.
That's it:

org.apache.solr.common.SolrException: ERROR: unknown field 'content'
ERROR: unknown field 'content'
request: http://search.zimzaz.com:8983/solr/update?wt=javabin&version=2
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:436)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:245)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
at org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:82)
at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2011-10-26 12:58:20,596 ERROR solr.SolrIndexer - java.io.IOException: Job failed!

On Wed, Oct 26, 2011 at 9:03 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

Check your hadoop.log and Solr log. If that happens there's usually a field mismatch when indexing.

On Wednesday 26 October 2011 14:59:02 Fred Zimmerman wrote:

OK, I've fixed the problem with the parameters giving incorrect paths to the files. Now I get this:

$ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl/crawldb crawl/linkdb crawl/segments/*
SolrIndexer: starting at 2011-10-26 12:57:57
java.io.IOException: Job failed!

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.
Add the schema.xml from nutch/conf to your Solr core.

btw: be careful with your host and port in the mailing lists. If it's open

On Wednesday 26 October 2011 15:07:56 Fred Zimmerman wrote:

That's it: org.apache.solr.common.SolrException: ERROR: unknown field 'content'

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
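A minimal sketch of what "add the schema.xml from nutch/conf to your Solr core" usually amounts to. The paths below are assumptions for a default Nutch 1.3/1.4 source tree and the example Solr core shipped with Solr, so adjust them to your own layout:

cp $NUTCH_HOME/conf/schema.xml $SOLR_HOME/example/solr/conf/schema.xml
# then restart Solr (or reload the core) so the new fields are picked up

If you have already customised Solr's schema.xml, merge the Nutch field definitions into it instead of overwriting the file.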
Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.
I added just the content field ... I have already modified Solr's schema.xml to accommodate some other data types. Now when starting Solr ...

INFO: SolrUpdateServlet.init() done
2011-10-26 13:29:50.849:INFO::Started SocketConnector@0.0.0.0:8983
2011-10-26 13:30:23.129:WARN::/solr/admin/
java.lang.IllegalStateException: STREAM
at org.mortbay.jetty.Response.getWriter(Response.java:616)
etc ...

On Wed, Oct 26, 2011 at 9:16 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

Add the schema.xml from nutch/conf to your Solr core.
Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.
Hi Fred,

These are clearly Solr-aimed questions, which I would observe are specific to your schema. Maybe try the Solr archives for keywords, or else try the Solr user lists. I think you are much more likely to get a substantiated response there.

Thank you

On Wed, Oct 26, 2011 at 3:31 PM, Fred Zimmerman <zimzaz@gmail.com> wrote:

I added just the content field ... I have already modified Solr's schema.xml to accommodate some other data types.

--
*Lewis*
Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.
Will do. Of course I have already googled these terms without much luck.

Fred

On Wed, Oct 26, 2011 at 9:34 AM, lewis john mcgibbney <lewis.mcgibb...@gmail.com> wrote:

Hi Fred, These are clearly Solr-aimed questions, which I would observe are specific to your schema. Maybe try the Solr archives for keywords, or else try the Solr user lists.

--
*Lewis*
Re: Segment cleanup
On Tue, Oct 25, 2011 at 1:25 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:

Is there a reason to keep a segment around after it's been indexed? When following the tutorial, I ended up sending the same segment to the solr server multiple times because I was using segments/* as my argument.

Only send the segment(s) that have not been indexed yet unless you have to reindex everything each time.

-nods- I didn't mean to send the same segment multiple times. I just didn't quite realize what the index command was doing.

Once I've sent it to the solr server, is there any reason not to delete that segment?

You can delete a segment if:
- you don't do any reindexing;
- it's older than the fetch interval (default 30 days) and you are sure all URLs in that segment have already been fetched in newer segment(s);
- you don't need the stored content in the segment.

What do you mean by reindexing? Doesn't nutch handle this with its refetching of content after it expires? Once I parse the content and update the db, wouldn't the segment be irrelevant with regard to whether the URLs get fetched or not? The content already gets stored in the solr server to facilitate highlighting, so I can't see why we would need to store it in nutch.
Re: Fwd: Understanding Nutch workflow
Gotcha. Maybe I'll see about starting a 1.4 version of the tutorial. Not sure if I'll have time, though.

On Tue, Oct 25, 2011 at 2:14 PM, lewis john mcgibbney <lewis.mcgibb...@gmail.com> wrote:

Thanks, this is now sorted out. For reference, you can sign up and commit your own changes to the Nutch wiki. Thanks for pointing this out.

lewis

On Tue, Oct 25, 2011 at 6:43 PM, Bai Shen <baishen.li...@gmail.com> wrote:

BTW, found a typo in the tutorial. It has the following:

bin/nutch parse $1

And it should be this:

bin/nutch parse $s1

On Tue, Sep 27, 2011 at 2:35 PM, lewis john mcgibbney <lewis.mcgibb...@gmail.com> wrote:

OK, I understand. I think the main issue with the documentation was that the fundamental architecture changed between Nutch 1.2 and 1.3. Generally speaking the community appears to understand this, but I fully recognise that there is still a good deal of work to do. Thanks for pointing this out.

Lewis

On Tue, Sep 27, 2011 at 7:32 PM, Bai Shen <baishen.li...@gmail.com> wrote:

I'm not looking for anything to be created. It's just that a lot of the documentation seems to be marked as needing updates for 1.3, and I was wondering what the timeline for completing it was.

On Tue, Sep 27, 2011 at 2:28 PM, lewis john mcgibbney <lewis.mcgibb...@gmail.com> wrote:

Hi,

What documentation, apart from what we have marked as in construction or TODO on the wiki, would you like to see created? It has been a pretty long process getting these resources up-to-date, however we are getting there!

BTW, do you know what the timeline is to have the documentation updated for 1.3?

It is as we speak. Lewis did quite a good job for the wiki docs on 1.3.

--
*Lewis*
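For context, the tutorial steps around that typo look roughly like this. This is a sketch from memory rather than a verbatim copy of the wiki page, so treat the exact variable name and paths as illustrative:

# generate a segment and capture its directory name in $s1
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`

# fetch and parse that segment, then fold the results back into the crawldb
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1

The typo report above concerns the parse line: "$1" would refer to a shell script argument, while "$s1" is the segment path captured by the preceding ls.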
Re: Segment cleanup
On Wednesday 26 October 2011 16:24:15 Bai Shen wrote:

What do you mean by reindexing? Doesn't nutch handle this with its refetching of content after it expires? Once I parse the content and update the db, wouldn't the segment be irrelevant with regard to whether the URLs get fetched or not? The content already gets stored in the solr server to facilitate highlighting, so I can't see why we would need to store it in nutch.

Reindexing is useful for development environments, or even production, when Solr's index-time analysis changes. If you change how tokens get indexed, or enable or disable norms, you must reindex from scratch.

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
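If all three conditions listed above hold, cleanup is just a matter of removing segment directories. A hedged example: the crawl directory layout and the 30-day figure are assumptions based on the defaults discussed above, and you should confirm which segments have actually been indexed before deleting anything.

# remove segment directories older than the default 30-day fetch interval
find crawl/segments -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +

As the exchange above suggests, Nutch itself does not track which segments have been sent to Solr, so any such bookkeeping has to live in your own scripts.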
1) success 2) how to tell Nutch index everything
1) I resolved the issues with solrindex. It turned out to be a matter of adding all the Nutch schema-specific fields to Solr's schema.xml. There was one gotcha: the latest Solr schema does not have a default fieldtype "text" as in Nutch 1.3's schema.xml; you must use "text_general". A comment for developers: the use case of copying the Nutch schema to overwrite the Solr one only works for people who are beginning their indexing with a crawl. More detailed instructions on how to modify solr/schema.xml for Nutch would be helpful, or better yet, a script to add the appropriate fields.

2) Is there a way to tell Nutch to index everything at a given site? I am crawling a couple of my own sites and it seems rather clumsy just to give Nutch a big topN. Wouldn't an "all" value be helpful?
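To make the gotcha in (1) concrete, the fix amounts to declaring the Nutch fields in Solr's schema.xml against a field type that actually exists there. A hedged sketch, assuming a stock Solr example schema that defines "text_general" but not "text"; the attribute values are illustrative, and the authoritative field list is the schema.xml shipped in nutch/conf:

<!-- excerpt only: Nutch fields mapped to Solr's text_general type -->
<field name="content" type="text_general" stored="false" indexed="true"/>
<field name="title" type="text_general" stored="true" indexed="true"/>

Alternatively, you can keep the Nutch schema untouched and instead define a fieldType named "text" in Solr's schema.xml, so that the copied field declarations resolve as shipped.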
Re: Fwd: Understanding Nutch workflow
1.3 will cover 1.4. The main point was the change in architecture, taking into consideration the new runtime directory structure introduced in Nutch 1.3. Feel free to join me in getting a Hadoop tutorial for 1.4 written; it's been on the agenda but somewhat shelved.

On Wed, Oct 26, 2011 at 4:25 PM, Bai Shen <baishen.li...@gmail.com> wrote:

Gotcha. Maybe I'll see about starting a 1.4 version of the tutorial. Not sure if I'll have time, though.

--
*Lewis*
Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)
I've got a few very large (upwards of 3 MB) XML files I'm trying to index, and I'm having trouble. Previously I'd had trouble with the fetch; now that seems to be okay, but due to the size of the files the parse takes much too long. Is there a good way to optimize this that I'm missing? Is lengthy parsing of XML a known problem? I recognize that part of my problem is that I'm doing my testing from my aging desktop PC, and it will run faster when I move things to the server, but it's still slow.

I do get the following weird message in my log when I run ParserChecker or the crawler:

2011-10-26 09:51:47,729 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file
2011-10-26 10:06:40,639 WARN parse.ParseUtil - TIMEOUT parsing http://www.aip.org/history/ead/19990074.xml with org.apache.nutch.parse.tika.TikaParser@18355aa
2011-10-26 10:06:40,639 WARN parse.ParseUtil - Unable to successfully parse content http://www.aip.org/history/ead/19990074.xml of type application/xml

My ParserChecker results look like this:

# bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://www.aip.org/history/ead/19990074.xml
- Url ---
http://www.aip.org/history/ead/19990074.xml
- ParseData -
Version: 5
Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
Title:
Outlinks: 0
Content Metadata:
Parse Metadata:
- ParseText -

And here's everything that might be relevant in my nutch-site.xml; I've tried it both with and without the urlmeta plugin, and that doesn't make a difference:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, all outlinks will be processed.</description>
</property>

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file:// protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the http.content.limit setting.</description>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all.</description>
</property>

<property>
  <name>ftp.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Caution: classical ftp RFCs never define partial transfer and, in fact, some ftp servers out there do not handle client side forced close-down very well. Our implementation tries its best to handle such situations smoothly.</description>
</property>

<property>
  <name>http.timeout</name>
  <value>4294967290</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>ftp.timeout</name>
  <value>4294967290</value>
  <description>Default timeout for ftp client socket, in millisec. Please also see ftp.keep.connection below.</description>
</property>

<property>
  <name>ftp.server.timeout</name>
  <value>4294967290</value>
  <description>An estimation of ftp server idle time, in millisec. Typically it is 12 millisec for many ftp servers out there. Better be conservative here. Together with ftp.timeout, it is used to decide if we need to delete (annihilate) current ftp.client instance and force to start another ftp.client instance anew. This is necessary because a fetcher thread may not be able to obtain next request from queue in time (due to idleness) before our ftp client times out or remote server disconnects. Used only when ftp.keep.connection is true (please see below).</description>
</property>

<property>
  <name>parser.timeout</name>
  <value>900</value>
  <description>Timeout in seconds for the parsing of a document, otherwise treats it as an exception and moves on to the following documents. This parameter is applied to any Parser implementation. Set to -1 to deactivate, bearing in mind that this could cause the parsing to crash because of a very long or corrupted document.</description>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>1</value>
  <description>The number of FetcherThreads the fetcher should use. This also determines the maximum number of requests that are made at once (each FetcherThread handles one connection).</description>
</property>

<property>
  <name>plugin.includes</name>
Re: 1) success 2) how to tell Nutch index everything
On Wednesday 26 October 2011 16:37:14 Fred Zimmerman wrote:

1) I resolved the issues with solrindex. It turned out to be a matter of adding all the Nutch schema-specific fields to Solr's schema.xml. There was one gotcha: the latest Solr schema does not have a default fieldtype "text" as in Nutch 1.3's schema.xml; you must use "text_general".

You're free to use any naming convention you want in Solr. We ship a complete working Solr schema, and the fieldType's name doesn't really matter. We do not intend to ship an advanced schema; developers must make changes that are appropriate for their specific environment, use-cases and scenario.

A comment for developers: the use case of copying the Nutch schema to overwrite the Solr one only works for people who are beginning their indexing with a crawl. More detailed instructions on how to modify solr/schema.xml for Nutch would be helpful, or better yet, a script to add the appropriate fields.

The Solr schema provided with Nutch tells you exactly which fields are used. Detailed instructions on how to work it with Solr are out of scope in my opinion. You're of course free to make changes to the wiki :)

2) Is there a way to tell Nutch to index everything at a given site? I am crawling a couple of my own sites and it seems rather clumsy just to give Nutch a big topN. Wouldn't an "all" value be helpful?

The only way to do this is to keep running crawl cycles until all existing and to-be-discovered URLs are exhausted, until the fetch interval tells the generator to refetch.

Cheers

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
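A rough sketch of what "keep running crawl cycles until everything is exhausted" can look like in practice. The step names follow the standard Nutch 1.3/1.4 command line, but the topN value, iteration count and paths are placeholders, and the "|| break" assumes generate exits non-zero when no URLs are selected for fetching; check your version's behaviour before relying on it.

# repeat the generate/fetch/parse/updatedb cycle a fixed number of times
for i in 1 2 3 4 5; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000 || break
  s=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch parse $s
  bin/nutch updatedb crawl/crawldb $s
done
bin/nutch invertlinks crawl/linkdb -dir crawl/segments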
Re: Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)
The actual parse which is producing the timeouts happens early in the process. There are, to my knowledge, no Nutch settings to make this faster or change its behaviour; it's all about the parser implementation. Try increasing your parser.timeout setting.

On Wednesday 26 October 2011 16:45:33 Chip Calhoun wrote:

I've got a few very large (upwards of 3 MB) XML files I'm trying to index, and I'm having trouble. Previously I'd had trouble with the fetch; now that seems to be okay, but due to the size of the files the parse takes much too long.
RE: Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)
Increasing parser.timeout to 3600 got me what I needed. I only have a few files this huge, so I'll live with that.

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Wednesday, October 26, 2011 10:55 AM
To: user@nutch.apache.org
Subject: Re: Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)

The actual parse which is producing the timeouts happens early in the process. There are, to my knowledge, no Nutch settings to make this faster or change its behaviour; it's all about the parser implementation. Try increasing your parser.timeout setting.
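In nutch-site.xml terms, the change that resolved this is simply a larger parser.timeout value. A sketch, with the 3600 figure taken from the message above; the description text is paraphrased rather than copied from nutch-default.xml:

<property>
  <name>parser.timeout</name>
  <value>3600</value>
  <description>Timeout in seconds for parsing a single document; raise it for very large XML files, or set -1 to disable the timeout entirely.</description>
</property>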
Re: Fetcher NPE's
Hi Markus,

the error resembles a problem I observed some time ago but never managed to open an issue for. Opened right now: https://issues.apache.org/jira/browse/NUTCH-1182

The stack you observed is the same.

Sebastian

On 10/19/2011 05:01 PM, Markus Jelsma wrote:

Hi,

We sometimes see a fetcher task failing with 0 pages. Inspecting the logs, it's clear URLs are actually fetched until, for some reason, an NPE occurs. The thread then dies and seems to output 0 records. The URLs themselves are fetchable using index- or parser checker, no problem there. Any ideas how we can pinpoint the source of the issue?

Thanks,

A sample exception:

2011-10-19 14:30:50,145 INFO org.apache.nutch.fetcher.Fetcher: fetch of http://SOME_URL/ failed with: java.lang.NullPointerException
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: java.lang.NullPointerException
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at java.lang.System.arraycopy(Native Method)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1276)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1193)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at java.io.DataOutputStream.writeByte(DataOutputStream.java:136)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:264)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:244)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at org.apache.hadoop.io.Text.write(Text.java:281)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1060)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:591)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:936)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:805)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: fetcher caught:java.lang.NullPointerException

The code catching the error:

801      } catch (Throwable t) {                 // unexpected exception
802        // unblock
803        fetchQueues.finishFetchItem(fit);
804        logError(fit.url, t.toString());
805        output(fit.url, fit.datum, null, ProtocolStatus.STATUS_FAILED, CrawlDatum.STATUS_FETCH_RETRY);
806      }