Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

2011-10-26 Thread lewis john mcgibbney
Hi Fred,

Please ensure that the linkdb command was executed successfully. The output
logs do not indicate this.
It also looks like you've got a '-' (minus) character in front of the relative
linkdb directory.

HTH
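
For reference, a rough 1.3 sequence (assuming the usual crawl/ layout and
your own Solr URL) would be to build the linkdb first and then pass it as a
plain positional argument, without the '-' flag:

bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://search.zimzaz.com:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

This is only a sketch; run bin/nutch invertlinks and bin/nutch solrindex
without arguments to see the exact usage your version prints.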

On Wed, Oct 26, 2011 at 1:27 AM, Fred Zimmerman zimzaz@gmail.com wrote:

 I'm still having trouble with this in 1.3. It looks as if there's something
 dumb with the syntax or file structure, but I can't get it.

 $ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl/crawldb
 -linkdb crawl/linkdb crawl/segments/*

 SolrIndexer: starting at 2011-10-25 23:26:02
 org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
 file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/crawl_fetch
 Input path does not exist:
 file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/crawl_parse
 Input path does not exist:
 file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/parse_data
 Input path does not exist:
 file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/parse_text
 Input path does not exist:
 file:/home/bitnami/nutch-1.3/runtime/local/-linkdb/current


 On Tue, Oct 25, 2011 at 12:49 PM, Markus Jelsma markus.jel...@openindex.io wrote:

  From the changelog:
  http://svn.apache.org/viewvc/nutch/trunk/CHANGES.txt?view=markup
 
  111 * NUTCH-1054 LinkDB optional during indexing (jnioche)
 
  With your command, the given linkdb is interpreted as a segment.
 
  https://issues.apache.org/jira/browse/NUTCH-1054
 
  This is the new command:
 
  Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] (<segment> ... | -dir <segments>) [-noCommit]
 
  On Tuesday 25 October 2011 18:41:09 Bai Shen wrote:
   I'm having a similar issue.  I'm using 1.4 and getting these errors
 with
   linkdb.  The segments seem fine.
  
   2011-10-25 10:10:20,060 INFO  solr.SolrIndexer - SolrIndexer: starting
 at
   2011-10-25 10:10:20
   2011-10-25 10:10:20,110 INFO  indexer.IndexerMapReduce -
  IndexerMapReduce:
   crawldb: crawl/crawldb
   2011-10-25 10:10:20,110 INFO  indexer.IndexerMapReduce -
  IndexerMapReduces:
   adding segment: crawl/linkdb
   2011-10-25 10:10:20,136 INFO  indexer.IndexerMapReduce -
  IndexerMapReduces:
   adding segment: crawl/segments/20111025095216
   2011-10-25 10:10:20,138 INFO  indexer.IndexerMapReduce -
  IndexerMapReduces:
   adding segment: crawl/segments/2011102514
   2011-10-25 10:10:20,207 ERROR solr.SolrIndexer -
   org.apache.hadoop.mapred.InvalidInputException: Input path does not
  exist:
   file:/opt/nutch-1.4/runtime/local/crawl/linkdb/crawl_fetch
   Input path does not exist:
   file:/opt/nutch-1.4/runtime/local/crawl/linkdb/crawl_parse
   Input path does not exist:
   file:/opt/nutch-1.4/runtime/local/crawl/linkdb/parse_data
   Input path does not exist:
   file:/opt/nutch-1.4/runtime/local/crawl/linkdb/parse_text
  
  
   Did something change with 1.4?
  
   On Sun, Oct 9, 2011 at 6:15 AM, lewis john mcgibbney 
  
   lewis.mcgibb...@gmail.com wrote:
Hi Fred,
   
How many individual directories do you have under
/runtime/local/crawl/segments/
?
   
Another thing that raises alarms is the nohup.out dir's! Are these
intentional? Interestingly, missing segment data is not the same with
these dir's.
   
Does your log output indicate any discrepancies between various
 command
transitions?
   
   
   
 bitnami@ip-10-202-202-68:~/nutch-1.3/nutch-1.3/runtime/local$ bin/nutch
 solrindex http://zimzazsearch3-1.bitnamiapp.com:8983/solr/crawl/crawldb
 crawl/linkdb crawl/segments/*
 SolrIndexer: starting at 2011-10-09 00:13:24
 org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
 file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/crawl_fetch
 Input path does not exist:
 file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/crawl_parse
 Input path does not exist:
 file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/parse_data
 Input path does not exist:
 file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/parse_text
 Input path does not exist:
 file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/crawl_fetch
 Input path does not exist:
 file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/crawl_parse
 Input path does not exist:
 file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/parse_data
 Input path does not exist:
 file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/parse_text
 Input path does not exist:
 file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20111008015309/crawl_parse
 Input path does not exist:

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

2011-10-26 Thread Markus Jelsma
Besides, the -linkdb param is 1.4, not 1.3; that's what's wrong here. Bai
explicitly mentioned 1.4.
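
In other words, a rough sketch of the two forms (the Solr host/port below are
placeholders, and the crawl/ layout is the usual tutorial one):

# Nutch 1.3: linkdb is a plain positional argument
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

# Nutch 1.4 (NUTCH-1054): linkdb is optional and passed via -linkdb
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

Check the usage string your own bin/nutch solrindex prints to be sure.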

 Hi Fred,
 
 Please ensure that the linkdb command was executed successfully. The output
 logs do not indicate this.
 It also looks like you've got a '-' (minus) character in front of the relative
 linkdb directory.
 
 HTH
 
 On Wed, Oct 26, 2011 at 1:27 AM, Fred Zimmerman zimzaz@gmail.comwrote:
  I'm still having trouble with this in 1.3. looks as if there's something
  dumb with syntax or file structure but can't get it.
  
  $ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl/crawldb
  -linkdb crawl/linkdb crawl/segments/*
  
  SolrIndexer: starting at 2011-10-25 23:26:02
  org.apache.hadoop.mapred.InvalidInputException: Input path does not
  exist:
  file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/crawl_fetch
  Input path does not exist:
  file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/crawl_parse
  Input path does not exist:
  file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/parse_data
  Input path does not exist:
  file:/home/bitnami/nutch-1.3/runtime/local/crawl/linkdb/parse_text
  Input path does not exist:
  file:/home/bitnami/nutch-1.3/runtime/local/-linkdb/current
  
  
  On Tue, Oct 25, 2011 at 12:49 PM, Markus Jelsma
  
  markus.jel...@openindex.iowrote:
   From the changelog:
   http://svn.apache.org/viewvc/nutch/trunk/CHANGES.txt?view=markup
   
   111 * NUTCH-1054 LinkDB optional during indexing (jnioche)
   
   With your command, the given linkdb is interpreted as a segment.
   
   https://issues.apache.org/jira/browse/NUTCH-1054
   
   This is the new command:
   
   Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] (<segment> ... | -dir <segments>) [-noCommit]
   
   On Tuesday 25 October 2011 18:41:09 Bai Shen wrote:
I'm having a similar issue.  I'm using 1.4 and getting these errors
  
  with
  
linkdb.  The segments seem fine.

2011-10-25 10:10:20,060 INFO  solr.SolrIndexer - SolrIndexer:
starting
  
  at
  
2011-10-25 10:10:20
2011-10-25 10:10:20,110 INFO  indexer.IndexerMapReduce -
   
   IndexerMapReduce:
crawldb: crawl/crawldb
2011-10-25 10:10:20,110 INFO  indexer.IndexerMapReduce -
   
   IndexerMapReduces:
adding segment: crawl/linkdb
2011-10-25 10:10:20,136 INFO  indexer.IndexerMapReduce -
   
   IndexerMapReduces:
adding segment: crawl/segments/20111025095216
2011-10-25 10:10:20,138 INFO  indexer.IndexerMapReduce -
   
   IndexerMapReduces:
adding segment: crawl/segments/2011102514
2011-10-25 10:10:20,207 ERROR solr.SolrIndexer -
org.apache.hadoop.mapred.InvalidInputException: Input path does not
   
   exist:
file:/opt/nutch-1.4/runtime/local/crawl/linkdb/crawl_fetch
Input path does not exist:
file:/opt/nutch-1.4/runtime/local/crawl/linkdb/crawl_parse
Input path does not exist:
file:/opt/nutch-1.4/runtime/local/crawl/linkdb/parse_data
Input path does not exist:
file:/opt/nutch-1.4/runtime/local/crawl/linkdb/parse_text


Did something change with 1.4?

On Sun, Oct 9, 2011 at 6:15 AM, lewis john mcgibbney 

lewis.mcgibb...@gmail.com wrote:
 Hi Fred,
 
 How many individual directories do you have under
 /runtime/local/crawl/segments/
 ?
 
 Another thing that raises alarms is the nohup.out dir's! Are these
 intentional? Interestingly, missing segment data is not the same
 with these dir's.
 
 Does your log output indicate any discrepancies between various
  
  command
  
 transitions?
 
 
 
  bitnami@ip-10-202-202-68:~/nutch-1.3/nutch-1.3/runtime/local$ bin/nutch
  solrindex http://zimzazsearch3-1.bitnamiapp.com:8983/solr/crawl/crawldb
  crawl/linkdb crawl/segments/*
  SolrIndexer: starting at 2011-10-09 00:13:24
  org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
  file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/crawl_fetch
  Input path does not exist:
  file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/crawl_parse
  Input path does not exist:
  file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/parse_data
  Input path does not exist:
  file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922143907/parse_text
  Input path does not exist:
  file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/crawl_fetch
  Input path does not exist:
  file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/crawl_parse
  Input path does not exist:
  file:/home/bitnami/nutch-1.3/nutch-1.3/runtime/local/crawl/segments/20110922144329/parse_data
  Input

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

2011-10-26 Thread Fred Zimmerman
OK, I've fixed the problem with the parameters giving incorrect paths to the
files. Now I get this:

$ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl/crawldb
crawl/linkdb crawl/segments/*
SolrIndexer: starting at 2011-10-26 12:57:57
java.io.IOException: Job failed!
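
(When the local job runner only reports "Job failed!", the underlying
exception usually ends up in the runtime's Hadoop log; assuming the default
1.3 local layout, running the following from runtime/local will show it.)

tail -n 100 logs/hadoop.log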


Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

2011-10-26 Thread Fred Zimmerman
that's it.

org.apache.solr.common.SolrException: ERROR:unknown field 'content'

*ERROR:unknown field 'content'*

request: http://search.zimzaz.com:8983/solr/update?wt=javabin&version=2
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:436)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:245)
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
at
org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:82)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
at
org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2011-10-26 12:58:20,596 ERROR solr.SolrIndexer - java.io.IOException: Job
failed!


 On Wed, Oct 26, 2011 at 9:03 AM, Markus Jelsma markus.jel...@openindex.io wrote:

  Check your hadoop.log and Solr log. If that happens there's usually a field
  mismatch when indexing.

 On Wednesday 26 October 2011 14:59:02 Fred Zimmerman wrote:
  OK, I've fixed the problem with the parameters giving incorrect paths to
  the files. Now I get this:
 
  $ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl/crawldb
  crawl/linkdb crawl/segments/*
  SolrIndexer: starting at 2011-10-26 12:57:57
  java.io.IOException: Job failed!

 --
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536620 / 06-50258350



Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

2011-10-26 Thread Markus Jelsma
Add the schema.xml from nutch/conf to your Solr core.
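
Roughly, with paths that will differ per install ($NUTCH_HOME and $SOLR_HOME
are placeholders, and the target below is the stock Solr example core):

cp $NUTCH_HOME/conf/schema.xml $SOLR_HOME/example/solr/conf/schema.xml

Then restart Solr so the core picks up the new schema.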

btw: be careful with your host and port in the mailing lists. If it's open

On Wednesday 26 October 2011 15:07:56 Fred Zimmerman wrote:
 that's it.
 
 org.apache.solr.common.SolrException: ERROR:unknown field 'content'
 
 *ERROR:unknown field 'content'*
 
  request: http://url/solr/update?wt=javabin&version=2
 at
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttp
 SolrServer.java:436) at
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttp
 SolrServer.java:245) at
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(Abstract
 UpdateRequest.java:105) at
 org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49) at
 org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:82)
 at
 org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.ja
 va:48) at
 org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
 at
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
 2011-10-26 12:58:20,596 ERROR solr.SolrIndexer - java.io.IOException: Job
 failed!
 
 
 On Wed, Oct 26, 2011 at 9:03 AM, Markus Jelsma
 
 markus.jel...@openindex.iowrote:
   Check your hadoop.log and Solr log. If that happens there's usually a
   field mismatch when indexing.
  
  On Wednesday 26 October 2011 14:59:02 Fred Zimmerman wrote:
   OK, I've fixed the problem with the parameters giving incorrect paths
   to the files. Now I get this:
   
   $ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl/crawldb
   crawl/linkdb crawl/segments/*
   SolrIndexer: starting at 2011-10-26 12:57:57
   java.io.IOException: Job failed!
  
  --
  Markus Jelsma - CTO - Openindex
  http://www.linkedin.com/in/markus17
  050-8536620 / 06-50258350

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

2011-10-26 Thread Fred Zimmerman
I added just the content field ... I have already modified solr's
schema.xml to accommodate some other data types.

Now when starting solr ...

INFO: SolrUpdateServlet.init() done
2011-10-26 13:29:50.849:INFO::Started SocketConnector@0.0.0.0:8983
2011-10-26 13:30:23.129:WARN::/solr/admin/
java.lang.IllegalStateException: STREAM
at org.mortbay.jetty.Response.getWriter(Response.java:616) etc ...


 On Wed, Oct 26, 2011 at 9:16 AM, Markus Jelsma markus.jel...@openindex.io wrote:

 Add the schema.xml from nutch/conf to your Solr core.

 btw: be careful with your host and port in the mailing lists. If it's
 open

 On Wednesday 26 October 2011 15:07:56 Fred Zimmerman wrote:
  that's it.
 
  org.apache.solr.common.SolrException: ERROR:unknown field 'content'
 
  *ERROR:unknown field 'content'*
 
  request: http://url/solr/update?wt=javabinversion=2
  at
 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttp
  SolrServer.java:436) at
 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttp
  SolrServer.java:245) at
 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(Abstract
  UpdateRequest.java:105) at
  org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49) at
  org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:82)
  at
 
 org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.ja
  va:48) at
  org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
  at
  org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
  2011-10-26 12:58:20,596 ERROR solr.SolrIndexer - java.io.IOException: Job
  failed!
 
 
  On Wed, Oct 26, 2011 at 9:03 AM, Markus Jelsma
 
  markus.jel...@openindex.iowrote:
   Check your hadoop.log and Solr log. If that happens there's usually i
   field mismatch when indexing.
  
   On Wednesday 26 October 2011 14:59:02 Fred Zimmerman wrote:
OK, I've fixed the problem with the parameters giving incorrect paths
to the files. Now I get this:
   
$ bin/nutch solrindex http://search.zimzaz.com:8983/solrcrawl/crawldb
crawl/linkdb crawl/segments/*
SolrIndexer: starting at 2011-10-26 12:57:57
java.io.IOException: Job failed!
  
   --
   Markus Jelsma - CTO - Openindex
   http://www.linkedin.com/in/markus17
   050-8536620 / 06-50258350

 --
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536620 / 06-50258350



Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

2011-10-26 Thread lewis john mcgibbney
Hi Fred,

These are clearly Solr-aimed questions, which I would observe are specific
to your schema. Maybe try the Solr archives for keywords, or else try the
Solr user lists. I think you are much more likely to get a substantiated
response there.

Thank you

On Wed, Oct 26, 2011 at 3:31 PM, Fred Zimmerman zimzaz@gmail.com wrote:

 I added just the content field ... I have already modified solr's
 schema.xml to accommodate some other data types.

 Now when starting solr ...

 INFO: SolrUpdateServlet.init() done
 2011-10-26 13:29:50.849:INFO::Started SocketConnector@0.0.0.0:8983
 2011-10-26 13:30:23.129:WARN::/solr/admin/
 java.lang.IllegalStateException: STREAM
at org.mortbay.jetty.Response.getWriter(Response.java:616) etc ...


 On Wed, Oct 26, 2011 at 9:16 AM, Markus Jelsma
 markus.jel...@openindex.iowrote:

  Add the schema.xml from nutch/conf to your Solr core.
 
  btw: be careful with your host and port in the mailing lists. If it's
  open
 
  On Wednesday 26 October 2011 15:07:56 Fred Zimmerman wrote:
   that's it.
  
   org.apache.solr.common.SolrException: ERROR:unknown field 'content'
  
   *ERROR:unknown field 'content'*
  
   request: http://url/solr/update?wt=javabinversion=2
   at
  
 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttp
   SolrServer.java:436) at
  
 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttp
   SolrServer.java:245) at
  
 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(Abstract
   UpdateRequest.java:105) at
   org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49) at
   org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:82)
   at
  
 
 org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.ja
   va:48) at
   org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
   at
  
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
   2011-10-26 12:58:20,596 ERROR solr.SolrIndexer - java.io.IOException:
 Job
   failed!
  
  
   On Wed, Oct 26, 2011 at 9:03 AM, Markus Jelsma
  
   markus.jel...@openindex.iowrote:
Check your hadoop.log and Solr log. If that happens there's usually i
field mismatch when indexing.
   
On Wednesday 26 October 2011 14:59:02 Fred Zimmerman wrote:
 OK, I've fixed the problem with the parameters giving incorrect
 paths
 to the files. Now I get this:

 $ bin/nutch solrindex
 http://search.zimzaz.com:8983/solrcrawl/crawldb
 crawl/linkdb crawl/segments/*
 SolrIndexer: starting at 2011-10-26 12:57:57
 java.io.IOException: Job failed!
   
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
 
  --
  Markus Jelsma - CTO - Openindex
  http://www.linkedin.com/in/markus17
  050-8536620 / 06-50258350
 




-- 
*Lewis*


Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

2011-10-26 Thread Fred Zimmerman
will do.  Of course I have already googled these terms without much luck.
 Fred

On Wed, Oct 26, 2011 at 9:34 AM, lewis john mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi Fred,

  These are clearly Solr-aimed questions, which I would observe are specific
  to your schema. Maybe try the Solr archives for keywords, or else try the
  Solr user lists. I think you are much more likely to get a substantiated
  response there.

 Thank you

 On Wed, Oct 26, 2011 at 3:31 PM, Fred Zimmerman zimzaz@gmail.com
 wrote:

  I added just the content field ... I have already modified solr's
  schema.xml to accommodate some other data types.
 
  Now when starting solr ...
 
  INFO: SolrUpdateServlet.init() done
  2011-10-26 13:29:50.849:INFO::Started SocketConnector@0.0.0.0:8983
  2011-10-26 13:30:23.129:WARN::/solr/admin/
  java.lang.IllegalStateException: STREAM
 at org.mortbay.jetty.Response.getWriter(Response.java:616) etc ...
 
 
  On Wed, Oct 26, 2011 at 9:16 AM, Markus Jelsma
  markus.jel...@openindex.iowrote:
 
   Add the schema.xml from nutch/conf to your Solr core.
  
   btw: be careful with your host and port in the mailing lists. If it's
   open
  
   On Wednesday 26 October 2011 15:07:56 Fred Zimmerman wrote:
that's it.
   
org.apache.solr.common.SolrException: ERROR:unknown field 'content'
   
*ERROR:unknown field 'content'*
   
request: http://url/solr/update?wt=javabinversion=2
at
   
  
 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttp
SolrServer.java:436) at
   
  
 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttp
SolrServer.java:245) at
   
  
 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(Abstract
UpdateRequest.java:105) at
org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49) at
org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:82)
at
   
  
 
 org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.ja
va:48) at
   
 org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
at
 org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
at
   
  org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2011-10-26 12:58:20,596 ERROR solr.SolrIndexer - java.io.IOException:
  Job
failed!
   
   
On Wed, Oct 26, 2011 at 9:03 AM, Markus Jelsma
   
markus.jel...@openindex.iowrote:
 Check your hadoop.log and Solr log. If that happens there's usually
 i
 field mismatch when indexing.

 On Wednesday 26 October 2011 14:59:02 Fred Zimmerman wrote:
  OK, I've fixed the problem with the parameters giving incorrect
  paths
  to the files. Now I get this:
 
  $ bin/nutch solrindex
  http://search.zimzaz.com:8983/solrcrawl/crawldb
  crawl/linkdb crawl/segments/*
  SolrIndexer: starting at 2011-10-26 12:57:57
  java.io.IOException: Job failed!

 --
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536620 / 06-50258350
  
   --
   Markus Jelsma - CTO - Openindex
   http://www.linkedin.com/in/markus17
   050-8536620 / 06-50258350
  
 



 --
 *Lewis*



Re: Segment cleanup

2011-10-26 Thread Bai Shen
On Tue, Oct 25, 2011 at 1:25 PM, Markus Jelsma markus.jel...@openindex.io wrote:

  Is there a reason to keep a segment around after it's been indexed?  When
  following the tutorial, I ended up sending the same segment to the solr
  server multiple times because I was using segments/* as my argument.

 Only send the segment(s) that have not been indexed yet unless you have to
 reindex everything each time.


-nods-  I didn't mean to send the same segment multiple times.  I just
didn't quite realize what the index command was doing.


 
  Once I've sent it to the solr server, is there any reason not to delete
  that segment?

  You can delete a segment if:
  - you don't do any reindexing;
  - it's older than the fetch interval (default 30 days) and you are sure all
  URLs in that segment have already been fetched in newer segment(s);
  - you don't need the stored content in the segment.


What do you mean by reindexing?  Doesn't Nutch handle this with its
refetching of content after it expires?

Once I parse the content and update the db, wouldn't the segment be
irrelevant with regard to whether pages get fetched or not?

The content already gets stored in the Solr server to facilitate
highlighting, so I can't see why we would need to store it in Nutch.


Re: Fwd: Understanding Nutch workflow

2011-10-26 Thread Bai Shen
Gotcha.  Maybe I'll see about starting a 1.4 version of the tutorial.  Not
sure if I'll have time, though.

On Tue, Oct 25, 2011 at 2:14 PM, lewis john mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Thanks, this is now sorted out.

  For reference, you can sign up and commit your own changes to the Nutch
 wiki.


 Thanks for pointing this out.

 lewis

 On Tue, Oct 25, 2011 at 6:43 PM, Bai Shen baishen.li...@gmail.com wrote:

  BTW, found a typo in the tutorial.  It has the following.
 
  bin/nutch parse $1
 
 
  And it should be this.
 
  bin/nutch parse $s1
 
 
 
  On Tue, Sep 27, 2011 at 2:35 PM, lewis john mcgibbney 
  lewis.mcgibb...@gmail.com wrote:
 
   OK I understand. I think the main task with the documentation was that
  the
    fundamental architecture changed between Nutch 1.2 & 1.3. Generally
   speaking
   the community appears to understand this, but I fully recognise that
  there
   is still a good deal of work to do. Thanks for pointing this out.
  
   Lewis
  
   On Tue, Sep 27, 2011 at 7:32 PM, Bai Shen baishen.li...@gmail.com
  wrote:
  
I'm not looking for anything to be created.  It's just that a lot of
  the
documentation seems to be marked as needing updates for 1.3 and I was
wondering what the timeline for completing it was.
   
On Tue, Sep 27, 2011 at 2:28 PM, lewis john mcgibbney 
lewis.mcgibb...@gmail.com wrote:
   
 Hi,

 What documentation apart from what we have marked as in
 construction
  or
 TODO
 on the wiki would you like to see created?

 It has been a pretty long process getting these resources
 up-to-date,
 however we are getting there!

 
  
   BTW, do you know what the timeline is to have the documentation
updated
  for
   1.3?
 
  It is as we speak. Lewis did quite a good job for the wiki docs
 on
   1.3.
 



 --
 *Lewis*

   
  
  
  
   --
   *Lewis*
  
 



 --
 *Lewis*



Re: Segment cleanup

2011-10-26 Thread Markus Jelsma


On Wednesday 26 October 2011 16:24:15 Bai Shen wrote:
 On Tue, Oct 25, 2011 at 1:25 PM, Markus Jelsma
 
 markus.jel...@openindex.iowrote:
   Is there a reason to keep a segment around after it's been indexed? 
   When following the tutorial, I ended up sending the same segment to
   the solr server multiple times because I was using segments/* as my
   argument.
  
  Only send the segment(s) that have not been indexed yet unless you have
  to reindex everything each time.
 
 -nods-  I didn't mean to send the same segment multiple times.  I just
 didn't quite realize what the index command was doing.
 
   Once I've sent it to the solr server, is there any reason not to delete
   that segment?
  
   You can delete a segment if:
   - you don't do any reindexing;
   - it's older than the fetch interval (default 30 days) and you are sure all
   URLs in that segment have already been fetched in newer segment(s);
   - you don't need the stored content in the segment.
 
  What do you mean by reindexing?  Doesn't Nutch handle this with its
  refetching of content after it expires?
  
  Once I parse the content and update the db, wouldn't the segment be
  irrelevant with regard to whether pages get fetched or not?
  
  The content already gets stored in the Solr server to facilitate
  highlighting, so I can't see why we would need to store it in Nutch.

Reindexing is useful for development environments, or even production, when
Solr's index-time analysis changes. If you change how tokens get indexed, or
enable or disable norms, you must reindex from scratch.
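
If those conditions are met, cleanup itself is just removing the segment
directories. A rough sketch, assuming the local crawl/segments layout, a
30-day fetch interval, and that you really don't need the stored content
(verify the list before deleting anything; removed segments cannot be
recovered):

# list segment directories older than 30 days
find crawl/segments -mindepth 1 -maxdepth 1 -type d -mtime +30
# once verified, remove them
find crawl/segments -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -r {} +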



-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


1) success 2) how to tell Nutch index everything

2011-10-26 Thread Fred Zimmerman
1) I resolved the issues with solrindex. It turned out to be a matter of
adding all the Nutch schema-specific fields to Solr's schema.xml. There was
one gotcha: the latest Solr schema does not have a default fieldtype "text"
as in Nutch 1.3's schema.xml; you must use "text_general". A comment for
developers: the use case of copying the Nutch schema to overwrite the Solr
one only works for people who are beginning their indexing with a crawl.
More detailed instructions on how to modify Solr's schema.xml for Nutch
would be helpful, or better yet, a script to add the appropriate fields.
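
To illustrate what such additions look like (a partial sketch only; take the
authoritative field list from the schema.xml shipped in Nutch's conf/
directory, and note that the stored/indexed attributes below are
illustrative, using the Solr 3.x "text_general" type mentioned above):

<field name="content" type="text_general" stored="true" indexed="true"/>
<field name="title"   type="text_general" stored="true" indexed="true"/>
<field name="segment" type="string"       stored="true" indexed="true"/>
<field name="digest"  type="string"       stored="true" indexed="true"/>
<field name="boost"   type="float"        stored="true" indexed="false"/>
<field name="tstamp"  type="date"         stored="true" indexed="false"/>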

2) Is there a way to tell Nutch to index everything at a given site? I am
crawling a couple of my own sites and it seems rather clumsy just to give
Nutch a big topN. Wouldn't an "all" value be helpful?


Re: Fwd: Understanding Nutch workflow

2011-10-26 Thread lewis john mcgibbney
1.3 will cover 1.4. The main point was regarding the change in architecture
when taking into consideration the new runtime directory structure which was
introduced in Nutch 1.3.

Feel free to join me on getting a Hadoop tutorial for 1.4. It's been on the
agenda but somewhat shelved.

On Wed, Oct 26, 2011 at 4:25 PM, Bai Shen baishen.li...@gmail.com wrote:

 Gotcha.  Maybe I'll see about starting a 1.4 version of the tutorial.  Not
 sure if I'll have time, though.

 On Tue, Oct 25, 2011 at 2:14 PM, lewis john mcgibbney 
 lewis.mcgibb...@gmail.com wrote:

  Thanks, this is now sorted out.
 
   For reference, you can sign up and commit your own changes to the Nutch
  wiki.
 
 
  Thanks for pointing this out.
 
  lewis
 
  On Tue, Oct 25, 2011 at 6:43 PM, Bai Shen baishen.li...@gmail.com
 wrote:
 
   BTW, found a typo in the tutorial.  It has the following.
  
   bin/nutch parse $1
  
  
   And it should be this.
  
   bin/nutch parse $s1
  
  
  
   On Tue, Sep 27, 2011 at 2:35 PM, lewis john mcgibbney 
   lewis.mcgibb...@gmail.com wrote:
  
OK I understand. I think the main task with the documentation was
 that
   the
 fundamental architecture changed between Nutch 1.2 & 1.3. Generally
speaking
the community appears to understand this, but I fully recognise that
   there
is still a good deal of work to do. Thanks for pointing this out.
   
Lewis
   
On Tue, Sep 27, 2011 at 7:32 PM, Bai Shen baishen.li...@gmail.com
   wrote:
   
 I'm not looking for anything to be created.  It's just that a lot
 of
   the
 documentation seems to be marked as needing updates for 1.3 and I
 was
 wondering what the timeline for completing it was.

 On Tue, Sep 27, 2011 at 2:28 PM, lewis john mcgibbney 
 lewis.mcgibb...@gmail.com wrote:

  Hi,
 
  What documentation apart from what we have marked as in
  construction
   or
  TODO
  on the wiki would you like to see created?
 
  It has been a pretty long process getting these resources
  up-to-date,
  however we are getting there!
 
  
   
BTW, do you know what the timeline is to have the
 documentation
 updated
   for
1.3?
  
   It is as we speak. Lewis did quite a good job for the wiki docs
  on
1.3.
  
 
 
 
  --
  *Lewis*
 

   
   
   
--
*Lewis*
   
  
 
 
 
  --
  *Lewis*
 




-- 
*Lewis*


Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)

2011-10-26 Thread Chip Calhoun
I've got a few very large (upwards of 3 MB) XML files I'm trying to index, and 
I'm having trouble. Previously I'd had trouble with the fetch; now that seems 
to be okay, but due to the size of the files the parse takes much too long.

Is there a good way to optimize this that I'm missing? Is lengthy parsing of 
XML a known problem? I recognize that part of my problem is that I'm doing my 
testing from my aging desktop PC, and it will run faster when I move things to 
the server, but it's still slow.

I do get the following weird message in my log when I run ParserChecker or the 
crawler:

2011-10-26 09:51:47,729 INFO  parse.ParserFactory - The parsing plugins: 
[org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes 
system property, and all claim to support the content type application/xml, but 
they are not mapped to it  in the parse-plugins.xml file
2011-10-26 10:06:40,639 WARN  parse.ParseUtil - TIMEOUT parsing 
http://www.aip.org/history/ead/19990074.xml with 
org.apache.nutch.parse.tika.TikaParser@18355aa
2011-10-26 10:06:40,639 WARN  parse.ParseUtil - Unable to successfully parse 
content http://www.aip.org/history/ead/19990074.xml of type application/xml

My ParserChecker results look like this:

# bin/nutch org.apache.nutch.parse.ParserChecker -dumpText 
http://www.aip.org/history/ead/19990074.xml
-
Url
---
http://www.aip.org/history/ead/19990074.xml
-
ParseData
-
Version: 5
Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to 
successfully parse content
Title:
Outlinks: 0
Content Metadata:
Parse Metadata:
-
ParseText
-

And here's everything that might be relevant in my nutch-site.xml; I've tried 
it both with and without the urlmeta plugin, and that doesn't make a difference:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the http.content.limit setting.
  </description>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  Caution: classical ftp RFCs never defines partial transfer and, in fact,
  some ftp servers out there do not handle client side forced close-down very
  well. Our implementation tries its best to handle such situations smoothly.
  </description>
</property>
<property>
  <name>http.timeout</name>
  <value>4294967290</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
<property>
  <name>ftp.timeout</name>
  <value>4294967290</value>
  <description>Default timeout for ftp client socket, in millisec.
  Please also see ftp.keep.connection below.</description>
</property>
<property>
  <name>ftp.server.timeout</name>
  <value>4294967290</value>
  <description>An estimation of ftp server idle time, in millisec.
  Typically it is 12 millisec for many ftp servers out there.
  Better be conservative here. Together with ftp.timeout, it is used to
  decide if we need to delete (annihilate) current ftp.client instance and
  force to start another ftp.client instance anew. This is necessary because
  a fetcher thread may not be able to obtain next request from queue in time
  (due to idleness) before our ftp client times out or remote server
  disconnects. Used only when ftp.keep.connection is true (please see below).
  </description>
</property>
<property>
  <name>parser.timeout</name>
  <value>900</value>
  <description>Timeout in seconds for the parsing of a document, otherwise
  treats it as an exception and moves on to the following documents. This
  parameter is applied to any Parser implementation.
  Set to -1 to deactivate, bearing in mind that this could cause
  the parsing to crash because of a very long or corrupted document.
  </description>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>1</value>
  <description>The number of FetcherThreads the fetcher should use.
  This also determines the maximum number of requests that are
  made at once (each FetcherThread handles one connection).</description>
</property>
<property>
  <name>plugin.includes</name>
  

Re: 1) success 2) how to tell Nutch index everything

2011-10-26 Thread Markus Jelsma


On Wednesday 26 October 2011 16:37:14 Fred Zimmerman wrote:
 1) I resolved the issues with solrindex. It turned out to be a matter of
 adding all the nutch schema-specific fields to solr's schema.xml.  there
 was one gotcha which is that the latest solr schema does not have a
 default fieldtype text as in Nutch 1.3/schema.xml; you must use
 text_general.

You're free to use any naming convention you want in Solr. We ship a complete
working Solr schema. The fieldType's name doesn't really matter. We do not
intend to ship an advanced schema; developers must make changes that are
appropriate for their specific environment, use cases and scenarios.
 
 A comment for developers is that the use case of copying
 the nutch schema to overwrite the solr one only works for people who are
 beginning their indexing with a crawl.  More detailed instructions on how
 to modify solr/schema.xml for nutch would be helpful, or better yet, a
 script to add the appropriate fields.

The Solr schema provided with Nutch tells you exactly which fields are used.
Detailed instructions on how to use it with Solr are out of scope, in my
opinion.
You're of course free to make changes to the wiki :)

 
 2) is there a way to tell Nutch to index everything at a given site?  I am
 crawling a couple of my own sites and it seems rather clumsy just to give
 Nutch a big TopN.  wouldn't an all value be helpful?

The only way to do this is to keep running crawl cycles until all existing
and yet-to-be-discovered URLs are exhausted, until the fetch interval tells
the generator to refetch.
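
As a sketch of what one such cycle looks like with the 1.3 command-line tools
(directory names and the topN value are placeholders; stop once generate
reports there is nothing left to fetch):

bin/nutch generate crawl/crawldb crawl/segments -topN 100000
s=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s
bin/nutch parse $s
bin/nutch updatedb crawl/crawldb $s
bin/nutch invertlinks crawl/linkdb -dir crawl/segments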

Cheers


-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)

2011-10-26 Thread Markus Jelsma
The actual parse which is producing timeouts happens early in the process.
There are, to my knowledge, no Nutch settings to make this faster or change
its behaviour; it's all about the parser implementation.

Try increasing your parser.timeout setting.
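
For example, in nutch-site.xml (the value is in seconds, and the number here
is only a starting point; size it to your largest documents):

<property>
  <name>parser.timeout</name>
  <value>3600</value>
</property>

Setting it to -1 disables the timeout entirely, with the caveat noted in the
property's description.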

On Wednesday 26 October 2011 16:45:33 Chip Calhoun wrote:
 I've got a few very large (upwards of 3 MB) XML files I'm trying to index,
 and I'm having trouble. Previously I'd had trouble with the fetch; now
 that seems to be okay, but due to the size of the files the parse takes
 much too long.
 
 Is there a good way to optimize this that I'm missing? Is lengthy parsing
 of XML a known problem? I recognize that part of my problem is that I'm
 doing my testing from my aging desktop PC, and it will run faster when I
 move things to the server, but it's still slow.
 
 I do get the following weird message in my log when I run ParserChecker or
 the crawler:
 
 2011-10-26 09:51:47,729 INFO  parse.ParserFactory - The parsing plugins:
 [org.apache.nutch.parse.tika.TikaParser] are enabled via the
 plugin.includes system property, and all claim to support the content type
 application/xml, but they are not mapped to it  in the parse-plugins.xml
 file 2011-10-26 10:06:40,639 WARN  parse.ParseUtil - TIMEOUT parsing
 http://www.aip.org/history/ead/19990074.xml with
 org.apache.nutch.parse.tika.TikaParser@18355aa 2011-10-26 10:06:40,639
 WARN  parse.ParseUtil - Unable to successfully parse content
 http://www.aip.org/history/ead/19990074.xml of type application/xml
 
 My ParserChecker results look like this:
 
 # bin/nutch org.apache.nutch.parse.ParserChecker -dumpText
 http://www.aip.org/history/ead/19990074.xml -
 Url
 ---
 http://www.aip.org/history/ead/19990074.xml-
 ParseData
 -
 Version: 5
 Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to
 successfully parse content Title:
 Outlinks: 0
 Content Metadata:
 Parse Metadata:
 -
 ParseText
 -
 
 And here's everything that might be relevant in my nutch-site.xml; I've
 tried it both with and without the urlmeta plugin, and that doesn't make a
 difference:



RE: Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)

2011-10-26 Thread Chip Calhoun
Increasing parser.timeout to 3600 got me what I needed. I only have a few files 
this huge, so I'll live with that.

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Wednesday, October 26, 2011 10:55 AM
To: user@nutch.apache.org
Subject: Re: Extremely long parsing of large XML files (Was RE: Good workaround 
for timeout?)

The actual parse which is producing timeouts happens early in the process.
There are, to my knowledge, no Nutch settings to make this faster or change
its behaviour; it's all about the parser implementation.

Try increasing your parser.timeout setting.

On Wednesday 26 October 2011 16:45:33 Chip Calhoun wrote:
 I've got a few very large (upwards of 3 MB) XML files I'm trying to 
 index, and I'm having trouble. Previously I'd had trouble with the 
 fetch; now that seems to be okay, but due to the size of the files the 
 parse takes much too long.
 
 Is there a good way to optimize this that I'm missing? Is lengthy 
 parsing of XML a known problem? I recognize that part of my problem is 
 that I'm doing my testing from my aging desktop PC, and it will run 
 faster when I move things to the server, but it's still slow.
 
 I do get the following weird message in my log when I run 
 ParserChecker or the crawler:
 
 2011-10-26 09:51:47,729 INFO  parse.ParserFactory - The parsing plugins:
 [org.apache.nutch.parse.tika.TikaParser] are enabled via the 
 plugin.includes system property, and all claim to support the content 
 type application/xml, but they are not mapped to it  in the 
 parse-plugins.xml file 2011-10-26 10:06:40,639 WARN  parse.ParseUtil - 
 TIMEOUT parsing http://www.aip.org/history/ead/19990074.xml with 
 org.apache.nutch.parse.tika.TikaParser@18355aa 2011-10-26 10:06:40,639 
 WARN  parse.ParseUtil - Unable to successfully parse content 
 http://www.aip.org/history/ead/19990074.xml of type application/xml
 
 My ParserChecker results look like this:
 
 # bin/nutch org.apache.nutch.parse.ParserChecker -dumpText 
 http://www.aip.org/history/ead/19990074.xml - Url
 ---
 http://www.aip.org/history/ead/19990074.xml-
 ParseData
 -
 Version: 5
 Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable 
 to successfully parse content Title:
 Outlinks: 0
 Content Metadata:
 Parse Metadata:
 -
 ParseText
 -
 
 And here's everything that might be relevant in my nutch-site.xml; 
 I've tried it both with and without the urlmeta plugin, and that 
 doesn't make a
 difference:



Re: Fetcher NPE's

2011-10-26 Thread Sebastian Nagel

Hi Markus,

the error resembles a problem I've observed some time ago but never managed
to open an issue. Opened right now: 
https://issues.apache.org/jira/browse/NUTCH-1182
The stack you observed is the same.

Sebastian

On 10/19/2011 05:01 PM, Markus Jelsma wrote:

Hi,

We sometimes see a fetcher task failing with 0 pages. Inspecting the logs it's
clear URLs are actually fetched until, for some reason, an NPE occurs. The
thread then dies and seems to output 0 records.

The URLs themselves are fetchable using the index- or parser checker, no
problem there. Any ideas how we can pinpoint the source of the issue?

Thanks,

A sample exception:

2011-10-19 14:30:50,145 INFO org.apache.nutch.fetcher.Fetcher: fetch of
http://SOME_URL/ failed with: java.lang.NullPointerException
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher:
java.lang.NullPointerException
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at
java.lang.System.arraycopy(Native Method)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1276)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1193)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at
java.io.DataOutputStream.writeByte(DataOutputStream.java:136)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at
org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:264)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at
org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:244)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at
org.apache.hadoop.io.Text.write(Text.java:281)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at
org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at
org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1060)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at
org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:591)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at
org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:936)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:805)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: fetcher
caught:java.lang.NullPointerException

The code catching the error:

801     } catch (Throwable t) {                 // unexpected exception
802       // unblock
803       fetchQueues.finishFetchItem(fit);
804       logError(fit.url, t.toString());
805       output(fit.url, fit.datum, null, ProtocolStatus.STATUS_FAILED,
              CrawlDatum.STATUS_FETCH_RETRY);
806     }