Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

2011-10-26 Thread lewis john mcgibbney
Hi Fred, Please ensure that the linkdb command was executed successfully. The output logs do not indicate this. Looks like you've got a '-' minus character in front of the relative linkdb directory as well. HTH On Wed, Oct 26, 2011 at 1:27 AM, Fred Zimmerman zimzaz@gmail.com wrote: I'm still
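
(Sketch, not from the thread: the linkdb is produced by the invertlinks job; the crawl/ paths follow Fred's layout and are assumptions.)

  # (Re)build the link database from all fetched segments before indexing
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments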

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

2011-10-26 Thread Markus Jelsma
Besides, the -linkdb param is 1.4, not 1.3; that's what's wrong here. Bai explicitly mentioned 1.4. Hi Fred, Please ensure that the linkdb command was executed successfully. The output logs do not indicate this. Looks like you've got a '-' minus character in front of the relative linkdb
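
(Hedged sketch of the two invocations being confused here; exact usage strings may differ per release, and http://localhost:8983/solr is a placeholder.)

  # Nutch 1.3: the linkdb is a plain positional argument
  bin/nutch solrindex http://localhost:8983/solr crawl/crawldb crawl/linkdb crawl/segments/*
  # Nutch 1.4: the linkdb is optional and passed with the -linkdb flag
  bin/nutch solrindex http://localhost:8983/solr crawl/crawldb -linkdb crawl/linkdb crawl/segments/*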

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

2011-10-26 Thread Fred Zimmerman
OK, I've fixed the problem with the parameters giving incorrect paths to the files. Now I get this: $ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl/crawldb crawl/linkdb crawl/segments/* SolrIndexer: starting at 2011-10-26 12:57:57 java.io.IOException: Job failed!
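
(Sketch: when solrindex reports only "Job failed!", the underlying exception usually lands in the local job log; the log path assumes a local/runtime deployment.)

  # Surface the real error behind the failed indexing job
  tail -n 200 logs/hadoop.log | grep -i -E -A 5 'exception|error'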

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

2011-10-26 Thread Fred Zimmerman
that's it. org.apache.solr.common.SolrException: ERROR:unknown field 'content' *ERROR:unknown field 'content'* request: http://search.zimzaz.com:8983/solr/update?wt=javabin&version=2 at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:436)

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

2011-10-26 Thread Markus Jelsma
Add the schema.xml from nutch/conf to your Solr core. btw: be careful with your host and port in the mailing lists. If it's open On Wednesday 26 October 2011 15:07:56 Fred Zimmerman wrote: that's it. org.apache.solr.common.SolrException: ERROR:unknown field 'content' *ERROR:unknown
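
(A minimal sketch of Markus's suggestion, assuming the stock Solr example core; NUTCH_HOME and SOLR_HOME are placeholders for your install paths.)

  # Replace the core's schema with the one shipped by Nutch, then restart Solr
  cp $NUTCH_HOME/conf/schema.xml $SOLR_HOME/example/solr/conf/schema.xml
  cd $SOLR_HOME/example && java -jar start.jar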

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

2011-10-26 Thread Fred Zimmerman
I added just the content field ... I have already modified solr's schema.xml to accommodate some other data types. Now when starting solr ... INFO: SolrUpdateServlet.init() done 2011-10-26 13:29:50.849:INFO::Started SocketConnector@0.0.0.0:8983 2011-10-26 13:30:23.129:WARN::/solr/admin/

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

2011-10-26 Thread lewis john mcgibbney
Hi Fred, These are clearly Solr-aimed questions, which I would observe are specific to your schema. Maybe try the Solr archives for key words, or else try the Solr user lists. I think that you are much more likely to get a substantiated response there. Thank you On Wed, Oct 26, 2011 at 3:31 PM,

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

2011-10-26 Thread Fred Zimmerman
will do. Of course I have already googled these terms without much luck. Fred On Wed, Oct 26, 2011 at 9:34 AM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Fred, These are clearly Solr aimed questions, which I would observe are specific to your schema. Maybe try the Solr

Re: Segment cleanup

2011-10-26 Thread Bai Shen
On Tue, Oct 25, 2011 at 1:25 PM, Markus Jelsma markus.jel...@openindex.io wrote: Is there a reason to keep a segment around after it's been indexed? When following the tutorial, I ended up sending the same segment to the solr server multiple times because I was using segments/* as my
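
(One hedged way to avoid re-sending already-indexed segments is to index only the newest one instead of segments/*; a sketch using the 1.3-style invocation, with a placeholder Solr URL.)

  # Segment directories are timestamps, so the lexicographically last one is the newest
  SEGMENT=$(ls -d crawl/segments/* | sort | tail -1)
  bin/nutch solrindex http://localhost:8983/solr crawl/crawldb crawl/linkdb "$SEGMENT"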

Re: Fwd: Understanding Nutch workflow

2011-10-26 Thread Bai Shen
Gotcha. Maybe I'll see about starting a 1.4 version of the tutorial. Not sure if I'll have time, though. On Tue, Oct 25, 2011 at 2:14 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Thanks, this is now sorted out. For reference, you can sign up and commit your own changes to the

Re: Segment cleanup

2011-10-26 Thread Markus Jelsma
On Wednesday 26 October 2011 16:24:15 Bai Shen wrote: On Tue, Oct 25, 2011 at 1:25 PM, Markus Jelsma markus.jel...@openindex.io wrote: Is there a reason to keep a segment around after it's been indexed? When following the tutorial, I ended up sending the same segment to the solr

1) success 2) how to tell Nutch index everything

2011-10-26 Thread Fred Zimmerman
1) I resolved the issues with solrindex. It turned out to be a matter of adding all the Nutch schema-specific fields to Solr's schema.xml. There was one gotcha, which is that the latest Solr schema does not have a default fieldtype text as in Nutch 1.3/schema.xml; you must use text_general. A
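
(Rough sketch of the substitution Fred describes, for newer Solr example schemas that only ship text_general; back up schema.xml first, and SOLR_HOME is a placeholder.)

  # Point every field that referenced the missing "text" type at "text_general"
  sed -i.bak 's/type="text"/type="text_general"/g' $SOLR_HOME/example/solr/conf/schema.xml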

Re: Fwd: Understanding Nutch workflow

2011-10-26 Thread lewis john mcgibbney
1.3 will cover 1.4. The main point was regarding the change in architecture when taking into consideration the new runtime directory structure which was introduced in Nutch 1.3. Feel free to join me on getting a Hadoop tutorial for 1.4. It's been on the agenda but somewhat shelved. On Wed, Oct

Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)

2011-10-26 Thread Chip Calhoun
I've got a few very large (upwards of 3 MB) XML files I'm trying to index, and I'm having trouble. Previously I'd had trouble with the fetch; now that seems to be okay, but due to the size of the files the parse takes much too long. Is there a good way to optimize this that I'm missing? Is

Re: 1) success 2) how to tell Nutch index everything

2011-10-26 Thread Markus Jelsma
On Wednesday 26 October 2011 16:37:14 Fred Zimmerman wrote: 1) I resolved the issues with solrindex. It turned out to be a matter of adding all the Nutch schema-specific fields to Solr's schema.xml. There was one gotcha, which is that the latest Solr schema does not have a default fieldtype

Re: Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)

2011-10-26 Thread Markus Jelsma
The actual parse which is producing timeouts happens early in the process. There are, to my knowledge, no Nutch settings to make this faster or change its behaviour; it's all about the parser implementation. Try increasing your parser.timeout setting. On Wednesday 26 October 2011 16:45:33

RE: Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)

2011-10-26 Thread Chip Calhoun
Increasing parser.timeout to 3600 got me what I needed. I only have a few files this huge, so I'll live with that. -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Wednesday, October 26, 2011 10:55 AM To: user@nutch.apache.org Subject: Re: Extremely long
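
(For reference, a sketch of the override; parser.timeout is overridden in conf/nutch-site.xml, the value is in seconds, and 3600 is the value Chip reports working.)

  # Check the shipped default, then paste the printed stanza inside <configuration>
  # in conf/nutch-site.xml and re-run the parse
  grep -A 3 'parser.timeout' conf/nutch-default.xml
  printf '%s\n' '<property>' '  <name>parser.timeout</name>' '  <value>3600</value>' '</property>'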

Re: Fetcher NPE's

2011-10-26 Thread Sebastian Nagel
Hi Markus, the error resembles a problem I've observed some time ago but never managed to open an issue. Opened right now: https://issues.apache.org/jira/browse/NUTCH-1182 The stack you observed is the same. Sebastian On 10/19/2011 05:01 PM, Markus Jelsma wrote: Hi, We sometimes see a