Hi,
I am working with a Nutch 1.4 snapshot and having a very strange problem that
makes the system run out of memory when indexing into Solr. This does not look
like a trivial lack of memory problem that can be solved by giving more memory
to the JVM. I've increased the max memory size from 2Gb
Hi Markus,
the error resembles a problem I observed some time ago but never managed
to open an issue for. I've opened one now:
https://issues.apache.org/jira/browse/NUTCH-1182
The stack you observed is the same.
Sebastian
On 10/19/2011 05:01 PM, Markus Jelsma wrote:
Hi,
We sometimes see a fetche
Increasing parser.timeout to 3600 got me what I needed. I only have a few files
this huge, so I'll live with that.
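For reference, parser.timeout is normally overridden in conf/nutch-site.xml. A minimal sketch of such an override, assuming the value is in seconds as in nutch-default.xml (3600 matches the figure mentioned above):

```xml
<!-- nutch-site.xml: raise the parse timeout for very large documents. -->
<property>
  <name>parser.timeout</name>
  <value>3600</value>
</property>
```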
-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Wednesday, October 26, 2011 10:55 AM
To: user@nutch.apache.org
Subject: Re: Extremely long p
The actual parse which is producing timeouts happens early in the process.
There are, to my knowledge, no Nutch settings to make this faster or change
its behaviour; it's all about the parser implementation.
Try increasing your parser.timeout setting.
On Wednesday 26 October 2011 16:45:33 Chip
On Wednesday 26 October 2011 16:37:14 Fred Zimmerman wrote:
> 1) I resolved the issues with solrindex. It turned out to be a matter of
> adding all the nutch schema-specific fields to solr's schema.xml. there
> was one gotcha which is that the latest solr schema does not have a
> default fieldty
I've got a few very large (upwards of 3 MB) XML files I'm trying to index, and
I'm having trouble. Previously I'd had trouble with the fetch; now that seems
to be okay, but due to the size of the files the parse takes much too long.
Is there a good way to optimize this that I'm missing? Is lengt
1.3 will cover 1.4. The main point concerned the change in architecture,
in particular the new runtime directory structure introduced in Nutch 1.3.
Feel free to join me on getting a Hadoop tutorial for 1.4. It's been on the
agenda but somewhat shelved.
On Wed, Oct 26
1) I resolved the issues with solrindex. It turned out to be a matter of
adding all the Nutch schema-specific fields to Solr's schema.xml. There was
one gotcha: the latest Solr schema does not have a default fieldtype "text"
as in Nutch 1.3/schema.xml; you must use "text_general". A
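As a hedged sketch, a Nutch field declared in Solr's schema.xml might look like this — the field name `content` is taken from later in this thread, and the attribute values are illustrative, not the authoritative Nutch schema:

```xml
<!-- Example Nutch field in Solr's schema.xml, using the "text_general"
     fieldtype in place of the absent "text" type. -->
<field name="content" type="text_general" stored="false" indexed="true"/>
```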
On Wednesday 26 October 2011 16:24:15 Bai Shen wrote:
> On Tue, Oct 25, 2011 at 1:25 PM, Markus Jelsma
>
> wrote:
> > > Is there a reason to keep a segment around after it's been indexed?
> > > When following the tutorial, I ended up sending the same segment to
> > > the solr server multiple ti
Gotcha. Maybe I'll see about starting a 1.4 version of the tutorial. Not
sure if I'll have time, though.
On Tue, Oct 25, 2011 at 2:14 PM, lewis john mcgibbney <
lewis.mcgibb...@gmail.com> wrote:
> Thanks, this is now sorted out.
>
> For reference, you can sign up and commit your own changes to t
On Tue, Oct 25, 2011 at 1:25 PM, Markus Jelsma
wrote:
> > Is there a reason to keep a segment around after it's been indexed? When
> > following the tutorial, I ended up sending the same segment to the solr
> > server multiple times because I was using segments/* as my argument.
>
> Only send the
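One way to avoid resending already-indexed segments is to pass only the newest one instead of segments/*. A minimal sketch, assuming the crawl/segments layout from the thread (the timestamped directory names below are made up for illustration):

```shell
# Stand-in segments layout (illustration only).
mkdir -p crawl/segments/20111026120000 crawl/segments/20111026130000

# Segment directories are timestamped, so lexical sort order is
# chronological: the last entry is the newest segment.
SEGMENT=$(ls -d crawl/segments/* | sort | tail -n 1)
echo "newest segment: $SEGMENT"

# Then index just that segment rather than segments/*, e.g.:
# bin/nutch solrindex http://localhost:8983/solr crawl/crawldb "$SEGMENT"
```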
will do. Of course I have already googled these terms without much luck.
Fred
On Wed, Oct 26, 2011 at 9:34 AM, lewis john mcgibbney <
lewis.mcgibb...@gmail.com> wrote:
> Hi Fred,
>
> These are clearly Solr aimed questions, which I would observe are specific
> to your schema. Maybe try the Solr
Hi Fred,
These are clearly Solr-aimed questions, which I would observe are specific
to your schema. Maybe try the Solr archives for keywords, or else try the
Solr user lists. I think that you are much more likely to get a substantiated
response there.
Thank you
On Wed, Oct 26, 2011 at 3:31 PM, Fr
I added just the field ... I have already modified solr's
schema.xml to accommodate some other data types.
Now when starting solr ...
INFO: SolrUpdateServlet.init() done
2011-10-26 13:29:50.849:INFO::Started SocketConnector@0.0.0.0:8983
2011-10-26 13:30:23.129:WARN::/solr/admin/
java.lang.Illega
Add the schema.xml from nutch/conf to your Solr core.
btw: be careful with your host and port in the mailing lists. If it's open
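As a sketch of that copy step, with placeholder paths (your actual Nutch conf dir and Solr core conf dir will differ):

```shell
# Stand-in directories for illustration; substitute your real
# Nutch conf dir and Solr core conf dir.
mkdir -p nutch/conf solr/example/solr/conf
printf '<schema/>' > nutch/conf/schema.xml   # placeholder file

# The actual step: copy Nutch's schema.xml into the Solr core.
cp nutch/conf/schema.xml solr/example/solr/conf/schema.xml
ls solr/example/solr/conf
```

Restart Solr afterwards so the new schema is picked up.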
On Wednesday 26 October 2011 15:07:56 Fred Zimmerman wrote:
> that's it.
>
> org.apache.solr.common.SolrException: ERROR:unknown field 'content'
>
> *ERROR:unknow
that's it.
org.apache.solr.common.SolrException: ERROR:unknown field 'content'
*ERROR:unknown field 'content'*
request: http://search.zimzaz.com:8983/solr/update?wt=javabin&version=2
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:436)
Check your hadoop.log and the Solr log. If that happens there's usually a
field mismatch when indexing.
On Wednesday 26 October 2011 14:59:02 Fred Zimmerman wrote:
> OK, I've fixed the problem with the parameters giving incorrect paths to
> the files. Now I get this:
>
> $ bin/nutch solrindex http:/
OK, I've fixed the problem with the parameters giving incorrect paths to the
files. Now I get this:
$ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl/crawldb
crawl/linkdb crawl/segments/*
SolrIndexer: starting at 2011-10-26 12:57:57
java.io.IOException: Job failed!
Besides, the -linkdb param is 1.4, not 1.3; that's what's wrong here. Bai
explicitly mentioned 1.4.
> Hi Fred,
>
> Please ensure that the linkdb command was executed successfully. The output
> logs do not indicate this.
> Looks like you've got a '-' minus character in front of the relative linkdb
> d