Optimising the speed of Nutch.

2012-02-21 Thread Bharat Goyal
Hi, I have a list of around 1000 seed URLs, which I crawl to depth 2 or 3. This is done on a local machine (with no other large resource-consuming processes running) with the following configuration: Dual Core (2.4 GHz), 4 GB RAM. It takes around 14-15 hours to crawl this seed list, which generates

Re: Optimising the speed of Nutch.

2012-02-21 Thread Julien Nioche
See http://wiki.apache.org/nutch/OptimizingCrawls for a checklist. On 21 February 2012 10:47, Bharat Goyal bharat.go...@shiksha.com wrote: The number of fetcher threads is equal to the default value (10). What is the optimum number of threads? Also, the fetching and parsing are not separate.

problem with solrindex

2012-02-21 Thread kaveh
So I am getting this error while running solrindex: org.apache.solr.common.SolrException: ERROR_httpwww2moderncomsitegiftregistryhtml_multiple_values_encountered_for_non_multiValued_field_title_2Modern_Gift_Registry_giftregistryhtml

attn:Markus :) multiple_values_encountered_for_non_multiValued_field_title

2012-02-21 Thread kaveh minooie
Hi Markus, I was searching for this issue and saw that you had the same problem before. How did you solve yours? This is the error that I am getting when I run solrindex: org.apache.solr.common.SolrException:

Re: Please help - Nutch fetch command not fetching data

2012-02-21 Thread apachenutch
Any suggestions? -- View this message in context: http://lucene.472066.n3.nabble.com/Please-help-Nutch-fetch-command-not-fetching-data-tp3764751p3764865.html Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Please help - Nutch fetch command not fetching data

2012-02-21 Thread apachenutch
Update DB was done, after inject, generate, fetch and parse. Tried iterating after doing the update.

Re: Failed fetching

2012-02-21 Thread tiagorcs
I've downloaded the sources and compiled them myself. Both protocol-http and protocol-httpclient (with basic auth) are working like a charm now. Thx for the help! T

Re: Please help - Nutch fetch command not fetching data

2012-02-21 Thread SUJIT PAL
Hi apachenutch, Something of a wild guess here. Given that you are using the same seed file as I am, I would have expected to see a single URL in the index at the end of the first iteration, not 10. So the only time I have observed similar behavior was when the fetcher truncated the file

[nutchgora] - proposal to support distributed indexing

2012-02-21 Thread SUJIT PAL
Hi, I need to move the SOLR based search platform to a distributed setup, and therefore need to be able to write to multiple SOLR servers from Nutch (working on the nutchgora branch, so this may be specific to this branch). Here is what I think I need to do... Currently, SolrIndexerJob writes

Re: [nutchgora] - proposal to support distributed indexing

2012-02-21 Thread Julien Nioche
Hi Sujit, Sounds good. A nice way of doing it would be to make it so that people can define how to partition over the SOLR instances in the way they want, e.g. consistent hashing, URL range or crawldb metadata, by taking a class name as a parameter. Does not need to be pluggable I think. I had
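
To make the consistent hashing option mentioned above concrete, here is a minimal sketch in Python. It is not Nutch or SolrIndexerJob code: the class name ConsistentHashPartitioner, the server_for method, the replicas parameter and the server URLs are all hypothetical, chosen only to illustrate how a URL could be mapped to one of several SOLR instances.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 64-bit hash of a string (first 8 bytes of its MD5 digest).
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class ConsistentHashPartitioner:
    """Hypothetical partitioner: maps a URL to one of several SOLR servers."""

    def __init__(self, servers, replicas=64):
        if not servers:
            raise ValueError("at least one server is required")
        # Place `replicas` virtual points per server on the hash ring.
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def server_for(self, url: str) -> str:
        # Pick the first ring point clockwise from the URL's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._points, _hash(url)) % len(self._ring)
        return self._ring[idx][1]


# Assumed server URLs, for illustration only.
servers = [
    "http://solr1:8983/solr",
    "http://solr2:8983/solr",
    "http://solr3:8983/solr",
]
partitioner = ConsistentHashPartitioner(servers)
target = partitioner.server_for("http://example.com/page.html")
```

The point of hashing onto a ring rather than using hash(url) % n is that removing or adding a server only moves the URLs that were assigned to that server; the rest keep their assignment. A URL-range or metadata-based scheme could expose the same server_for method, which is roughly what selecting the strategy by class name would amount to.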

Re: attn:Markus :) multiple_values_encountered_for_non_multiValued_field_title

2012-02-21 Thread Geek Gamer
Change the line in schema.xml from: <field name="title" type="text" stored="false" indexed="true" termVectors="true" multiValued="false"/> to: <field name="title" type="text" stored="false" indexed="true" termVectors="true" multiValued="true"/> It is unusual to have multiple titles in a webpage, can you provide the URL