Re : Re: fetcher.max.crawl.delay = -1 doesn't work?

2012-02-15 Thread Danicela nutch
I don't think I configured such things, how can I be sure? - Original Message - From: Lewis John Mcgibbney Sent: 14.02.12 19:18 To: user@nutch.apache.org Subject: Re: fetcher.max.crawl.delay = -1 doesn't work? Hi Danicela, Before I try this, have you configured any other overrides

Re: Failed fetching

2012-02-15 Thread remi tassing
I just used protocol-http and it works! It's probably a configuration issue. You can download a clean version and start afresh. Remi On Wed, Feb 15, 2012 at 3:46 AM, tiagorcs dasilva-ti...@mitsue.co.jp wrote: So do you suggest that I download Nutch from a different source? Maybe to reconfigure

tstamp vs. lastModified ...

2012-02-15 Thread remi tassing
Hello all, What does tstamp represent? I can see it shown in Solr results after indexing. I'm interested in showing the last-modified metadata in Solr results, but I'm not sure whether Nutch retrieves this value. Thanks in advance for the help! Remi

Re: how are CSV/TXT files handled

2012-02-15 Thread remi tassing
Hi, Tika is parsing properly. I think it was some kind of proxy issue, combined with the http.content.limit setting. Thanks! Remi On Fri, Feb 10, 2012 at 11:16 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Remi, Please ensure that your http.content.limit is sufficient, what are you url

Re: tstamp vs. lastModified ...

2012-02-15 Thread Lewis John Mcgibbney
IIRC, tstamp represents when the page was last fetched. Yes, you should be able to specify this value in your schema and get it mapped to the Solr index. Last modified is when the actual page was last modified, e.g. when there was a change to the page source or something. On Wed, Feb 15, 2012 at 1:26

Re: tstamp vs. lastModified ...

2012-02-15 Thread remi tassing
Hey Lewis, Thanks for the clarification! For tstamp, I can actually see it in Solr results (even though the format is weird). How can I get the Last-Modified value in Solr as well? Does Nutch need to be configured in some way? Remi On Wed, Feb 15, 2012 at 3:46 PM, Lewis John Mcgibbney

Re: tstamp vs. lastModified ...

2012-02-15 Thread Lewis John Mcgibbney
Hi Remi, On Wed, Feb 15, 2012 at 1:51 PM, remi tassing tassingr...@gmail.com wrote: Thanks for the clarification! nb For tstamp, I can actually see it in Solr results (even though the format is weird) What is the format? How can I get Last-Modified value in Solr as well? Does Nutch

Re: tstamp vs. lastModified ...

2012-02-15 Thread remi tassing
Hi, tstamp shows a string of digits like 20020123123212. I had never heard of the index-more plugin, and it's poorly documented. After adding it to plugin.includes, do I only need to run solrindex, or is it necessary to re-parse or recrawl (which I think is less likely)? Thanks again Remi On Wednesday,

Re: tstamp vs. lastModified ...

2012-02-15 Thread Lewis John Mcgibbney
Hi, On Wed, Feb 15, 2012 at 4:00 PM, remi tassing tassingr...@gmail.com wrote: tstamp shows a string of digits like 20020123123212 This is OK. yyyy-mm-dd-hh-mm-ssZ It is however hellishly old! Never heard of the plugin index-more and it's poorly documented. Well it's been included in

Re: tstamp vs. lastModified ...

2012-02-15 Thread remi tassing
Is there any quick way to see the impact of index-more? I deleted the parse-related folders in the segment and re-parsed it, but when I run readseg there is no difference. On Wednesday, February 15, 2012, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, On Wed, Feb 15, 2012 at 4:00 PM,

Re: tstamp vs. lastModified ...

2012-02-15 Thread SUJIT PAL
Remi, I had a similar problem but for a custom field that I was trying to post to Solr (via solrindex) as a type=date in the schema.xml. Turns out my date string was formatted incorrectly (it was missing the trailing Z). From the error message it appears that perhaps the field into which this
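For readers following the date-format discussion above, here is a minimal Python sketch of converting a 14-digit value such as 20020123123212 into the date syntax Solr expects for a type=date field, including the trailing Z. The function name `to_solr_date` is hypothetical, and treating the digits as yyyyMMddHHmmss in UTC is an assumption based on the example value quoted in this thread, not a statement about how Nutch itself formats tstamp.

```python
from datetime import datetime, timezone


def to_solr_date(digits: str) -> str:
    """Convert a 14-digit yyyyMMddHHmmss string to Solr's date syntax.

    Assumption: the digits represent a UTC time. The helper name is
    hypothetical and is not part of Nutch or Solr.
    """
    # Parse the compact digit string into a datetime and mark it as UTC.
    parsed = datetime.strptime(digits, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    # Solr date fields expect ISO 8601 with a literal trailing "Z".
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    print(to_solr_date("20020123123212"))
```

A value without the trailing Z is the kind of string that a Solr date field would be expected to reject, which matches the error described in the message above.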

Re: tstamp vs. lastModified ...

2012-02-15 Thread remi tassing
You're both correct: after changing the type for tstamp and lastModified from long to date, there is no error anymore. The next thing I need to do is set up cygwin/svn to be able to get fresh svn/trunk code...it's so cool to be up-to-date. Nutch-1.4 is just ridiculously faster than 1.2 :-) Thanks!! Remi On

Re: tstamp vs. lastModified ...

2012-02-15 Thread Markus Jelsma
You're both correct: after changing the type for tstamp and lastModified from long to date, there is no error anymore. The next thing I need to do is set up cygwin/svn to be able to get fresh svn/trunk code...it's so cool to be up-to-date. Nutch-1.4 is just ridiculously faster than 1.2 :-) Is it

Re: fetcher.threads.per.queue and fetcher.server.delay

2012-02-15 Thread Markus Jelsma
So I am trying to optimize the fetch performance, and I think that I am failing miserably, since I am not able to max out any of my resources (CPU, RAM, and more importantly bandwidth). Obviously I am not trying to max out all of them at the same time. I just want to find out the bottleneck, and I can't

Re: tstamp vs. lastModified ...

2012-02-15 Thread remi tassing
It would be interesting to find out what exactly causes such a huge speed difference. For me the speed increase is on the order of 10x...crazy! On Wed, Feb 15, 2012 at 9:35 PM, Markus Jelsma markus.jel...@openindex.io wrote: You're both correct, after changing the type for tstamp and lastModified

Re: Re : Re: fetcher.max.crawl.delay = -1 doesn't work?

2012-02-15 Thread Lewis John Mcgibbney
Another question I should have asked is how long is the crawl delay in robots.txt? If you read the fetcher.max.crawl.delay property description, it explicitly notes that the fetcher will wait as long as robots.txt requires before it fetches the page. Do you have this information? Thanks
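As an illustration of the behaviour described in the property description, the following Python sketch models how a fetcher might decide between waiting and skipping. It is not Nutch code: the function `effective_delay` and its parameters are hypothetical, and the rule that a page is skipped when the robots.txt Crawl-Delay exceeds a non-negative maximum is taken from the property's documented description, not from this thread.

```python
from typing import Optional


def effective_delay(robots_delay_s: float, max_crawl_delay_s: int) -> Optional[float]:
    """Return the seconds to wait before fetching, or None if the page is skipped.

    Hypothetical model of fetcher.max.crawl.delay, not the Nutch implementation.
    """
    # A negative maximum (e.g. -1) means "never skip": honour robots.txt fully.
    if max_crawl_delay_s < 0:
        return robots_delay_s
    # Otherwise a Crawl-Delay above the maximum causes the page to be skipped.
    if robots_delay_s > max_crawl_delay_s:
        return None
    return robots_delay_s


if __name__ == "__main__":
    print(effective_delay(120, -1))  # waits the full robots.txt delay
    print(effective_delay(120, 30))  # skipped, delay exceeds the maximum
```

Under this model, a setting of -1 combined with a long Crawl-Delay in robots.txt results in long waits, not skipped pages, which is why the actual delay value matters for diagnosing the problem.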

Re: Build a pipeline using nutch

2012-02-15 Thread Markus Jelsma
my questions/doubts are inline On Tue, Feb 14, 2012 at 4:06 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Puneet, On Tue, Feb 14, 2012 at 5:12 AM, Puneet Pandey puneet...@gmail.com wrote: I have started using nutch recently. As I understand nutch crawling

Re: fetcher.threads.per.queue and fetcher.server.delay

2012-02-15 Thread Markus Jelsma
So I am trying to optimize the fetch performance, and I think that I am failing miserably, since I am not able to max out any of my resources (CPU, RAM, and more importantly bandwidth). Obviously I am not trying to max out all of them at the same time. I just want to find out the bottleneck, and I

Re: Build a pipeline using nutch

2012-02-15 Thread Markus Jelsma
Feb 14, 2012 at 4:06 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Puneet, On Tue, Feb 14, 2012 at 5:12 AM, Puneet Pandey puneet...@gmail.com wrote: I have started using nutch recently. As I understand nutch crawling is a cyclic process

Re: Build a pipeline using nutch

2012-02-15 Thread Magnús Skúlason
It sounds to me like it's not obvious that you would want to use Nutch to deliver this functionality. What is it that you hope to get out of Nutch? Why not just write a simple Java process using httpclient to fetch the pages from your other process? Or even wget them? and extract the content best

Re: Build a pipeline using nutch

2012-02-15 Thread remi tassing
Hi, Just a related question: Does it make a big difference to fetch and parse directly rather than fetch everything first and then parse? I was under the impression that they yield the same end result. Remi On Wednesday, February 15, 2012, Markus Jelsma mar...@apache.org wrote: my questions/doubts are