I don't think I configured such things, how can I be sure?
----- Original Message -----
From: Lewis John Mcgibbney
Sent: 14.02.12 19:18
To: user@nutch.apache.org
Subject: Re: fetcher.max.crawl.delay = -1 doesn't work?
Hi Danicela, Before I try this, have you configured any other overrides
I just used protocol-http and it works!
It's probably a configuration issue. You can download a clean version and
start afresh
Remi
On Wed, Feb 15, 2012 at 3:46 AM, tiagorcs dasilva-ti...@mitsue.co.jp wrote:
So do you suggest I download Nutch from a different source? Maybe to
reconfigure
Hello all,
What does tstamp represent? I can see it shown in Solr results after indexing.
I'm interested in showing the last modified meta-data in Solr results but
I'm not sure if Nutch does retrieve this value.
Thanks in advance for the help!
Remi
Hi,
Tika is parsing properly, I think it was some kind of proxy issue and also
the http.content.limit.
Thanks!
Remi
On Fri, Feb 10, 2012 at 11:16 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Remi,
Please ensure that your http.content.limit is sufficient, what are your url
iirc the time stamp represents when the page was last fetched. Yes you should be
able to specify this value in your schema and get it mapped to solr index.
Last modified is when the actual page was last modified e.g. when there was
a change to the page source or something.
On Wed, Feb 15, 2012 at 1:26
Hey Lewis,
Thanks for the clarification!
For tstamp, I can actually see it in Solr results (even though the format
is weird)
How can I get Last-Modified value in Solr as well? Does Nutch need to be
configured in some way?
Remi
On Wed, Feb 15, 2012 at 3:46 PM, Lewis John Mcgibbney
Hi Remi,
On Wed, Feb 15, 2012 at 1:51 PM, remi tassing tassingr...@gmail.com wrote:
Thanks for the clarification!
For tstamp, I can actually see it in Solr results (even though the format
is weird)
what is the format?
How can I get Last-Modified value in Solr as well? Does Nutch
Hi,
tstamp shows a string of digits like 20020123123212
Never heard of the plugin index-more and it's poorly documented. After
adding it to plugin.includes, will I just need to run solrindex, or is it
necessary to re-parse or recrawl (less likely, I think)?
Thanks again
Remi
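For anyone following along: enabling index-more means extending plugin.includes in conf/nutch-site.xml. A minimal sketch, assuming a roughly stock Nutch 1.4 plugin list (your existing value may differ, so extend yours rather than copying this verbatim):

```xml
<!-- conf/nutch-site.xml: add index-more alongside index-basic -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```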
On Wednesday,
Hi,
On Wed, Feb 15, 2012 at 4:00 PM, remi tassing tassingr...@gmail.com wrote:
tstamp shows a string of digits like 20020123123212
This is OK: yyyy-mm-dd-hh-mm-ss. It is however hellishly old!
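For the curious, the 14-digit tstamp value decodes with any yyyyMMddHHmmss parser. A quick illustration in Python (not Nutch code, just to show the layout of the digits):

```python
from datetime import datetime

# Nutch stores tstamp as a compact yyyyMMddHHmmss digit string
raw = "20020123123212"
ts = datetime.strptime(raw, "%Y%m%d%H%M%S")
print(ts.isoformat())  # 2002-01-23T12:32:12
```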
Never heard of the plugin index-more and it's poorly documented.
Well it's been included in
Is there any quick way to see the impact of index-more? I deleted the parse-related
folders in the segment and re-parsed it, but when I readseg there is
no difference.
On Wednesday, February 15, 2012, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi,
On Wed, Feb 15, 2012 at 4:00 PM,
Remi, I had a similar problem but for a custom field that I was trying to post
to Solr (via solrindex) as a type=date in the schema.xml. Turns out my date
string was formatted incorrectly (it was missing the trailing Z). From the
error message it appears that perhaps the field into which this
You're both correct, after changing the type for tstamp and lastModified
from long to date, no error anymore.
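For reference, the corresponding Solr schema.xml entries would look something like this (field names assumed to match the solrindex mapping in use; note Solr date strings need the trailing Z mentioned above):

```xml
<!-- schema.xml: store Nutch timestamps as Solr dates rather than longs -->
<field name="tstamp" type="date" stored="true" indexed="true"/>
<field name="lastModified" type="date" stored="true" indexed="true"/>
```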
Next thing I need to do is set up cygwin/svn to be able to get fresh
svn/trunk code... it's so cool to be up-to-date. Nutch-1.4 is just
ridiculously faster than 1.2 :-)
Thanks!!
Remi
On
Is it
So I am trying to optimize the fetch performance, and I think that I am
failing miserably since I am not able to max out any of my resources (cpu,
ram, and more importantly bandwidth). Obviously I am not trying to max
out all of them at the same time. I just want to find out the bottleneck,
and I can't
It could be interesting to find out what exactly causes such a huge speed
difference. For me the speed increase is on the order of 10x... crazy!
On Wed, Feb 15, 2012 at 9:35 PM, Markus Jelsma
markus.jel...@openindex.iowrote:
You're both correct, after changing the type for tstamp and lastModified
Another question I should have asked is how long is the crawl delay in
robots.txt?
If you read the fetcher.max.crawl.delay property description, it explicitly
notes that the fetcher will wait however long is required by robots.txt
before it fetches the page.
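The property in question, as it would typically be overridden in conf/nutch-site.xml (a sketch; with -1 the fetcher waits out whatever Crawl-Delay robots.txt demands instead of skipping the page):

```xml
<!-- conf/nutch-site.xml -->
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>-1</value>
  <description>If robots.txt requests a Crawl-Delay above this many
  seconds the page is skipped; -1 means never skip, wait it out.</description>
</property>
```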
Do you have this information?
Thanks
my questions/doubts are inline
On Tue, Feb 14, 2012 at 4:06 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Puneet,
On Tue, Feb 14, 2012 at 5:12 AM, Puneet Pandey puneet...@gmail.com
wrote:
I have started using nutch recently.
As I understand nutch crawling is a cyclic process
As it sounds to me, it's not obvious that you would want to use Nutch to
deliver this functionality. What is it that you hope to get out of
Nutch?
Why not just write a simple java process using httpclient to fetch the
pages from your other process? Or even wget them, and extract the
content?
best
Hi,
Just a related question: does it make a big difference to fetch and parse
directly rather than fetch all first, then parse? I was under the impression
that they yield the same end result.
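The two modes are controlled by a single switch, a sketch assuming the standard fetcher.parse property (the end result should indeed be the same; fetching first and parsing separately is often preferred so a parser failure doesn't cost you already-fetched data):

```xml
<!-- conf/nutch-site.xml: false = fetch first, run the parse step separately -->
<property>
  <name>fetcher.parse</name>
  <value>false</value>
</property>
```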
Remi
On Wednesday, February 15, 2012, Markus Jelsma mar...@apache.org wrote:
my questions/doubts are