Re: Root slash being stripped from file path

2013-03-28 Thread Bai Shen
Sorry. I'm using 2.1. I did a general web search and didn't find any instances of the problem. I found a couple of tutorials using the file:///data/mydir format with no mention of any issues. The problem is that the normalizers (not sure which one) strip out that leading /, which changes the URL

Re: Root slash being stripped from file path

2013-03-28 Thread Bai Shen
Finally found it in JIRA. https://issues.apache.org/jira/browse/NUTCH-1483 I'll give the patch a try and see if that fixes my issue. On Wed, Mar 27, 2013 at 4:29 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Nutch version please? Sebastian and others worked on this a while ago.

How to set politeness in Nutch 2.1?

2013-03-28 Thread Yves S. Garret
Hi, I have a really embarrassing question. After googling for some time, I still can't find out how to set the politeness level when I crawl a site. I don't want to bombard a site with requests. Any thoughts or pointers on how to do this?

Fwd: How to set politeness in Nutch 2.1?

2013-03-28 Thread Yves S. Garret
I looked into ${APACHE_NUTCH_HOME}/conf/nutch-default.xml and it gives a very good explanation of each property that I can use to throttle my crawling. I should be all set for now unless there's something that I'm seriously not getting. -- Forwarded message -- From: Yves
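
For anyone landing on this thread later, here is a minimal sketch of what such throttling overrides might look like. It assumes the usual Nutch convention of leaving nutch-default.xml untouched and putting overrides in conf/nutch-site.xml. The two property names, fetcher.server.delay and fetcher.threads.fetch, are the ones I would expect to find described in nutch-default.xml; the values are illustrative assumptions only, not settings recommended anywhere in this thread, so check the descriptions in your own copy of nutch-default.xml before relying on them.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumed example value: seconds to wait between successive
       requests to the same server. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>10.0</value>
  </property>
  <!-- Assumed example value: total number of fetcher threads. -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>5</value>
  </property>
</configuration>
```

A longer delay and fewer threads should make the crawl slower but gentler on the target site.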

Re: Root slash being stripped from file path

2013-03-28 Thread Lewis John Mcgibbney
Please also see https://issues.apache.org/jira/browse/NUTCH-1484 Sebastian resolved this one and, AFAIK, fixed the problem. On Thu, Mar 28, 2013 at 6:09 AM, Bai Shen baishen.li...@gmail.com wrote: Finally found it in JIRA. https://issues.apache.org/jira/browse/NUTCH-1483 I'll give the

error using generate in 2.x

2013-03-28 Thread kaveh minooie
Hi everyone, does anybody have any idea why I am getting this error when I run generate right after I inject to a new crawlId in local mode? (That is not to say that this doesn't happen in deploy mode or on a preexisting crawlId; I just haven't tested those.) 2013-03-28 11:06:21,911 INFO