Normalizer error: IndexOutOfBoundsException: No group 1

2012-04-02 Thread remi tassing
Hi all, I just found a weird error and it looks like a JDK bug, but I'm not sure. Whenever I replace a URL-A that contains a number with a URL-B, I get this error: IndexOutOfBoundsException: No group 1. In my regex-normalize.xml I have: <regex><pattern>http://google1.com/.+</pattern>

recrawl a single page explicit

2012-04-02 Thread Jan Riewe
Hi there, so far I have not found a way to crawl a specific page manually. Is there a way to manually set the recrawl interval or the crawl date, or any other explicit way to make Nutch invalidate a page? We have 70k+ pages in the index and a full recrawl would take too long. Thanks Jan

Re: recrawl a single page explicit

2012-04-02 Thread Hannes Carl Meyer
Hi, we have kind of a similar case and we perform the following: 1) put all URLs you want to recrawl into regex-urlfilter.txt; 2) perform a bin/nutch mergedb with the -filter param to strip those URLs from the crawldb; 3) put the URLs from 1) into a seed file; 4) remove the URLs from 1) from the
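Step 2 of the recipe above can be sketched as a shell session. All paths are assumptions, and the db swap at the end is an illustrative extra step the message does not show; mergedb with -filter applies the configured URL filters while merging, dropping every entry they reject:

```shell
# Assumes the crawldb lives at crawl/crawldb and regex-urlfilter.txt
# now contains exclude rules (-^http://...) for the URLs to recrawl.
# The merged output goes to a new directory (name is hypothetical).
bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter

# Swap the filtered db in place of the old one.
mv crawl/crawldb crawl/crawldb_old
mv crawl/crawldb_filtered crawl/crawldb
```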

Re: recrawl a single page explicit

2012-04-02 Thread Markus Jelsma
The FreeGenerator tool is the easiest approach. On Mon, 2 Apr 2012 11:29:02 +0200, Hannes Carl Meyer hannesc...@googlemail.com wrote: Hi, we have kind of a similar case and we perform the following: 1 put all URLs you want to recrawl in the regex-urlfilter.txt 2 perform a bin/nutch mergedb
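Markus's suggestion can be sketched as follows; directory and file names are assumptions. FreeGenerator builds a fetch segment directly from a plain list of URLs, bypassing the generator's fetch scheduling, so the listed pages are fetched regardless of their next-fetch time:

```shell
# urls/recrawl.txt holds the pages to refetch, one URL per line
# (file and directory names are hypothetical).
bin/nutch freegen urls crawl/segments

# Then run the usual cycle on the newly created segment:
s=crawl/segments/$(ls -t crawl/segments | head -1)
bin/nutch fetch "$s"
bin/nutch parse "$s"
bin/nutch updatedb crawl/crawldb "$s"
```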

Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2012-04-02 Thread jepse
Hey, I have the same problem: no URLs to fetch, for a couple of URLs. I have no clue how to fix that. Did you solve your problem in the meantime? Cheers, Philipp

Re: crawling a website

2012-04-02 Thread remi tassing
It depends on the structure of your site; you can modify regex-urlfilter.txt to reach your goal. From the examples you gave, you can do this: -^http://ww.mywebsite.com/[^/]*$ It will exclude http://ww.mywebsite.com/alpha, http://ww.mywebsite.com/beta, http://ww.mywebsite.com/gamma -
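The exclude rule can be checked outside Nutch with grep -E, which understands the same extended-regex syntax; the URLs below are the hypothetical examples from the thread. The lines grep prints are exactly the ones a leading "-" rule would exclude:

```shell
# URLs with no further "/" after the host part match [^/]*$ and would
# be excluded by the rule; deeper URLs do not match and survive.
printf '%s\n' \
  'http://ww.mywebsite.com/alpha' \
  'http://ww.mywebsite.com/beta' \
  'http://ww.mywebsite.com/alpha/page1.html' \
  | grep -E '^http://ww.mywebsite.com/[^/]*$'
```

Running it prints the alpha and beta URLs only; the deeper alpha/page1.html line is not matched.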

Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2012-04-02 Thread remi tassing
It could be one of a million reasons: seed, filter, authentication... maybe the pages are already crawled... Is there any clue in the log? Remi On Mon, Apr 2, 2012 at 5:37 PM, jepse j...@jepse.net wrote: Hey, I have the same problem: no URLs to fetch, for a couple of URLs. I have no clue how to fix

Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2012-04-02 Thread jepse
I have this site: http://www.soccer-forum.de/ When I put it into my browser it's fine. No robots.txt, no redirect... but simply no URLs to fetch. Where can I find the reasons for not fetching? What do you mean by seed list? Is there a way to disable all filters at once?

Re: Normalizer error: IndexOutOfBoundsException: No group 1

2012-04-02 Thread Sebastian Nagel
Hi Remi, it's not a bug, the substitution pattern is wrong. A captured group $1 is used but nothing is captured. The pattern should be: <pattern>http://google1.com/(.+)</pattern> Now $1 is defined and contains the part matched by .+ Besides, the rule regex
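Sebastian's fix in context: a minimal sketch of a regex-normalize.xml rule with the corrected pattern. The substitution line and target URL are assumptions, since the original message only shows the pattern:

```xml
<?xml version="1.0"?>
<regex-normalize>
  <!-- (.+) captures the path so that $1 is defined in the substitution.
       The target URL below is hypothetical; the thread does not show it. -->
  <regex>
    <pattern>http://google1.com/(.+)</pattern>
    <substitution>http://example.org/$1</substitution>
  </regex>
</regex-normalize>
```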

Re: crawling a website

2012-04-02 Thread alessio crisantemi
Dear Remi, thank you for your reply, but that's no good for my case, because the first rule stops my crawl at the first section and the second stops it right at the start point. So, I see that the sections of my website each have, as a first page, a URL with 'index.php' (e.g.:

Re: Normalizer error: IndexOutOfBoundsException: No group 1

2012-04-02 Thread remi tassing
True, true, thanks! On Tue, Apr 3, 2012 at 3:08 AM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi Remi, it's not a bug, the substitution pattern is wrong. A captured group $1 is used but nothing is captured. The pattern should be: <pattern>http://google1.com/(.+)
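The difference between the two patterns can be reproduced on the command line with sed -E, which uses \1 where Java uses $1; the replacement URL here is hypothetical. With no capturing group the back-reference is rejected outright (sed fails at parse time, where Java's replaceAll throws IndexOutOfBoundsException: No group 1 at run time); with (.+) the substitution works:

```shell
# Fixed pattern: (.+) captures the path, so the back-reference resolves.
echo 'http://google1.com/page/42' \
  | sed -E 's|http://google1\.com/(.+)|http://example.org/\1|'

# Broken pattern: there is no group to refer to, so sed rejects the
# expression (GNU sed: "invalid reference \1 on `s' command's RHS").
echo 'http://google1.com/page/42' \
  | sed -E 's|http://google1\.com/.+|http://example.org/\1|' || true
```

The first command prints http://example.org/page/42; the second never produces output because the expression fails to compile.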