Hi all,
I just found a weird error, and it looks like a JDK bug, but I'm not sure.
Whenever I replace a URL-A that contains a number with a URL-B, I
get an error: IndexOutOfBoundsException: No group 1
In my regex-normalize.xml, I have:
<regex>
  <pattern>http://google1.com/.+/</pattern>
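The full rule is roughly this (I'm reconstructing the substitution line,
it wasn't quoted above; google2.com just stands for URL-B):

  <regex>
    <pattern>http://google1.com/.+/</pattern>
    <!-- $1 is used here, and this is where the exception is thrown -->
    <substitution>http://google2.com/$1/</substitution>
  </regex>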
Hi there,
so far I have not found a way to crawl a specific page manually.
Is there a possibility to manually set the recrawl interval or the crawl
date, or any other explicit way to make Nutch invalidate a page?
We have got 70k+ pages in the index and a full recrawl would take too
long.
Thanks
Jan
Hi,
we have kind of a similar case and we perform the following:
1. put all the URLs you want to recrawl in the regex-urlfilter.txt
2. perform a bin/nutch mergedb with the -filter param to strip those URLs
from the crawldb (see the command sketch below)
3. put the URLs from 1 into a seed file
4. remove the URLs from 1 from the regex-urlfilter.txt again
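A rough sketch of steps 2 and 4 on the command line (the paths
crawl/crawldb and urls/recrawl are just examples, adjust to your layout):

  # step 2: write a filtered copy of the crawldb; -filter applies the
  # URL filters, so the URLs excluded in step 1 are dropped
  bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter
  # after step 4 (filter cleared again): inject the seed file from step 3
  bin/nutch inject crawl/crawldb_filtered urls/recrawl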
The FreeGenerator tool is the easiest approach.
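For example (the directory names are placeholders):

  # build a fetch-ready segment straight from a list of URLs,
  # without going through the crawldb/generate step
  bin/nutch freegen urls/recrawl crawl/segments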
On Mon, 2 Apr 2012 11:29:02 +0200, Hannes Carl Meyer
hannesc...@googlemail.com wrote:
Hi,
we have kind of a similar case and we perform the following:
1. put all the URLs you want to recrawl in the regex-urlfilter.txt
2. perform a bin/nutch mergedb
Hey,
I have the same problem: no URLs to fetch, for a couple of URLs. I have
no clue how to fix that. Did you solve your problem in the meantime?
Cheers, Philipp
It depends on the structure of your site; you can modify
regex-urlfilter.txt to reach your goal.
From the examples you gave, you can do this:

  -^http://ww.mywebsite.com/[^/]*$

It will exclude http://ww.mywebsite.com/alpha, http://ww.mywebsite.com/beta
and http://ww.mywebsite.com/gamma.

  -
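Note that in regex-urlfilter.txt the first matching rule wins, so the
order matters. A minimal sketch (the host is taken from your examples;
the +. at the end is the usual catch-all accept):

  # skip the pages sitting directly under the host root
  -^http://ww.mywebsite.com/[^/]*$
  # accept everything else
  +.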
It could be one of a million reasons: seed, filter, authentication...
maybe the pages are already crawled.
Is there any clue in the log?
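Two quick things to look at (assuming the default layout):

  # fetcher and generator messages usually end up here
  tail -f logs/hadoop.log
  # show how many URLs the db knows and their fetch status
  bin/nutch readdb crawl/crawldb -stats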
Remi
On Mon, Apr 2, 2012 at 5:37 PM, jepse j...@jepse.net wrote:
Hey,
I have the same problem: no URLs to fetch, for a couple of URLs. I have
no clue how to fix
I have this site: http://www.soccer-forum.de/
When I put it into my browser it's fine. No robots.txt, no redirect... but
simply no URLs to fetch. Where can I find the reasons for not fetching?
What do you mean by "seed list"?
Is there a way to disable all filters at once?
Hi Remi,
it's not a bug, the substitution pattern is wrong:
the substitution uses the captured group $1, but nothing is captured.
The pattern should be:
  <pattern>http://google1.com/(.+)/</pattern>
Now $1 is defined and contains the part matched by .+
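Put together, the whole rule would be (the substitution line is only an
illustration, google2.com stands in for your URL-B):

  <regex>
    <!-- (.+) captures the path, so $1 is defined -->
    <pattern>http://google1.com/(.+)/</pattern>
    <!-- the captured part is reused in the replacement -->
    <substitution>http://google2.com/$1/</substitution>
  </regex>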
Besides, the rule
<regex>
Dear Remi,
thank you for your reply, but that's no good for my case,
because the first rule stops my crawl at the first section and the
second stops it right at the start point.
So I see that the sections of my website have, as a first page, URLs
with 'index.php' (e.g.:
True true, thanks!
On Tue, Apr 3, 2012 at 3:08 AM, Sebastian Nagel
wastl.na...@googlemail.com wrote:
Hi Remi,
it's not a bug, the substitution pattern is wrong.
A captured group $1 is used but nothing is captured.
The pattern should be:
  <pattern>http://google1.com/(.+)/</pattern>