Normalizer error: IndexOutOfBoundsException: No group 1
Hi all, I just found a weird error and it looks like a JDK bug, but I'm not sure. Whenever I replace a URL-A that contains a number with a URL-B, I get an error: IndexOutOfBoundsException: No group 1

In my regex-normalize.xml, I have:

  <regex>
    <pattern>http://google1.com/.+</pattern>
    <substitution>http://google.com/$1</substitution>
  </regex>

and trying:

  echo 'http://google2.com/whatever' | bin/nutch org.apache.nutch.net.URLNormalizerChecker

gives:

  Checking combination of all URLNormalizers available
  Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 1
          at java.util.regex.Matcher.start(Matcher.java:374)
          at java.util.regex.Matcher.appendReplacement(Matcher.java:830)
          at java.util.regex.Matcher.replaceAll(Matcher.java:905)
          at org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.regexNormalize(RegexURLNormalizer.java:181)
          at org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.normalize(RegexURLNormalizer.java:188)
          at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:286)
          at org.apache.nutch.net.URLNormalizerChecker.checkAll(URLNormalizerChecker.java:83)
          at org.apache.nutch.net.URLNormalizerChecker.main(URLNormalizerChecker.java:110)

Have you experienced this before?

Remi
recrawl a single page explicitly
Hi there, so far I have not found a way to crawl a specific page manually. Is there a way to manually set the recrawl interval or the crawl date, or any other explicit way to make Nutch invalidate a page? We have 70k+ pages in the index and a full recrawl would take too long. Thanks Jan
Re: recrawl a single page explicitly
Hi, we have kind of a similar case and we perform the following (a scripted sketch follows below):

1. Put all URLs you want to recrawl into regex-urlfilter.txt as exclude rules.
2. Perform a bin/nutch mergedb with the -filter param to strip those URLs from the crawldb.*
3. Put the URLs from step 1 into a seed file.
4. Remove the URLs from step 1 from regex-urlfilter.txt again.
5. Start the crawl with the seed file from step 3.

* This is a merge onto itself, for example:

  bin/nutch mergedb $CRAWLFOLDER/NEWMergeDB $CRAWLFOLDER/crawldb -filter

I don't know whether this is the best way to do it, but since we automated it, it works very well.

Regards

Hannes

On Mon, Apr 2, 2012 at 11:07 AM, Jan Riewe <jan.ri...@comspace.de> wrote:
> Hi there, so far I have not found a way to crawl a specific page manually. Is there a way to manually set the recrawl interval or the crawl date, or any other explicit way to make Nutch invalidate a page? We have 70k+ pages in the index and a full recrawl would take too long. Thanks Jan
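A minimal shell sketch of the steps above, assuming the $CRAWLFOLDER layout from the example; recrawl.txt and recrawl_seed are made-up names, and the sed line assumes the URLs contain no regex metacharacters that matter beyond the unescaped dots:

  CRAWLFOLDER=crawl
  FILTER=conf/regex-urlfilter.txt
  cp $FILTER $FILTER.bak
  # 1. prepend an exclude rule per URL (rules are evaluated first-match-wins)
  { sed 's|^|-^|' recrawl.txt; cat $FILTER.bak; } > $FILTER
  # 2. merge the crawldb onto itself with -filter to drop those URLs
  bin/nutch mergedb $CRAWLFOLDER/NEWMergeDB $CRAWLFOLDER/crawldb -filter
  rm -r $CRAWLFOLDER/crawldb
  mv $CRAWLFOLDER/NEWMergeDB $CRAWLFOLDER/crawldb
  # 3./4. restore the original filter file and reuse the URL list as a seed
  mv $FILTER.bak $FILTER
  mkdir -p recrawl_seed && cp recrawl.txt recrawl_seed/
  # 5. inject the seeds and run the usual generate/fetch/update cycle
  bin/nutch inject $CRAWLFOLDER/crawldb recrawl_seed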
Re: recrawl a single page explicitly
The FreeGenerator tool is the easiest approach.

On Mon, 2 Apr 2012 11:29:02 +0200, Hannes Carl Meyer <hannesc...@googlemail.com> wrote:
> Hi, we have kind of a similar case and we perform the following:
> 1. Put all URLs you want to recrawl into regex-urlfilter.txt as exclude rules.
> 2. Perform a bin/nutch mergedb with the -filter param to strip those URLs from the crawldb.*
> 3. Put the URLs from step 1 into a seed file.
> 4. Remove the URLs from step 1 from regex-urlfilter.txt again.
> 5. Start the crawl with the seed file from step 3.
> [...]

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350
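For reference, a minimal sketch of that approach, assuming the bin/nutch freegen alias of Nutch 1.x and a made-up URL list directory:

  # Build a fetch segment directly from a list of URLs, bypassing the
  # crawldb's scheduling; recrawl_urls/ is a hypothetical directory of
  # text files with one URL per line.
  bin/nutch freegen recrawl_urls crawl/segments

The new segment under crawl/segments can then be fetched, parsed and used to update the crawldb as in a normal crawl cycle; -filter and -normalize options can be passed if the URLs should first go through the configured filters and normalizers.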
Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com
Hey, I have the same problem: no URLs to fetch, for a couple of URLs. I have no clue how to fix that. Did you solve your problem in the meantime? Cheers, Philipp
Re: crawling a website
It depends on the structure of your site, and you can modify regex-urlfilter.txt to reach your goal. From the examples you gave, you can do this:

  -^http://ww.mywebsite.com/[^/]*$

This will exclude http://ww.mywebsite.com/alpha, http://ww.mywebsite.com/beta and http://ww.mywebsite.com/gamma.

  -^http://ww.mywebsite.com/.*/$

This will exclude any URL that ends with /.

I would suggest you get familiar with regular expressions, in case you aren't yet. A combined sketch follows below.

Remi

On Sun, Apr 1, 2012 at 6:27 PM, alessio crisantemi <alessio.crisant...@gmail.com> wrote:
> Dear All, I would like to change my crawling setup but I don't know how. To crawl my website I used the following command:
>
>   $ bin/nutch crawl urls -solr http://localhost:8983/solr -threads 3 -depth 35 -topN 10
>
> to crawl with Nutch and index the results into Solr. But I would like to crawl only the single pages of my website, not its sections. For example, consider a site www.mywebsite.com composed of 3 sections:
>
>   http://ww.mywebsite.com/alpha
>   http://ww.mywebsite.com/beta
>   http://ww.mywebsite.com/gamma
>
> I want among my results only the single pages of my articles, not the article lists in those directories. So I would like, for example, the parsing of the files:
>
>   http://ww.mywebsite.com/alpha/artcle1.html
>   http://ww.mywebsite.com/alpha/artcle3.html
>   ...
>
> but not the parsing of the parent section:
>
>   http://ww.mywebsite.com/alpha/
>
> How can I do this? Any suggestions? Sorry if not everything is clear. Thank you, alessio
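Putting both rules together, a regex-urlfilter.txt sketch for this case might look like the following (the host is taken from the examples, the dots are escaped, and rule order matters because the first matching pattern wins):

  # exclude section roots like http://ww.mywebsite.com/alpha (no trailing slash)
  -^http://ww\.mywebsite\.com/[^/]+$
  # exclude any URL ending with a slash, e.g. http://ww.mywebsite.com/alpha/
  -^http://ww\.mywebsite\.com/.*/$
  # accept the article pages
  +^http://ww\.mywebsite\.com/.+\.html$
  # skip everything else
  -.

One caveat: a URL excluded at crawl time is never fetched, so links that appear only on the excluded section pages will not be discovered. If the articles are linked only from there, filtering at index time instead may be the better option.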
Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com
It could be a million reasons: seed, filter, authentication... maybe the pages are already crawled... Is there any clue in the log?

Remi

On Mon, Apr 2, 2012 at 5:37 PM, jepse <j...@jepse.net> wrote:
> Hey, I have the same problem: no URLs to fetch, for a couple of URLs. I have no clue how to fix that. Did you solve your problem in the meantime? Cheers, Philipp
Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com
I have this site: http://www.soccer-forum.de/

When I put it into my browser it's fine: no robots.txt, no redirect... but still no URLs to fetch. Where can I find the reasons for it not fetching? What do you mean by "seed list"? And is there a way to disable all filters at once?
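One way to see whether a URL filter is the culprit is to run the URL through all configured filters with the filter checker; a sketch assuming the Nutch 1.x checker class, where the exact flags may vary by version:

  # prints the URL prefixed with '+' if it passes all URL filters,
  # '-' if some filter rejects it
  echo 'http://www.soccer-forum.de/' | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

The generator and fetcher also log their decisions to logs/hadoop.log, which is usually the quickest place to look.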
Re: Normalizer error: IndexOutOfBoundsException: No group 1
Hi Remi,

it's not a bug; the substitution pattern is wrong. A captured group $1 is used, but nothing is captured. The pattern should be:

  <pattern>http://google1.com/(.+)</pattern>

Now $1 is defined and contains the part matched by .+

Besides, the rule

  <regex>
    <pattern>^http://google1\.com/</pattern>
    <substitution>http://google.com/</substitution>
  </regex>

will do (almost) the same and should be faster - capturing content has some cost.

Sebastian

On 04/02/2012 09:40 AM, remi tassing wrote:
> Hi all, I just found a weird error and it looks like a JDK bug, but I'm not sure. [...] In my regex-normalize.xml, I have:
>
>   <regex>
>     <pattern>http://google1.com/.+</pattern>
>     <substitution>http://google.com/$1</substitution>
>   </regex>
> [...]
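The exception itself is plain java.util.regex behaviour, nothing Nutch-specific; a minimal standalone sketch (the class name is made up):

  import java.util.regex.Pattern;

  public class NoGroupDemo {
      public static void main(String[] args) {
          String url = "http://google1.com/whatever";
          try {
              // No capturing group in the pattern, but the replacement
              // references $1: appendReplacement() throws "No group 1".
              Pattern.compile("http://google1.com/.+")
                     .matcher(url)
                     .replaceAll("http://google.com/$1");
          } catch (IndexOutOfBoundsException e) {
              System.out.println(e); // java.lang.IndexOutOfBoundsException: No group 1
          }
          // With (.+) the group is captured and $1 is defined.
          System.out.println(Pattern.compile("http://google1.com/(.+)")
                                    .matcher(url)
                                    .replaceAll("http://google.com/$1"));
          // -> http://google.com/whatever
      }
  }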
Re: crawling a website
Dear Remi, thank you for your reply, but that's no good for my case: the first rule stops my crawl at the first section, and the second stops it right at the start point. I see that the sections of my website have as their first page a URL with 'index.php' (e.g. http://ww.mywebsite.com/beta/index.php). So, to crawl all of a section (http://ww.mywebsite.com/beta) but not include the parsing of the http://ww.mywebsite.com/beta/index.php page, which is the correct rule? Maybe the following, or something similar?

  -^http://ww.mywebsite.com/index-php$

Thanks, alessio

On 2 April 2012 11:40, remi tassing <tassingr...@gmail.com> wrote:
> It depends on the structure of your site, and you can modify regex-urlfilter.txt to reach your goal. [...]
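As a side note, the rule proposed above would not match those pages: the index.php files sit one directory below the host, and the dot needs escaping. Something closer, with the host and layout as in the earlier examples, would be:

  # exclude the section front pages, e.g. http://ww.mywebsite.com/beta/index.php
  -^http://ww\.mywebsite\.com/[^/]+/index\.php$

Note, though, that if the articles are linked from index.php, excluding it at crawl time will also stop Nutch from discovering them; excluding it only at index time would avoid that.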
Re: Normalizer error: IndexOutOfBoundsException: No group 1
True true, thanks!

On Tue, Apr 3, 2012 at 3:08 AM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> Hi Remi, it's not a bug; the substitution pattern is wrong. A captured group $1 is used, but nothing is captured. The pattern should be:
>
>   <pattern>http://google1.com/(.+)</pattern>
>
> [...]