Normalizer error: IndexOutOfBoundsException: No group 1

2012-04-02 Thread remi tassing
Hi all,

I just found a weird error and it looks like a JDK bug, but I'm not sure.
Whenever I replace a URL-A that contains a number with a URL-B, I
get an error: IndexOutOfBoundsException: No group 1

In my regex-normalize.xml, I have:
<regex>
  <pattern>http://google1.com/.+</pattern>
  <substitution>http://google.com/$1</substitution>
</regex>

and trying:
echo 'http://google2.com/whatever' | bin/nutch org.apache.nutch.net.URLNormalizerChecker
gives:
Checking combination of all URLNormalizers available
Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 1
at java.util.regex.Matcher.start(Matcher.java:374)
at java.util.regex.Matcher.appendReplacement(Matcher.java:830)
at java.util.regex.Matcher.replaceAll(Matcher.java:905)
at
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.regexNormalize(RegexURLNormalizer.java:181)
at
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.normalize(RegexURLNormalizer.java:188)
at
org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:286)
at
org.apache.nutch.net.URLNormalizerChecker.checkAll(URLNormalizerChecker.java:83)
at
org.apache.nutch.net.URLNormalizerChecker.main(URLNormalizerChecker.java:110)

Have you experienced this before?

Remi


recrawl a single page explicit

2012-04-02 Thread Jan Riewe
Hi there,

So far I did not find a way to crawl a specific page manually.
Is there a possibility to manually set the recrawl interval or the crawl
date, or any other explicit way to make Nutch invalidate a page?

We have got 70k+ pages in the index and a full recrawl would take too
long.

Thanks 
Jan


Re: recrawl a single page explicit

2012-04-02 Thread Hannes Carl Meyer
Hi,

we have kind of a similar case and we perform the following:

1. put all URLs you want to recrawl in the regex-urlfilter.txt
2. perform a bin/nutch mergedb with the -filter param to strip those URLs from
the crawl-db *
3. put the URLs from 1 into a seed file
4. remove the URLs from 1 from the regex-urlfilter.txt
5. start the crawl with the seed file from 3

* This is a merge on itself, for example: bin/nutch mergedb
$CRAWLFOLDER/NEWMergeDB $CRAWLFOLDER/crawldb -filter

I don't know whether this is the best way to do it, but since we automated it, it
works very well.
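A rough shell sketch of those steps (the crawl folder name, the seed directory, and the URL list file are assumptions for illustration, not part of the actual setup):

  # 1. add the URLs you want to recrawl as exclude rules in conf/regex-urlfilter.txt
  #    (done manually or by a script)
  # 2. merge the crawldb onto itself with -filter so those URLs are dropped
  bin/nutch mergedb crawl/NEWMergeDB crawl/crawldb -filter
  # 3. put the same URLs into a fresh seed directory
  mkdir -p recrawl-seed && cp recrawl-urls.txt recrawl-seed/seed.txt
  # 4. remove the temporary exclude rules from conf/regex-urlfilter.txt again
  # 5. inject the seeds into the merged db and run a normal generate/fetch/parse/updatedb cycle
  bin/nutch inject crawl/NEWMergeDB recrawl-seed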

Regards

Hannes

On Mon, Apr 2, 2012 at 11:07 AM, Jan Riewe jan.ri...@comspace.de wrote:

 Hi there,

 So far I did not find a way to crawl a specific page manually.
 Is there a possibility to manually set the recrawl interval or the crawl
 date, or any other explicit way to make Nutch invalidate a page?

 We have got 70k+ pages in the index and a full recrawl would take too
 long.

 Thanks
 Jan



Re: recrawl a single page explicit

2012-04-02 Thread Markus Jelsma

The FreeGenerator tool is the easiest approach.
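For reference, a minimal sketch of that approach (directory names are assumptions; check the bin/nutch freegen usage on your Nutch version):

  # put the URLs to recrawl into a plain text file inside their own directory
  mkdir -p recrawl && echo 'http://www.example.com/page.html' > recrawl/urls.txt
  # generate a fetch list directly from that file, bypassing the crawldb schedule
  bin/nutch freegen recrawl crawl/segments
  # fetch, parse and update the crawldb with the newly created segment
  SEGMENT=$(ls -d crawl/segments/* | tail -1)
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT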

On Mon, 2 Apr 2012 11:29:02 +0200, Hannes Carl Meyer 
hannesc...@googlemail.com wrote:

Hi,

we have kind of a similar case and we perform the following:

1. put all URLs you want to recrawl in the regex-urlfilter.txt
2. perform a bin/nutch mergedb with the -filter param to strip those URLs
from the crawl-db *
3. put the URLs from 1 into a seed file
4. remove the URLs from 1 from the regex-urlfilter.txt
5. start the crawl with the seed file from 3

* This is a merge on itself, for example: bin/nutch mergedb
$CRAWLFOLDER/NEWMergeDB $CRAWLFOLDER/crawldb -filter

I don't know whether this is the best way to do it, but since we automated it, it
works very well.

Regards

Hannes




--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350


Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2012-04-02 Thread jepse
Hey,

I have the same problem: no URLs to fetch, for a couple of URLs. I have no clue
how to fix that. Did you solve your problem in the meantime?

Cheers, Philipp



Re: crawling a website

2012-04-02 Thread remi tassing
It depends on the structure of your site and you can modify
regex-urlfilter.txt to reach your goal.

From the examples you gave, you can do this:
-^http://ww.mywebsite.com/[^/]*$
This will exclude http://ww.mywebsite.com/alpha, http://ww.mywebsite.com/beta
and http://ww.mywebsite.com/gamma.

-^http://ww.mywebsite.com/.*/$
This will exclude any URL that ends with /
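As a sketch, the relevant part of regex-urlfilter.txt could then look like this, assuming ww.mywebsite.com is the only host being crawled (rules are checked top to bottom and the first match wins; the escaped dots are just stricter regex style):

  # skip the section front pages: no further path, or a path ending in /
  -^http://ww\.mywebsite\.com/[^/]*$
  -^http://ww\.mywebsite\.com/.*/$
  # accept the article pages, e.g. http://ww.mywebsite.com/alpha/artcle1.html
  +^http://ww\.mywebsite\.com/
  # skip everything else
  -.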

I would suggest you get familiar with regular expressions (in case you
aren't yet).

Remi

On Sun, Apr 1, 2012 at 6:27 PM, alessio crisantemi 
alessio.crisant...@gmail.com wrote:

 Dear All,
 I would like to change my crawling operation but I don't know how to do it.

 To crawl my website I used the following command:

 $ bin/nutch crawl urls -solr http://localhost:8983/solr -threads 3 -depth
 35 -topN 10

 to crawl with Nutch and index the results in the Solr index.



 But I would like to crawl not the section pages of my website, only the
 individual pages.

 For example:

 Consider a site www.mywebsite.com composed of 3 sections:

 http://ww.mywebsite.com/alpha

 http://ww.mywebsite.com/beta

 http://ww.mywebsite.com/gamma



 So, among my results I want only the individual article pages, and
 not the lists of articles in these directories.

 So I would like, for example, the parsing of the files:

 http://ww.mywebsite.com/alpha/artcle1.html

 http://ww.mywebsite.com/alpha/artcle3.html

 ...



 and I don't want the parsing of the parent section:

 http://ww.mywebsite.com/alpha/



 How can I do that?

 Any suggestion?

 Sorry if it's not all clear.

 thank you

 alessio



Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2012-04-02 Thread remi tassing
It could be a million reasons: seed, filter, authentication... maybe the
pages are already crawled...

Is there any clue in the log?
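Two quick things to check, assuming a default local layout (the grep pattern is only an example):

  # filter/normalizer rejections and fetch errors usually show up in the log
  grep -iE 'rejected|denied|no urls' logs/hadoop.log | tail
  # confirm whether the seed URLs made it into the crawldb at all
  bin/nutch readdb crawl/crawldb -stats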

Remi

On Mon, Apr 2, 2012 at 5:37 PM, jepse j...@jepse.net wrote:

 Hey,

 I have the same problem: no URLs to fetch, for a couple of URLs. I have no clue
 how to fix that. Did you solve your problem in the meantime?

 Cheers, Philipp




Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2012-04-02 Thread jepse
I have this site: http://www.soccer-forum.de/

When I put it into my browser it's fine. No robots.txt, no redirect... but
simply no URLs to fetch. Where can I find the reasons for not fetching?
What do you mean by seed list?

Is there a way to disable all filters at once?



Re: Normalizer error: IndexOutOfBoundsException: No group 1

2012-04-02 Thread Sebastian Nagel

Hi Remi,

it's not a bug, the substitution pattern is wrong.
A captured group $1 is used but nothing is captured.
The pattern should be:

<pattern>http://google1.com/(.+)</pattern>

Now $1 is defined and contains the part matched by .+
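To see the underlying java.util.regex behaviour in isolation, here is a small standalone sketch (not Nutch code, just an illustration of the two patterns):

  import java.util.regex.Pattern;

  public class GroupDemo {
      public static void main(String[] args) {
          String url = "http://google1.com/whatever";

          // No capturing group in the pattern, but "$1" in the replacement:
          // replaceAll() throws IndexOutOfBoundsException: No group 1
          // Pattern.compile("http://google1.com/.+").matcher(url)
          //        .replaceAll("http://google.com/$1");

          // With (.+) group 1 exists and holds the path part
          String fixed = Pattern.compile("http://google1.com/(.+)")
                                .matcher(url)
                                .replaceAll("http://google.com/$1");
          System.out.println(fixed); // prints http://google.com/whatever
      }
  }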

Besides, the rule

<regex>
   <pattern>^http://google1\.com/</pattern>
   <substitution>http://google.com/</substitution>
</regex>

will do (almost) the same and should be faster - capturing
content has some cost.

Sebastian


On 04/02/2012 09:40 AM, remi tassing wrote:

Hi all,

I just found a weird error and it looks like a JDK bug, but I'm not sure.
Whenever I replace a URL-A that contains a number with a URL-B, I
get an error: IndexOutOfBoundsException: No group 1

In my regex-normalize.xml, I have:
<regex>
   <pattern>http://google1.com/.+</pattern>
   <substitution>http://google.com/$1</substitution>
</regex>

and trying:
echo 'http://google2.com/whatever' | bin/nutch org.apache.nutch.net.URLNormalizerChecker
gives:
Checking combination of all URLNormalizers available
Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 1
 at java.util.regex.Matcher.start(Matcher.java:374)
 at java.util.regex.Matcher.appendReplacement(Matcher.java:830)
 at java.util.regex.Matcher.replaceAll(Matcher.java:905)
 at
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.regexNormalize(RegexURLNormalizer.java:181)
 at
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.normalize(RegexURLNormalizer.java:188)
 at
org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:286)
 at
org.apache.nutch.net.URLNormalizerChecker.checkAll(URLNormalizerChecker.java:83)
 at
org.apache.nutch.net.URLNormalizerChecker.main(URLNormalizerChecker.java:110)

Have you experienced this before?

Remi





Re: crawling a website

2012-04-02 Thread alessio crisantemi
Dear Remi,
thank you for your reply, but that doesn't work for my case,
because the first rule stops my crawling at the first section and the
second stops it right at the start point.

I see that the sections of my website have, as a first page, a URL
with 'index.php' (e.g. http://ww.mywebsite.com/beta/index.php).
So, to crawl a whole section (http://ww.mywebsite.com/beta) but not
include the parsing of the http://ww.mywebsite.com/beta/index.php page,
which is the correct rule?

(Maybe the following?
- ^http://ww.mywebsite.com/index-php$ ) or similar?
thanks
alessio
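For what it's worth, a rule along these lines should exclude only the section index pages while keeping the articles (a sketch only: it assumes every section front page is served as /section/index.php, and escapes the dots as in the earlier examples):

  -^http://ww\.mywebsite\.com/[^/]+/index\.php$
  +^http://ww\.mywebsite\.com/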



On 2 April 2012 at 11:40, remi tassing tassingr...@gmail.com
wrote:

 It depends on the structure of your site and you can modify
 regex-urlfilter.txt to reach your goal.

 From the examples you gave, you can do this:
 -^http://ww.mywebsite.com/[^/]*$
 This will exclude http://ww.mywebsite.com/alpha,
 http://ww.mywebsite.com/beta
 and http://ww.mywebsite.com/gamma.

 -^http://ww.mywebsite.com/.*/$
 This will exclude any URL that ends with /

 I would suggest you get familiar with regular expressions (in case you
 aren't yet).

 Remi




Re: Normalizer error: IndexOutOfBoundsException: No group 1

2012-04-02 Thread remi tassing
True true, thanks!

On Tue, Apr 3, 2012 at 3:08 AM, Sebastian Nagel
wastl.na...@googlemail.com wrote:

 Hi Remi,

 it's not a bug, the substitution pattern is wrong.
 A captured group $1 is used but nothing is captured.
 The pattern should be:

 <pattern>http://google1.com/(.+)</pattern>

 Now $1 is defined and contains the part matched by .+

 Besides, the rule

 <regex>
   <pattern>^http://google1\.com/</pattern>
   <substitution>http://google.com/</substitution>
 </regex>

 will do (almost) the same and should be faster - capturing
 content has some cost.

 Sebastian


