Re: questions regarding nutch url normalizer

Sebastian Nagel Thu, 11 Jul 2013 13:23:30 -0700

I would strongly recommend testing the normalizer(s) before crawling.
There are two handy tools for seeing what you get after normalization:


echo "http://www.example/(sndjnc22e3r3r))/abc.com" \
  | $NUTCH_HOME/bin/nutch org.apache.nutch.net.URLNormalizerChecker

$NUTCH_HOME/bin/nutch plugin urlnormalizer-regex \
  org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer <url>

And yes, you can combine this with the URL filter checker:

cat urls.txt \
  | $NUTCH_HOME/bin/nutch org.apache.nutch.net.URLNormalizerChecker \
  | $NUTCH_HOME/bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
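
If you want to try out the pattern itself before it goes into the regex
normalizer's rule file, a rough sketch with sed may help. The pattern, the
helper name strip_paren_segment and the expected result
(http://www.example/abc.com) are my assumptions about what is wanted here,
not something taken from your configuration. Also note that sed -E uses
POSIX extended regular expressions, while the regex normalizer uses Java
regular expressions; for a pattern this simple the two should agree, but
the checker commands above remain the real test.

```shell
# Hypothetical helper (not part of Nutch): removes any path segment that
# starts with "(" and ends with ")", together with its leading slash.
strip_paren_segment() {
  # "#" is the sed delimiter so the slash needs no escaping.
  # \( and \) are literal parentheses in extended regex syntax.
  sed -E 's#/\([^/]*\)##g'
}

echo "http://www.example/(sndjnc22e3r3r))/abc.com" | strip_paren_segment
# prints http://www.example/abc.com
```

Once the output looks right, the same pattern and an empty substitution can
be carried over to the normalizer rule and verified with the checkers.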

On 07/11/2013 07:59 AM, devang pandey wrote:
> Hello, I am working on Nutch 1.2 to crawl a site. A few of the URLs look like
> www.example/(sndjnc22e3r3r))/abc.com. I want to strip out the part inside
> the brackets to normalize my URLs. For this I wrote a regex in my regex
> normalizer and substituted it. Now I am crawling again but am still not able
> to get proper results.
>
> Please guide me in solving this issue.
>
