Thanks, Feng.

I thought this might be a common problem when using Nutch. I'll try your
suggestions.
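For anyone finding this thread later, here is a rough sketch of the encoding step such a custom normalizer plugin could perform (this is not the actual Nutch plugin API; the class and method names below are made up for illustration). It percent-encodes everything in the query string except the `=`/`&` separators, instead of adding one regex rule per character:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch only: the kind of normalization a custom URL normalizer
// plugin could do. Class and method names are illustrative, not Nutch API.
public class QueryEncodingSketch {

    // Percent-encode the query string, keeping the path and the
    // '=' / '&' key-value separators intact.
    public static String normalize(String url) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url; // no query string, nothing to do
        }
        StringBuilder out = new StringBuilder(url.substring(0, q + 1));
        String query = url.substring(q + 1);
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (c == '=' || c == '&') {
                out.append(c); // preserve key/value structure
            } else {
                out.append(encode(String.valueOf(c)));
            }
        }
        return out.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 always exists
        }
    }

    public static void main(String[] args) {
        // prints http://example.com/page?user=a%40b
        System.out.println(normalize("http://example.com/page?user=a@b"));
    }
}
```

Note that `URLEncoder` does form encoding (e.g. spaces become `+`), so for strict RFC 3986 path encoding you would need to adjust it, but it covers characters like `@` in one place.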


Best Regards,
Jun Zhou
University of Southern California
http://www-scf.usc.edu/~junzhou


On Wed, Apr 10, 2013 at 7:11 AM, feng lu <[email protected]> wrote:

> Hi Jun
>
> Can you use one regex pattern to match all the special cases? Or maybe you
> can extend your own URL normalizer plugin to fit your requirement.
>
>
> On Wed, Apr 10, 2013 at 8:17 PM, Rajani Maski <[email protected]>
> wrote:
>
> > Hi,
> >
> >  I think this thread should be useful:
> >
> >
> http://lucene.472066.n3.nabble.com/Parsed-content-in-form-of-special-characters-td4047239.html
> >
> >
> >
> > Thanks & Regards
> > Rajani Maski
> >
> >
> >
> > On Sun, Apr 7, 2013 at 4:56 AM, Jun Zhou <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > I'm using Nutch 1.6 to crawl a web site that has lots of special
> > > characters in its URLs, like "?,=@" etc.  For each character, I can add
> > > a regex to regex-normalize.xml to change it into percent encoding.
> > >
> > > My question is: is there an easier way to do this? For example, a
> > > url-encode method that encodes all the special characters, rather than
> > > adding regexes one by one?
> > >
> > > Thanks!
> > >
> >
>
>
>
> --
> Don't Grow Old, Grow Up... :-)
>