Thanks, Feng. I thought this might be a common problem when using Nutch. I'll try your suggestions.
Best Regards,
Jun Zhou
University of Southern California
http://www-scf.usc.edu/~junzhou

On Wed, Apr 10, 2013 at 7:11 AM, feng lu <[email protected]> wrote:

> Hi Jun
>
> Can you use one regex pattern to match all the special cases? Or maybe you
> can extend your own URL normalizer plugin to fit your requirement.
>
> On Wed, Apr 10, 2013 at 8:17 PM, Rajani Maski <[email protected]> wrote:
>
> > Hi,
> >
> > I think this thread should be useful:
> > http://lucene.472066.n3.nabble.com/Parsed-content-in-form-of-special-characters-td4047239.html
> >
> > Thanks & Regards
> > Rajani Maski
> >
> > On Sun, Apr 7, 2013 at 4:56 AM, Jun Zhou <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > I'm using Nutch 1.6 to crawl a web site which has lots of special
> > > characters in the URL, like "?", "=", "@", etc. For each character, I
> > > can add a regex in regex-normalize.xml to change it into percent
> > > encoding.
> > >
> > > My question is, is there an easier way to do this? Like a url-encode
> > > method to encode all the special characters rather than adding regexes
> > > one by one?
> > >
> > > Thanks!
>
> --
> Don't Grow Old, Grow Up... :-)
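[Editor's note: a minimal sketch of what Feng's suggested custom normalizer could do internally, using only `java.net.URLEncoder` from the JDK. This is not Nutch API code; the class and method names (`QueryEncodeSketch`, `encodeQuery`) are hypothetical, and a real plugin would implement Nutch's URLNormalizer extension point instead. The idea is to percent-encode every key and value in the query string in one pass, instead of adding one regex per character to regex-normalize.xml.]

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Nutch: percent-encode the
// query-string portion of a URL in one pass.
public class QueryEncodeSketch {

    // Encode everything after the first '?', key by key and value by
    // value, leaving the path and the '=' / '&' separators intact.
    public static String encodeQuery(String url) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url; // no query string, nothing to do
        }
        String[] pairs = url.substring(q + 1).split("&");
        StringBuilder sb = new StringBuilder(url.substring(0, q)).append('?');
        for (int i = 0; i < pairs.length; i++) {
            if (i > 0) sb.append('&');
            int eq = pairs[i].indexOf('=');
            if (eq < 0) {
                // bare key with no value
                sb.append(URLEncoder.encode(pairs[i], StandardCharsets.UTF_8));
            } else {
                sb.append(URLEncoder.encode(pairs[i].substring(0, eq), StandardCharsets.UTF_8))
                  .append('=')
                  .append(URLEncoder.encode(pairs[i].substring(eq + 1), StandardCharsets.UTF_8));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // '@' becomes %40; URLEncoder encodes a space as '+' in query strings
        System.out.println(encodeQuery("http://example.com/page?user=a@b&tag=c d"));
        // prints http://example.com/page?user=a%40b&tag=c+d
    }
}
```

Note that `URLEncoder` applies `application/x-www-form-urlencoded` rules (space becomes `+`), which is fine for query strings but not for path segments; a normalizer that also needs to encode the path would have to handle that part separately.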

