Re: encode special characters in url

2013-04-10 Thread Rajani Maski
Hi,

 I think this thread should be useful:
http://lucene.472066.n3.nabble.com/Parsed-content-in-form-of-special-characters-td4047239.html



Thanks & Regards
Rajani Maski






Re: encode special characters in url

2013-04-10 Thread feng lu
Hi Jun

Can you use one regex pattern to match all the special cases? Or maybe you
can write your own URL normalizer plugin to fit your requirement.
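
As a rough sketch of what the core of such a normalizer could look like, the
class below percent-encodes everything after the host except letters, digits,
"-._~" and "/". The names UrlEscaper, escape and escapeAfterHost are
hypothetical and are not part of Nutch; wiring this into a Nutch URL
normalizer plugin is not shown. It assumes the URLs are not already
percent-encoded (an existing "%" would be encoded again), and it also encodes
"?" because the original question lists it, which may not be what every site
needs.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a Nutch class: a sketch of the encoding step only.
final class UrlEscaper {
    // Non-alphanumeric characters left untouched: RFC 3986 "unreserved"
    // punctuation plus '/' so the path structure survives.
    private static final String KEEP = "-._~/";

    /** Percent-encodes every UTF-8 byte that is not an ASCII letter, digit, or in KEEP. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean alnum = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9');
            if (alnum || KEEP.indexOf(c) >= 0) {
                out.append((char) c);
            } else {
                out.append('%').append(String.format("%02X", c));
            }
        }
        return out.toString();
    }

    /** Leaves "scheme://host" alone and escapes the rest of an absolute URL. */
    static String escapeAfterHost(String url) {
        int scheme = url.indexOf("://");
        if (scheme < 0) {
            return url; // not an absolute URL: leave it untouched
        }
        int pathStart = url.indexOf('/', scheme + 3);
        if (pathStart < 0) {
            return url; // no path component, nothing to escape
        }
        return url.substring(0, pathStart) + escape(url.substring(pathStart));
    }

    public static void main(String[] args) {
        // prints http://example.com/a%3Fb%3Dc%40d
        System.out.println(escapeAfterHost("http://example.com/a?b=c@d"));
    }
}
```

Note that java.net.URLEncoder.encode is meant for form data: applied to a
whole URL it would also encode ":" and "/", and it turns spaces into "+",
so it is not a drop-in replacement for the method above.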


 




-- 
Don't Grow Old, Grow Up... :-)


Re: encode special characters in url

2013-04-10 Thread Jun Zhou
Thanks Rajani!

Actually, the problem is special characters in the URL, not in the content.
Thanks anyway!


 



Re: encode special characters in url

2013-04-10 Thread Jun Zhou
Thanks, Feng.

I thought this might be a common problem when using Nutch. I'll try your
suggestions.


Best Regards,
Jun Zhou
University of Southern California
http://www-scf.usc.edu/~junzhou





encode special characters in url

2013-04-06 Thread Jun Zhou
Hi all,

I'm using Nutch 1.6 to crawl a web site which has lots of special
characters in its URLs, like ?,=@ etc.  For each character, I can add a
regex to regex-normalize.xml to change it into its percent-encoded form.

My question is: is there an easier way to do this? For example, a URL-encode
method that encodes all the special characters, rather than adding regexes
one by one?

Thanks!