Re: encode special characters in url
Hi,

I think this thread should be useful: http://lucene.472066.n3.nabble.com/Parsed-content-in-form-of-special-characters-td4047239.html

Thanks,
Regards,
Rajani Maski

On Sun, Apr 7, 2013 at 4:56 AM, Jun Zhou <zhouju...@gmail.com> wrote: [...]
Re: encode special characters in url
Hi Jun,

Can you use one regex pattern to match all the special cases? Or maybe you can extend your own URL normalizer plugin to fit your requirement.

On Wed, Apr 10, 2013 at 8:17 PM, Rajani Maski <rajinima...@gmail.com> wrote: [...]

--
Don't Grow Old, Grow Up... :-)
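[Editor's note: a minimal, hedged sketch of the custom-normalizer idea suggested above. Nutch's actual plugin API (the `URLNormalizer` extension point's `normalize(urlString, scope)` method) requires the Nutch jars, so this standalone example shows only the core encoding logic such a plugin could wrap, using nothing but the JDK. The class and method names (`QueryEncodeSketch`, `encodeQuery`) are illustrative, not part of Nutch.]

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class QueryEncodeSketch {

    // Percent-encode the query portion of a URL, leaving the scheme, host,
    // path, and the ?, =, & delimiters intact. A custom Nutch URL normalizer
    // plugin could apply logic like this inside its normalize() method
    // instead of one regex rule per character.
    //
    // Caveat: URLEncoder implements application/x-www-form-urlencoded,
    // so a space becomes '+' rather than '%20', and an already-encoded
    // "%40" would be double-encoded to "%2540".
    public static String encodeQuery(String url)
            throws UnsupportedEncodingException {
        int q = url.indexOf('?');
        if (q < 0) {
            return url; // no query string, nothing to do
        }
        String base = url.substring(0, q);
        String query = url.substring(q + 1);
        StringBuilder sb = new StringBuilder(base).append('?');
        String[] pairs = query.split("&", -1);
        for (int i = 0; i < pairs.length; i++) {
            if (i > 0) sb.append('&');
            int eq = pairs[i].indexOf('=');
            if (eq < 0) {
                // bare token with no '=', encode it whole
                sb.append(URLEncoder.encode(pairs[i], "UTF-8"));
            } else {
                // encode key and value separately, keep the '=' delimiter
                sb.append(URLEncoder.encode(pairs[i].substring(0, eq), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(pairs[i].substring(eq + 1), "UTF-8"));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(encodeQuery("http://example.com/page?user=a@b.com&tag=c d"));
        // prints: http://example.com/page?user=a%40b.com&tag=c+d
    }
}
```

This keeps the URL's structural delimiters usable for downstream parsing while normalizing only the data inside the query, which a single blanket regex cannot do because a regex substitution cannot compute the `%XX` hex code for an arbitrary matched character.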
Re: encode special characters in url
Thanks, Rajani! Actually the problem is special characters in the URL, not in the content. Thanks anyway!

On Wed, Apr 10, 2013 at 5:17 AM, Rajani Maski <rajinima...@gmail.com> wrote: [...]
Re: encode special characters in url
Thanks, Feng. I thought this might be a common problem when using Nutch. I'll try your suggestions.

Best Regards,
Jun Zhou
University of Southern California
http://www-scf.usc.edu/~junzhou

On Wed, Apr 10, 2013 at 7:11 AM, feng lu <amuseme...@gmail.com> wrote: [...]
encode special characters in url
Hi all,

I'm using Nutch 1.6 to crawl a web site that has lots of special characters in its URLs, like ?, =, @, etc. For each character, I can add a regex rule in regex-normalize.xml to change it into percent encoding. My question is: is there an easier way to do this, like a url-encode method that encodes all the special characters, rather than adding regexes one by one?

Thanks!
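[Editor's note: for readers unfamiliar with the per-character approach described above, a rule in Nutch's `conf/regex-normalize.xml` (read by the `urlnormalizer-regex` plugin) looks roughly like the sketch below; the `@` example is illustrative. Note that characters such as `?`, `=`, and `&` are URL delimiters, so blanket rules for them would corrupt well-formed URLs, which is why one rule per character in a safe position is needed and why a programmatic encoder is attractive.]

```xml
<?xml version="1.0"?>
<regex-normalize>
  <!-- Illustrative rule: percent-encode a literal '@' anywhere in the URL.
       One such <regex> block is needed per special character. -->
  <regex>
    <pattern>@</pattern>
    <substitution>%40</substitution>
  </regex>
</regex-normalize>
```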