Hi,

if testing trunk via

  echo 'http://www.xyz.com/...' \
    | bin/nutch org.apache.nutch.net.URLNormalizerChecker

the jsessionid is properly removed. The resulting URL is:

  http://www.xyz.com/site/hosa-technology-3-5mm-trs-to-1-4-trs-adapter/8561415.p?id=1208561582654&skuId=8561415&st=categoryid$abcat0207000&cp=1&lp=8

Although I didn't test it, Nutch 1.7 should behave identically in this respect.

Best,
Sebastian

2014-09-22 22:29 GMT+02:00 S.L <[email protected]>:

> Sebastian, I am using Nutch 1.7, and a specific example in this case is
> this:
>
> http://www.xyz.com/site/hosa-technology-3-5mm-trs-to-1-4-trs-adapter/8561415.p;jsessionid=7936CA95263E9C78B735E5EBE827BDDA.bbolsp-app04-163?id=1208561582654&skuId=8561415&st=categoryid$abcat0207000&cp=1&lp=8
>
> On Mon, Sep 22, 2014 at 4:12 PM, Sebastian Nagel <[email protected]>
> wrote:
>
> > > Looks like this should have been removed, is the regex in
> > > regex-normalize.xml correct?
> >
> > Yes. It removes various session ids, see
> > src/plugin/urlnormalizer-regex/sample/regex-normalize-default.test
> >
> > Can you give a concrete example of a session id not removed?
> > Which Nutch version is used?
> >
> > Thanks,
> > Sebastian
> >
> > On 09/22/2014 06:43 AM, S.L wrote:
> > > The jsessionid on the crawled URL is not being removed, even though
> > > a regex URL normalizer is being specified. Can someone please let me
> > > know the issue here?
> > >
> > > I have already set the following:
> > >
> > > *nutch-site.xml*
> > >
> > > <property>
> > >   <name>plugin.includes</name>
> > >   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-optic|urlnormalizer-(pass|regex|basic)
> > >   </value>
> > > </property>
> > >
> > > <property>
> > >   <name>urlnormalizer.order</name>
> > >   <value>org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
> > >     org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
> > >   </value>
> > >   <description>Order in which normalizers will run. If any of these
> > >     isn't activated it will be silently skipped. If other normalizers
> > >     not on the list are activated, they will run in random order
> > >     after the ones specified here are run.
> > >   </description>
> > > </property>
> > >
> > > <property>
> > >   <name>urlnormalizer.regex.file</name>
> > >   <value>regex-normalize.xml</value>
> > >   <description>Name of the config file used by the RegexUrlNormalizer
> > >     class.
> > >   </description>
> > > </property>
> > >
> > > *And the regex-normalize.xml file has this entry*
> > >
> > > <regex>
> > >   <pattern>(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&|#|$)</pattern>
> > >   <substitution>$4</substitution>
> > > </regex>
> > >
> > > Looks like this should have been removed, is the regex in
> > > regex-normalize.xml correct?
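
For anyone who wants to check the quoted regex rule outside of Nutch, here is a minimal sketch in Python. It is not how Nutch applies the rule (Nutch uses Java's regex engine, and Python's `re` module is assumed here to treat this particular pattern the same way); the function name `strip_session_id` and the constant names are made up for illustration. The pattern and the example URL are copied from the thread, and `\4` plays the role of the `$4` substitution.

```python
import re

# Pattern copied from the regex-normalize.xml entry quoted in the thread.
# Assumption: Python's `re` handles this pattern like Java's regex engine does.
SESSION_ID_PATTERN = re.compile(
    r"(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&|#|$)"
)

# Example URL from the thread, including the jsessionid path parameter.
SAMPLE_URL = (
    "http://www.xyz.com/site/hosa-technology-3-5mm-trs-to-1-4-trs-adapter/"
    "8561415.p;jsessionid=7936CA95263E9C78B735E5EBE827BDDA.bbolsp-app04-163"
    "?id=1208561582654&skuId=8561415&st=categoryid$abcat0207000&cp=1&lp=8"
)


def strip_session_id(url: str) -> str:
    """Hypothetical helper: drop the session id, keep the delimiter after it.

    Group 4 is the character that ends the session id (?, &, # or end of
    string), so replacing the whole match with group 4 mirrors "$4".
    """
    return SESSION_ID_PATTERN.sub(r"\4", url)


if __name__ == "__main__":
    print(strip_session_id(SAMPLE_URL))
```

If the printed URL no longer contains `;jsessionid=...`, the pattern itself is fine, and the place to look is whether the normalizer is actually being applied (plugin activation, which regex-normalize.xml is picked up from the classpath, and so on).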

