Hi,

When testing trunk via

  echo 'http://www.xyz.com/...' \
     | bin/nutch org.apache.nutch.net.URLNormalizerChecker

the jsessionid is properly removed. The resulting URL is:

http://www.xyz.com/site/hosa-technology-3-5mm-trs-to-1-4-trs-adapter/8561415.p?id=1208561582654&skuId=8561415&st=categoryid$abcat0207000&cp=1&lp=8

Although I didn't test it, Nutch 1.7 should behave identically in this respect.
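
For a quick check outside Nutch, the rule quoted further down in this thread
can be tried with plain java.util.regex. This is only a sketch, not the
RegexURLNormalizer code itself: the class name SessionIdRegexDemo and the
method stripSessionId are made up for illustration, and the pattern is the one
from the quoted regex-normalize.xml entry with the XML escape &amp; written as
a literal &.

```java
import java.util.regex.Pattern;

// Hypothetical standalone demo; it does not use any Nutch classes.
class SessionIdRegexDemo {

    // Pattern from the quoted regex-normalize.xml entry (&amp; unescaped to &).
    static final Pattern SESSION_ID = Pattern.compile(
        "(?i)(;?\\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Replace each match with group 4, i.e. keep only the delimiter
    // (?, &, # or end of input) that terminated the session id.
    static String stripSessionId(String url) {
        return SESSION_ID.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        String url = "http://www.xyz.com/site/hosa-technology-3-5mm-trs-to-1-4-trs-adapter/"
            + "8561415.p;jsessionid=7936CA95263E9C78B735E5EBE827BDDA.bbolsp-app04-163"
            + "?id=1208561582654&skuId=8561415&st=categoryid$abcat0207000&cp=1&lp=8";
        System.out.println(stripSessionId(url));
    }
}
```

If the pattern applies here but the URL still comes out unchanged in a crawl,
the rule itself is probably not the problem, and it is worth checking which
regex-normalize.xml is actually picked up at runtime.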

Best,
Sebastian

2014-09-22 22:29 GMT+02:00 S.L <[email protected]>:

> Sebastian, I am using Nutch 1.7, and a specific example in this case is
> this:
>
>
> http://www.xyz.com/site/hosa-technology-3-5mm-trs-to-1-4-trs-adapter/8561415.p;jsessionid=7936CA95263E9C78B735E5EBE827BDDA.bbolsp-app04-163?id=1208561582654&skuId=8561415&st=categoryid$abcat0207000&cp=1&lp=8
>
>
>
> On Mon, Sep 22, 2014 at 4:12 PM, Sebastian Nagel <[email protected]> wrote:
>
> > > Looks like this should have been removed. Is the regex in
> > > regex-normalize.xml correct?
> > >
> >
> > Yes. It removes various session ids, see
> > src/plugin/urlnormalizer-regex/sample/regex-normalize-default.test
> >
> > Can you give a concrete example of a session id not removed?
> > Which Nutch version is used?
> >
> > Thanks,
> > Sebastian
> >
> > On 09/22/2014 06:43 AM, S.L wrote:
> > > The jsessionid on the crawled URL is not being removed, even though a regex
> > > URL normalizer is being specified. Can someone please let me know what the
> > > issue is here?
> > >
> > > I have already set the following
> > >
> > > *nutch-site.xml*
> > >
> > >     <property>
> > >         <name>plugin.includes</name>
> > >         <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-optic|urlnormalizer-(pass|regex|basic)
> > >         </value>
> > >     </property>
> > >
> > >     <property>
> > >         <name>urlnormalizer.order</name>
> > >         <value>org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
> > >             org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
> > >         </value>
> > >         <description>Order in which normalizers will run. If any of these
> > >             isn't activated it will be silently skipped. If other normalizers
> > >             not on the list are activated, they will run in random order after
> > >             the ones specified here are run.
> > >         </description>
> > >     </property>
> > >
> > >
> > >     <property>
> > >         <name>urlnormalizer.regex.file</name>
> > >         <value>regex-normalize.xml</value>
> > >         <description>Name of the config file used by the RegexUrlNormalizer
> > >             class.
> > >         </description>
> > >     </property>
> > >
> > >
> > > *And the regex-normalize.xml file has this entry*
> > > <regex>
> > >   <pattern>(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
> > >   <substitution>$4</substitution>
> > > </regex>
> > >
> > >
> > > Looks like this should have been removed. Is the regex in
> > > regex-normalize.xml correct?
> > >
> >
> >
>
