The jsessionid on the crawled URL is not being removed, even though a regex
URL normalizer is specified. Can someone please let me know what the
issue is here?
I have already set the following:
*nutch-site.xml*
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-optic|urlnormalizer-(pass|regex|basic)
</value>
</property>
<property>
<name>urlnormalizer.order</name>
<value>org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
</value>
<description>Order in which normalizers will run. If any of these isn't
activated it will be silently skipped. If other normalizers not on the
list are activated, they will run in random order after the ones
specified here are run.
</description>
</property>
<property>
<name>urlnormalizer.regex.file</name>
<value>regex-normalize.xml</value>
<description>Name of the config file used by the RegexUrlNormalizer class.
</description>
</property>
*And the regex-normalize.xml file has this entry*
<regex>
<pattern>(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
<substitution>$4</substitution>
</regex>
It looks like the jsessionid should have been removed. Is the regex in
regex-normalize.xml correct?
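
One way to check the pattern in isolation from Nutch is a small standalone Java program. The sketch below only exercises the regex with `java.util.regex`; it does not go through the RegexURLNormalizer plugin or read regex-normalize.xml. The class name `SessionIdRegexCheck`, the `normalize` helper and the example.com URLs are made up for illustration.

```java
import java.util.regex.Pattern;

public class SessionIdRegexCheck {
    // Same pattern as in the regex-normalize.xml entry above, written as a
    // Java string literal (backslashes doubled, '&' as a plain character).
    static final Pattern SESSION_ID = Pattern.compile(
        "(?i)(;?\\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Hypothetical helper: applies the same substitution ($4) as the config entry.
    static String normalize(String url) {
        return SESSION_ID.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        // Hypothetical sample URLs, not taken from the actual crawl.
        // prints http://example.com/page.jsp?x=1
        System.out.println(normalize("http://example.com/page.jsp;jsessionid=ABC123?x=1"));
        // prints http://example.com/page.jsp
        System.out.println(normalize("http://example.com/page.jsp;jsessionid=ABC123"));
    }
}
```

If the pattern strips the session id here but not during the crawl, that would suggest the problem lies in how the normalizer is loaded or configured rather than in the regex itself.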