The jsessionid on the crawled URL is not being removed, even though a regex
URL normalizer is specified. Can someone please let me know what the
issue is here?

I have already set the following:

*nutch-site.xml*

    <property>
        <name>plugin.includes</name>

<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-optic|urlnormalizer-(pass|regex|basic)
        </value>
    </property>

    <property>
        <name>urlnormalizer.order</name>
        <value>org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
        </value>
        <description>Order in which normalizers will run. If any of these
            isn't activated it will be silently skipped. If other normalizers
            not on the list are activated, they will run in random order
            after the ones specified here are run.
        </description>
    </property>


    <property>
        <name>urlnormalizer.regex.file</name>
        <value>regex-normalize.xml</value>
        <description>Name of the config file used by the RegexUrlNormalizer
            class.
        </description>
    </property>


*And the regex-normalize.xml file has this entry*

    <regex>
        <pattern>(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
        <substitution>$4</substitution>
    </regex>


It looks like the jsessionid should have been removed. Is the regex in
regex-normalize.xml correct?
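
For reference, the pattern can be checked in isolation with a small standalone Java sketch. This is only a rough check under some assumptions: the sample URLs are made up, the class and method names are hypothetical, and it uses plain java.util.regex rather than Nutch's RegexURLNormalizer, so it says nothing about whether the normalizer plugin is actually loaded and run.

```java
import java.util.regex.Pattern;

// Standalone check of the pattern only; this does not load Nutch or its plugins.
class SessionIdRegexCheck {

    // Same pattern as in regex-normalize.xml, with the XML entity &amp; written as a plain '&'.
    static final Pattern SESSION_ID = Pattern.compile(
            "(?i)(;?\\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Applies the same substitution ($4) as the <substitution> element.
    static String stripSessionId(String url) {
        return SESSION_ID.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        // Hypothetical sample URLs, not taken from the actual crawl.
        System.out.println(stripSessionId("http://example.com/page.jsp;jsessionid=ABC123?x=1"));
        System.out.println(stripSessionId("http://example.com/page.jsp;jsessionid=ABC123"));
    }
}
```

On these two made-up inputs the sketch prints `http://example.com/page.jsp?x=1` and `http://example.com/page.jsp`, so the pattern itself appears to strip a `;jsessionid=...` segment when it is applied.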
