Yep, that did not work, although the crawl process displays
"URL normalizing: true".
Also, bin/nutch plugin ... does not work!
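
To rule out the expression itself, the rule quoted below can be checked outside of Nutch. This is only a minimal sketch, assuming plain java.util.regex matching behaves like the normalizer for this rule; the class RuleCheck, the method apply and the example.com URLs are made up for illustration and are not part of Nutch.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check, not a Nutch class.
class RuleCheck {
    // Pattern from the rule below, with the XML entity &amp; written as a plain '&'.
    static final Pattern RULE = Pattern.compile(
        "(\\?|&)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\\?|&|#|$)");

    // Same substitution as in the rule: keep the leading and trailing delimiters.
    static String apply(String url) {
        return RULE.matcher(url).replaceAll("$1$5");
    }

    public static void main(String[] args) {
        // Made-up sample URLs, one with the parameter in the middle, one with it last.
        System.out.println(apply("http://www.example.com/page?id=1&IFLBSERVERID=abc123&lang=en"));
        System.out.println(apply("http://www.example.com/page?IFLBSERVERID=abc123"));
    }
}
```

With these inputs the parameter is removed in both cases. Because the substitution keeps both delimiters, the first URL comes out as http://www.example.com/page?id=1&&lang=en, so a doubled "&" is left for a later rule to clean up.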

On Thu, Jun 24, 2010 at 3:06 PM, Julien Nioche <
[email protected]> wrote:

> tried ant clean job?
>
>
>> I'm using Nutch 1.1 and starting an intranet crawl (jobtracker is local).
>>
>> When executing bin/nutch plugin ... I'm getting a "Plugin
>> 'urlnormalizer-regex' not present or inactive." error, although
>> conf/nutch-site.xml contains the property plugin.includes including
>> urlnormalizer-regex.
>>
>> Starting the RegexURLNormalizer from within Eclipse is fine and it is
>> doing its job.
>>
>> Regards
>>
>> Hannes
>>
>>
>> On Thu, Jun 24, 2010 at 12:46 PM, Julien Nioche <
>> [email protected]> wrote:
>>
>>> Hi,
>>>
>>> Have you tried using :
>>> *./nutch plugin urlnormalizer-regex
>>> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
>>> http://www.myinputurl.com*
>>> that should help finding where the problem is coming from.
>>>
>>> Are you running in distributed mode? Did you generate a new job file?
>>>
>>> J.
>>>
>>>
>>> On 24 June 2010 11:18, Hannes Carl Meyer <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm trying to strip a parameter from URLs using the RegexURLNormalizer.
>>>> I added this to my nutch-site.xml:
>>>>
>>>>    <property>
>>>>        <name>urlnormalizer.order</name>
>>>>        <value>org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
>>>>    </property>
>>>>
>>>>    <property>
>>>>        <name>urlnormalizer.regex.file</name>
>>>>        <value>regex-normalize.xml</value>
>>>>    </property>
>>>>
>>>> And defined this expression rule:
>>>>
>>>> <regex>
>>>>  <pattern>(\?|&amp;)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\?|&amp;|#|$)</pattern>
>>>>  <substitution>$1$5</substitution>
>>>> </regex>
>>>>
>>>> (to strip the parameter IFLBSERVERID from the URL)
>>>>
>>>> The indexed documents still contain the parameter, and imho the
>>>> RegexURLNormalizer does not work. Is it related to
>>>> https://issues.apache.org/jira/browse/NUTCH-706 ?
>>>>
>>>> Thanks and regards
>>>>
>>>> Hannes
>>>>
>>>> --
>>>>
>>>> https://www.xing.com/profile/HannesCarl_Meyer
>>>> http://de.linkedin.com/in/hannescarlmeyer
>>>> http://twitter.com/hannescarlmeyer
>>>>
>>>
>>>
>>>
>>> --
>>> DigitalPebble Ltd
>>>
>>> Open Source Solutions for Text Engineering
>>> http://www.digitalpebble.com
>>>
>>
>>
>>
>> --
>>
>> https://www.xing.com/profile/HannesCarl_Meyer
>> http://de.linkedin.com/in/hannescarlmeyer
>> http://twitter.com/hannescarlmeyer
>>
>
>
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com
>



-- 

https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer
http://twitter.com/hannescarlmeyer
