Just tried it in nutch-1.0 with the same kind of behavior:

hc.me...@server01:~/nutch-1.0> ./bin/nutch plugin urlnormalizer-regex org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer http://www.myinputurl.com
Plugin 'urlnormalizer-regex' not present or inactive.

(The plugin is present, and it is activated through the plugin.includes
property in nutch-site.xml.)
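
One thing that can be checked outside of Nutch: as far as I understand it, the plugin.includes value is used as a regular expression that has to match the whole plugin id. The sketch below is not Nutch code. The class name, the method name and both sample values are made up for illustration, and the full-match behaviour is my assumption. It only shows how such a test behaves for the id urlnormalizer-regex.

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {

    // Returns true if the whole plugin id matches the plugin.includes expression.
    // Assumption: the expression is applied as a full match against the id.
    public static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Hypothetical values, not copied from any real nutch-site.xml.
        String withNormalizer =
            "protocol-http|urlfilter-regex|parse-(text|html)|urlnormalizer-(pass|regex|basic)";
        String withoutNormalizer =
            "protocol-http|urlfilter-regex|parse-(text|html)";

        System.out.println(isIncluded(withNormalizer, "urlnormalizer-regex"));
        System.out.println(isIncluded(withoutNormalizer, "urlnormalizer-regex"));
    }
}
```

If the value that Nutch actually reads at runtime does not match the id, I would expect the plugin to be missing from the "Registered Plugins" list in the log, so it may be worth comparing the edited nutch-site.xml with the one that ends up on the classpath or in the job file.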

On Thu, Jun 24, 2010 at 5:45 PM, Julien Nioche <
[email protected]> wrote:

> The clue might be in: /~/apache-nutch-1.1-bin/plugins
> Regenerate the job file, then delete this directory. Check in the log
> file where it gets the plugins from.
>
>
> On 24 June 2010 16:11, Hannes Carl Meyer <[email protected]> wrote:
>
>> Nope, that changes nothing. Just checked out my log file:
>>
>> 2010-06-24 17:13:40,410 INFO  plugin.PluginRepository - Plugins: looking in: /~/apache-nutch-1.1-bin/plugins
>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository - Registered Plugins:
>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Site Query Filter (query-site)
>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Http / Https Protocol Plug-in (protocol-httpclient)
>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Basic Summarizer Plug-in (summary-basic)
>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Text Parse Plug-in (parse-text)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         XML Response Writer Plug-in (response-xml)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Tika Parser Plug-in (parse-tika)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Anchor Indexing Filter (index-anchor)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         JavaScript Parser (parse-js)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         URL Query Filter (query-url)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         JSON Response Writer Plug-in (response-json)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository - Registered Extension-Points:
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Field Filter (org.apache.nutch.indexer.field.FieldFilter)
>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Search Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)
>>
>> There is no RegexURLNormalizer being loaded...
>>
>>
>> On Thu, Jun 24, 2010 at 4:37 PM, Julien Nioche <
>> [email protected]> wrote:
>>
>>> OK. Since you are in distributed mode it should use the content of the
>>> job file. Try deleting ./build/plugins to see if this changes anything.
>>>
>>>
>>> On 24 June 2010 15:30, Hannes Carl Meyer <[email protected]> wrote:
>>>
>>>> Yep, that did not work, although the crawl process displays
>>>> "URL normalizing: true"...
>>>> Also, bin/nutch plugin ... does not work!
>>>>
>>>>
>>>> On Thu, Jun 24, 2010 at 3:06 PM, Julien Nioche <
>>>> [email protected]> wrote:
>>>>
>>>>> Have you tried ant clean job?
>>>>>
>>>>>
>>>>>> I'm using Nutch 1.1 and starting an intranet crawl (jobtracker is
>>>>>> local).
>>>>>>
>>>>>> When executing bin/nutch plugin ... I'm getting "Plugin
>>>>>> 'urlnormalizer-regex' not present or inactive.", although
>>>>>> conf/nutch-site.xml contains the property plugin.includes, which
>>>>>> includes urlnormalizer-regex.
>>>>>>
>>>>>> Starting the RegexURLNormalizer from within Eclipse is fine and it is
>>>>>> doing its job.
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Hannes
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 24, 2010 at 12:46 PM, Julien Nioche <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Have you tried using :
>>>>>>> *./nutch plugin urlnormalizer-regex
>>>>>>> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
>>>>>>> http://www.myinputurl.com*
>>>>>>> That should help find where the problem is coming from.
>>>>>>>
>>>>>>> Are you running in distributed mode? Did you generate a new job file?
>>>>>>>
>>>>>>> J.
>>>>>>>
>>>>>>>
>>>>>>> On 24 June 2010 11:18, Hannes Carl Meyer <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm trying to strip a parameter from URLs using the
>>>>>>>> RegexURLNormalizer. I added this to my nutch-site.xml:
>>>>>>>>
>>>>>>>>    <property>
>>>>>>>>        <name>urlnormalizer.order</name>
>>>>>>>>        <value>org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
>>>>>>>>    </property>
>>>>>>>>
>>>>>>>>    <property>
>>>>>>>>        <name>urlnormalizer.regex.file</name>
>>>>>>>>        <value>regex-normalize.xml</value>
>>>>>>>>    </property>
>>>>>>>>
>>>>>>>> And defined this expression rule:
>>>>>>>>
>>>>>>>> <regex>
>>>>>>>>  <pattern>(\?|&amp;)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\?|&amp;|#|$)</pattern>
>>>>>>>>  <substitution>$1$5</substitution>
>>>>>>>> </regex>
>>>>>>>>
>>>>>>>> (to strip the parameter IFLBSERVERID from the URL)
>>>>>>>>
>>>>>>>> The indexed documents still contain the parameter, so it looks to
>>>>>>>> me as if the RegexURLNormalizer does not work. Is it related to
>>>>>>>> https://issues.apache.org/jira/browse/NUTCH-706 ?
>>>>>>>>
>>>>>>>> Thanks and regards
>>>>>>>>
>>>>>>>> Hannes
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> https://www.xing.com/profile/HannesCarl_Meyer
>>>>>>>> http://de.linkedin.com/in/hannescarlmeyer
>>>>>>>> http://twitter.com/hannescarlmeyer
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> DigitalPebble Ltd
>>>>>>>
>>>>>>> Open Source Solutions for Text Engineering
>>>>>>> http://www.digitalpebble.com

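For the regex rule quoted in the original message (the one meant to strip IFLBSERVERID), the expression itself can be tried in plain Java, independent of the plugin loading problem. This is only a minimal sketch: the class name, the method name and the sample URL are invented, and it assumes that replacing every match with the rule's substitution is close enough to what the normalizer does.

```java
import java.util.regex.Pattern;

public class StripSessionParam {

    // Pattern from the regex-normalize.xml rule quoted in this thread, with
    // &amp; unescaped to & and backslashes doubled for a Java string literal.
    static final Pattern RULE = Pattern.compile(
        "(\\?|&)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\\?|&|#|$)");

    // Applies the rule's substitution ($1$5) to every match in the URL.
    public static String normalize(String url) {
        return RULE.matcher(url).replaceAll("$1$5");
    }

    public static void main(String[] args) {
        // Hypothetical URL; the path and parameter values are invented.
        System.out.println(
            normalize("http://www.myinputurl.com/page?IFLBSERVERID=abc123&foo=bar"));
    }
}
```

Because the substitution is $1$5, the delimiters on both sides of the parameter are kept, so the sample URL comes out as http://www.myinputurl.com/page?&foo=bar. That suggests the rule itself matches, which would be consistent with the plugin simply not being loaded.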