Nope, that changes nothing. I just checked my log file:

2010-06-24 17:13:40,410 INFO  plugin.PluginRepository - Plugins: looking in: /~/apache-nutch-1.1-bin/plugins
2010-06-24 17:13:41,439 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2010-06-24 17:13:41,439 INFO  plugin.PluginRepository - Registered Plugins:
2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Site Query Filter (query-site)
2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Http / Https Protocol Plug-in (protocol-httpclient)
2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Basic Summarizer Plug-in (summary-basic)
2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Text Parse Plug-in (parse-text)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         XML Response Writer Plug-in (response-xml)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Tika Parser Plug-in (parse-tika)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Anchor Indexing Filter (index-anchor)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         JavaScript Parser (parse-js)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         URL Query Filter (query-url)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         JSON Response Writer Plug-in (response-json)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository - Registered Extension-Points:
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Field Filter (org.apache.nutch.indexer.field.FieldFilter)
2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Search Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)

There is no RegexURLNormalizer being loaded: urlnormalizer-regex does not
appear among the registered plugins.
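
Independently of the plugin loading problem, the rule quoted below can be
checked with plain java.util.regex. This is only a sketch under assumptions:
the class name NormalizeRuleCheck, the normalize helper and the example.com
URL are made up for illustration, the pattern is the quoted one with &amp;
decoded to &, and the code does not go through Nutch's own RegexURLNormalizer.

```java
import java.util.regex.Pattern;

public class NormalizeRuleCheck {
    // Pattern from the rule quoted in this thread, with the XML entity
    // &amp; decoded to & (assumption: that is how the rule file is read).
    static final Pattern RULE = Pattern.compile(
        "(\\?|&)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\\?|&|#|$)");

    // Hypothetical helper: replaces every match with groups 1 and 5,
    // mirroring the <substitution>$1$5</substitution> of the rule.
    static String normalize(String url) {
        return RULE.matcher(url).replaceAll("$1$5");
    }

    public static void main(String[] args) {
        // Made-up URL for illustration, not taken from the crawl.
        String url = "http://www.example.com/page?IFLBSERVERID=abc123&foo=bar";
        System.out.println(normalize(url));
    }
}
```

For the made-up URL this prints http://www.example.com/page?&foo=bar, so the
parameter is removed by the expression itself, although a "?&" is left behind
because the substitution keeps both delimiters.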

On Thu, Jun 24, 2010 at 4:37 PM, Julien Nioche <
[email protected]> wrote:

> OK. Since you are in distributed mode it should use the content of the job
> file. Try deleting ./build/plugins to see if this changes anything
>
>
> On 24 June 2010 15:30, Hannes Carl Meyer <[email protected]> wrote:
>
>> Yep, that did not work, although it displays "URL normalizing: true" in the
>> crawl process...
>> Also, bin/nutch plugin ... does not work!
>>
>>
>> On Thu, Jun 24, 2010 at 3:06 PM, Julien Nioche <
>> [email protected]> wrote:
>>
>>> tried ant clean job?
>>>
>>>
>>>> I'm using Nutch 1.1 and starting an intranet crawl (jobtracker is
>>>> local).
>>>>
>>>> When executing bin/nutch plugin ... I'm getting a "Plugin
>>>> 'urlnormalizer-regex' not present or inactive.". conf/nutch-site.xml
>>>> contains the property plugin.includes including urlnormalizer-regex.
>>>>
>>>> Starting the RegexURLNormalizer from within Eclipse is fine and it is
>>>> doing its job.
>>>>
>>>> Regards
>>>>
>>>> Hannes
>>>>
>>>>
>>>> On Thu, Jun 24, 2010 at 12:46 PM, Julien Nioche <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Have you tried using :
>>>>> *./nutch plugin urlnormalizer-regex
>>>>> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
>>>>> http://www.myinputurl.com*
>>>>> that should help finding where the problem is coming from.
>>>>>
>>>>> Are you running in distributed mode? Did you generate a new job file?
>>>>>
>>>>> J.
>>>>>
>>>>>
>>>>> On 24 June 2010 11:18, Hannes Carl Meyer <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm trying to strip a parameter from URLs using the RegexURLNormalizer.
>>>>>> I added this to my nutch-site.xml:
>>>>>>
>>>>>>    <property>
>>>>>>        <name>urlnormalizer.order</name>
>>>>>>        <value>org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
>>>>>>    </property>
>>>>>>
>>>>>>    <property>
>>>>>>        <name>urlnormalizer.regex.file</name>
>>>>>>        <value>regex-normalize.xml</value>
>>>>>>    </property>
>>>>>>
>>>>>> And defined this expression rule:
>>>>>>
>>>>>> <regex>
>>>>>>  <pattern>(\?|&amp;)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\?|&amp;|#|$)</pattern>
>>>>>>  <substitution>$1$5</substitution>
>>>>>> </regex>
>>>>>>
>>>>>> (to strip the parameter IFLBSERVERID from the URL)
>>>>>>
>>>>>> The indexed documents still contain the parameter, and imho the
>>>>>> RegexURLNormalizer does not work. Is it related to
>>>>>> https://issues.apache.org/jira/browse/NUTCH-706 ?
>>>>>>
>>>>>> Thanks and regards
>>>>>>
>>>>>> Hannes
>>>>>>
>>>>>> --
>>>>>>
>>>>>> https://www.xing.com/profile/HannesCarl_Meyer
>>>>>> http://de.linkedin.com/in/hannescarlmeyer
>>>>>> http://twitter.com/hannescarlmeyer
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> DigitalPebble Ltd
>>>>>
>>>>> Open Source Solutions for Text Engineering
>>>>> http://www.digitalpebble.com
>>>>>
-- 

https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer
http://twitter.com/hannescarlmeyer
