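Hi Hannes,
I have identified your problem.
The plugin.includes property in your nutch-site.xml contains a newline after
urlnormalizer-(basic|pass|regex). The newline apparently becomes part of the
last alternative of the regular expression, so the pattern matching in
PluginRepository.java fails for the urlnormalizer plugins only: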
<property>
<name>plugin.includes</name>
<value>protocol-http|protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(basic|pass|regex)
</value>
</property>
If I remove the newline before </value>, it works.
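
To illustrate the effect, here is a minimal sketch. It assumes that plugin ids are matched in full against the plugin.includes value with java.util.regex and that the value is not trimmed first. The class and method names (PluginIncludesDemo, isIncluded) are made up for this example and the pattern is shortened; this is not the code from PluginRepository.java.

```java
import java.util.regex.Pattern;

public class PluginIncludesDemo {

    // Hypothetical helper: true if pluginId matches the whole includes regex.
    public static boolean isIncluded(String includes, String pluginId) {
        return Pattern.compile(includes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Shortened plugin.includes value with the trailing newline from the XML file.
        String withNewline =
            "protocol-http|urlfilter-regex|urlnormalizer-(basic|pass|regex)\n";
        // The same value with surrounding whitespace removed.
        String trimmed = withNewline.trim();

        // Earlier alternatives are unaffected by the trailing newline.
        System.out.println(isIncluded(withNewline, "urlfilter-regex"));     // true
        // The last alternative now requires a newline after the plugin id.
        System.out.println(isIncluded(withNewline, "urlnormalizer-regex")); // false
        // Without the newline the plugin id matches again.
        System.out.println(isIncluded(trimmed, "urlnormalizer-regex"));     // true
    }
}
```

This would also explain why all the other plugins show up in your log while none of the urlnormalizer plugins do.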
Regards,
Reinhard
Hannes Carl Meyer wrote:
> Just tried it in nutch-1.0 with the same kind of behavior:
>
> hc.me...@server01:~/nutch-1.0> ./bin/nutch plugin urlnormalizer-regex
> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
> http://www.myinputurl.com
> Plugin 'urlnormalizer-regex' not present or inactive.
>
> (it is present and it is active through the plugin.includes property in
> nutch-site.xml)
>
> On Thu, Jun 24, 2010 at 5:45 PM, Julien Nioche <
> [email protected]> wrote:
>
>
>> the clue might be in : /~/apache-nutch-1.1-bin/plugins
>> regenerate the job then delete this directory. Check where it gets the
>> plugins from in the log file
>>
>>
>> On 24 June 2010 16:11, Hannes Carl Meyer <[email protected]> wrote:
>>
>>
>>> Nope, that changes nothing. Just checked out my log file:
>>>
>>> 2010-06-24 17:13:40,410 INFO plugin.PluginRepository - Plugins: looking
>>> in: /~/apache-nutch-1.1-bin/plugins
>>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Plugin
>>> Auto-activation mode: [true]
>>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Registered
>>> Plugins:
>>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - the nutch
>>> core extension points (nutch-extensionpoints)
>>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Basic
>>> Query Filter (query-basic)
>>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Basic
>>> Indexing Filter (index-basic)
>>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Html Parse
>>> Plug-in (parse-html)
>>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Site Query
>>> Filter (query-site)
>>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Http /
>>> Https Protocol Plug-in (protocol-httpclient)
>>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Basic
>>> Summarizer Plug-in (summary-basic)
>>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - HTTP
>>> Framework (lib-http)
>>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Text Parse
>>> Plug-in (parse-text)
>>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Regex URL
>>> Filter (urlfilter-regex)
>>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Http
>>> Protocol Plug-in (protocol-http)
>>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - XML
>>> Response Writer Plug-in (response-xml)
>>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - OPIC
>>> Scoring Plug-in (scoring-opic)
>>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Tika
>>> Parser Plug-in (parse-tika)
>>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - CyberNeko
>>> HTML Parser (lib-nekohtml)
>>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Anchor
>>> Indexing Filter (index-anchor)
>>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - JavaScript
>>> Parser (parse-js)
>>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - URL Query
>>> Filter (query-url)
>>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Regex URL
>>> Filter Framework (lib-regex-filter)
>>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - JSON
>>> Response Writer Plug-in (response-json)
>>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Registered
>>> Extension-Points:
>>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Nutch
>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Nutch
>>> Protocol (org.apache.nutch.protocol.Protocol)
>>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Nutch
>>> Analysis (org.apache.nutch.analysis.NutchAnalyzer)
>>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch
>>> Field Filter (org.apache.nutch.indexer.field.FieldFilter)
>>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - HTML Parse
>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch
>>> Query Filter (org.apache.nutch.searcher.QueryFilter)
>>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch
>>> Search Results Response Writer
>>> (org.apache.nutch.searcher.response.ResponseWriter)
>>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch URL
>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch URL
>>> Filter (org.apache.nutch.net.URLFilter)
>>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch
>>> Online Search Results Clustering Plugin
>>> (org.apache.nutch.clustering.OnlineClusterer)
>>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch
>>> Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch
>>> Content Parser (org.apache.nutch.parse.Parser)
>>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch
>>> Scoring (org.apache.nutch.scoring.ScoringFilter)
>>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Ontology
>>> Model Loader (org.apache.nutch.ontology.Ontology)
>>>
>>> No RegexURLNormalizer is being loaded...
>>>
>>>
>>> On Thu, Jun 24, 2010 at 4:37 PM, Julien Nioche <
>>> [email protected]> wrote:
>>>
>>>
>>>> OK. Since you are in distributed mode it should use the content of the
>>>> job file. Try deleting ./build/plugins to see if this changes anything
>>>>
>>>>
>>>> On 24 June 2010 15:30, Hannes Carl Meyer <[email protected]> wrote:
>>>>
>>>>
>>>>> Yep, that did not work, although it displays "URL normalizing: true" in
>>>>> the crawl process...
>>>>> Also, bin/nutch plugin ... does not work!
>>>>>
>>>>>
>>>>> On Thu, Jun 24, 2010 at 3:06 PM, Julien Nioche <
>>>>> [email protected]> wrote:
>>>>>
>>>>>
>>>>>> Have you tried "ant clean job"?
>>>>>>
>>>>>>
>>>>>>
>>>>>>> I'm using Nutch 1.1 and starting an intranet crawl (jobtracker is
>>>>>>> local).
>>>>>>>
>>>>>>>
>>>>>>> When executing bin/nutch plugin ... I'm getting a "Plugin
>>>>>>> 'urlnormalizer-regex' not present or inactive.". conf/nutch-site.xml
>>>>>>> contains the property plugin.includes including urlnormalizer-regex.
>>>>>>>
>>>>>>>
>>>>>>> Starting the RegexURLNormalizer from within Eclipse is fine and it is
>>>>>>> doing its job.
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Hannes
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jun 24, 2010 at 12:46 PM, Julien Nioche <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Have you tried using :
>>>>>>>> *./nutch plugin urlnormalizer-regex
>>>>>>>> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
>>>>>>>> http://www.myinputurl.com*
>>>>>>>> that should help finding where the problem is coming from.
>>>>>>>>
>>>>>>>> Are you running in distributed mode? Did you generate a new job file?
>>>>>>>>
>>>>>>>> J.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 24 June 2010 11:18, Hannes Carl Meyer
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm trying to strip a parameter from URLs using the
>>>>>>>>> RegexURLNormalizer. I
>>>>>>>>> added this to my nutch-site.xml:
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>> <name>urlnormalizer.order</name>
>>>>>>>>> <value>org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>> <name>urlnormalizer.regex.file</name>
>>>>>>>>> <value>regex-normalize.xml</value>
>>>>>>>>> </property>
>>>>>>>>>
>>>>>>>>> And defined this expression rule:
>>>>>>>>>
>>>>>>>>> <regex>
>>>>>>>>> <pattern>(\?|&)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\?|&|#|$)</pattern>
>>>>>>>>> <substitution>$1$5</substitution>
>>>>>>>>> </regex>
>>>>>>>>>
>>>>>>>>> (to strip the parameter IFLBSERVERID from the URL)
>>>>>>>>>
>>>>>>>>> The indexed documents still contain the parameter, and imho the
>>>>>>>>> RegexURLNormalizer does not work. Is it related to
>>>>>>>>> https://issues.apache.org/jira/browse/NUTCH-706 ?
>>>>>>>>>
>>>>>>>>> Thanks and regards
>>>>>>>>>
>>>>>>>>> Hannes
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> https://www.xing.com/profile/HannesCarl_Meyer
>>>>>>>>> http://de.linkedin.com/in/hannescarlmeyer
>>>>>>>>> http://twitter.com/hannescarlmeyer
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> DigitalPebble Ltd
>>>>>>>>
>>>>>>>> Open Source Solutions for Text Engineering
>>>>>>>> http://www.digitalpebble.com