Nope, that changes nothing. Just checked out my log file:

2010-06-24 17:13:40,410 INFO plugin.PluginRepository - Plugins: looking in: /~/apache-nutch-1.1-bin/plugins
2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Registered Plugins:
2010-06-24 17:13:41,439 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Site Query Filter (query-site)
2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2010-06-24 17:13:41,439 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - XML Response Writer Plug-in (response-xml)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - URL Query Filter (query-url)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - JSON Response Writer Plug-in (response-json)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Registered Extension-Points:
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Field Filter (org.apache.nutch.indexer.field.FieldFilter)
2010-06-24 17:13:41,451 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Search Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)

There is no RegexURLNormalizer being loaded...

On Thu, Jun 24, 2010 at 4:37 PM, Julien Nioche <[email protected]> wrote:

> OK. Since you are in distributed mode it should use the content of the job
> file. Try deleting ./build/plugins to see if this changes anything
>
>
> On 24 June 2010 15:30, Hannes Carl Meyer <[email protected]> wrote:
>
>> Jep, did not work, although it displays: "URL normalizing: true" in the
>> crawl process...
>> Also bin/nutch plugin ... does not work!
>>
>>
>> On Thu, Jun 24, 2010 at 3:06 PM, Julien Nioche <[email protected]> wrote:
>>
>>> tried ant clean job?
>>>
>>>
>>>> I'm using Nutch 1.1 and starting an intranet crawl (jobtracker is
>>>> local).
>>>>
>>>> When executing bin/nutch plugin ... I'm getting a "Plugin
>>>> 'urlnormalizer-regex' not present or inactive.". conf/nutch-site.xml
>>>> contains the property plugin.includes including urlnormalizer-regex.
>>>>
>>>> Starting the RegexURLNormalizer from within Eclipse is fine and it is
>>>> doing its job.
>>>>
>>>> Regards
>>>>
>>>> Hannes
>>>>
>>>>
>>>> On Thu, Jun 24, 2010 at 12:46 PM, Julien Nioche <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Have you tried using :
>>>>> *./nutch plugin urlnormalizer-regex
>>>>> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
>>>>> http://www.myinputurl.com*
>>>>> that should help finding where the problem is coming from.
>>>>>
>>>>> Are you running in distributed mode? Did you generate a new job file?
>>>>>
>>>>> J.
>>>>>
>>>>>
>>>>> On 24 June 2010 11:18, Hannes Carl Meyer <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm trying to strip a parameter from URLs using the
>>>>>> RegexURLNormalizer. I added this to my nutch-site.xml:
>>>>>>
>>>>>> <property>
>>>>>>   <name>urlnormalizer.order</name>
>>>>>>   <value>org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
>>>>>> </property>
>>>>>>
>>>>>> <property>
>>>>>>   <name>urlnormalizer.regex.file</name>
>>>>>>   <value>regex-normalize.xml</value>
>>>>>> </property>
>>>>>>
>>>>>> And defined this expression rule:
>>>>>>
>>>>>> <regex>
>>>>>>   <pattern>(\?|&)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\?|&|#|$)</pattern>
>>>>>>   <substitution>$1$5</substitution>
>>>>>> </regex>
>>>>>>
>>>>>> (to strip the parameter IFLBSERVERID from the URL)
>>>>>>
>>>>>> The indexed documents are still containing the parameter and imho the
>>>>>> RegexURLNormalizer does not work. Is it something with:
>>>>>> https://issues.apache.org/jira/browse/NUTCH-706 ?
>>>>>>
>>>>>> Thanks and regards
>>>>>>
>>>>>> Hannes
>>>>>>
>>>>>> --
>>>>>>
>>>>>> https://www.xing.com/profile/HannesCarl_Meyer
>>>>>> http://de.linkedin.com/in/hannescarlmeyer
>>>>>> http://twitter.com/hannescarlmeyer
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> DigitalPebble Ltd
>>>>>
>>>>> Open Source Solutions for Text Engineering
>>>>> http://www.digitalpebble.com
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> https://www.xing.com/profile/HannesCarl_Meyer
>>>> http://de.linkedin.com/in/hannescarlmeyer
>>>> http://twitter.com/hannescarlmeyer
>>>>
>>>
>>>
>>>
>>> --
>>> DigitalPebble Ltd
>>>
>>> Open Source Solutions for Text Engineering
>>> http://www.digitalpebble.com
>>>
>>
>>
>>
>> --
>>
>> https://www.xing.com/profile/HannesCarl_Meyer
>> http://de.linkedin.com/in/hannescarlmeyer
>> http://twitter.com/hannescarlmeyer
>>
>
>
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com
>

--

https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer
http://twitter.com/hannescarlmeyer
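
To separate "the rule is wrong" from "the plugin is never loaded", the expression rule quoted in the thread can be checked outside Nutch. The following is only a minimal sketch: it assumes the normalizer applies the pattern with plain java.util.regex replace-all semantics, and the class name StripSessionParam, the method normalize and the example URLs are made up for illustration and are not part of Nutch.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check; not Nutch code.
class StripSessionParam {

    // Pattern and substitution copied from the rule quoted in the thread.
    private static final Pattern RULE = Pattern.compile(
        "(\\?|&)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\\?|&|#|$)");
    private static final String SUBSTITUTION = "$1$5";

    // Assumption: one rule, applied to the whole URL with replaceAll.
    static String normalize(String url) {
        return RULE.matcher(url).replaceAll(SUBSTITUTION);
    }

    public static void main(String[] args) {
        // Parameter in the middle: the separators on both sides are kept,
        // so this prints http://www.example.com/page?id=1&&lang=en
        System.out.println(normalize(
            "http://www.example.com/page?id=1&IFLBSERVERID=srv42&lang=en"));

        // Parameter as the only query argument:
        // prints http://www.example.com/page?
        System.out.println(normalize(
            "http://www.example.com/page?IFLBSERVERID=srv42"));
    }
}
```

If the pattern behaves like this in isolation, the rule itself removes the parameter (leaving a doubled separator that a later cleanup rule would have to collapse), which would point back at the plugin not being registered rather than at the expression.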

