Julien, thanks for all your help. But: if I delete ./plugins AND ./build/plugins, it tries to load the plugins from nutch-1.1.job and fails. Maybe I'm just using a broken nutch-1.1 build, going to check on 1.0 now...

On Thu, Jun 24, 2010 at 5:45 PM, Julien Nioche < [email protected]> wrote:

> the clue might be in : /~/apache-nutch-1.1-bin/plugins
> regenerate the job then delete this directory. Check where it gets the
> plugins from in the log file
>
> On 24 June 2010 16:11, Hannes Carl Meyer <[email protected]> wrote:
>
>> Nope, that changes nothing. Just checked my log file:
>>
>> 2010-06-24 17:13:40,410 INFO plugin.PluginRepository - Plugins: looking in: /~/apache-nutch-1.1-bin/plugins
>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Registered Plugins:
>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Site Query Filter (query-site)
>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - HTTP Framework (lib-http)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - XML Response Writer Plug-in (response-xml)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - URL Query Filter (query-url)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - JSON Response Writer Plug-in (response-json)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Registered Extension-Points:
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Field Filter (org.apache.nutch.indexer.field.FieldFilter)
>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Search Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
>>
>> There is no RegexURLNormalizer being loaded...
>>
>> On Thu, Jun 24, 2010 at 4:37 PM, Julien Nioche < [email protected]> wrote:
>>
>>> OK. Since you are in distributed mode it should use the content of the
>>> job file. Try deleting ./build/plugins to see if this changes anything
>>>
>>> On 24 June 2010 15:30, Hannes Carl Meyer <[email protected]> wrote:
>>>
>>>> Yep, did not work, although it displays "URL normalizing: true" in the
>>>> crawl process...
>>>> Also bin/nutch plugin ... does not work!
>>>>
>>>> On Thu, Jun 24, 2010 at 3:06 PM, Julien Nioche < [email protected]> wrote:
>>>>
>>>>> tried ant clean job?
>>>>>
>>>>>> I'm using Nutch 1.1 and starting an intranet crawl (jobtracker is
>>>>>> local).
>>>>>>
>>>>>> When executing bin/nutch plugin ... I'm getting a "Plugin
>>>>>> 'urlnormalizer-regex' not present or inactive.". conf/nutch-site.xml
>>>>>> contains the property plugin.includes including urlnormalizer-regex.
>>>>>>
>>>>>> Starting the RegexURLNormalizer from within Eclipse is fine and it is
>>>>>> doing its job.
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Hannes
>>>>>>
>>>>>> On Thu, Jun 24, 2010 at 12:46 PM, Julien Nioche < [email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Have you tried using :
>>>>>>> *./nutch plugin urlnormalizer-regex
>>>>>>> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
>>>>>>> http://www.myinputurl.com*
>>>>>>> that should help finding where the problem is coming from.
>>>>>>>
>>>>>>> Are you running in distributed mode? Did you generate a new job file?
>>>>>>>
>>>>>>> J.
>>>>>>>
>>>>>>> On 24 June 2010 11:18, Hannes Carl Meyer <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm trying to strip a parameter from URLs using the
>>>>>>>> RegexURLNormalizer. I added this to my nutch-site.xml:
>>>>>>>>
>>>>>>>> <property>
>>>>>>>>   <name>urlnormalizer.order</name>
>>>>>>>>   <value>org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> <property>
>>>>>>>>   <name>urlnormalizer.regex.file</name>
>>>>>>>>   <value>regex-normalize.xml</value>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> And defined this expression rule:
>>>>>>>>
>>>>>>>> <regex>
>>>>>>>>   <pattern>(\?|&)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\?|&|#|$)</pattern>
>>>>>>>>   <substitution>$1$5</substitution>
>>>>>>>> </regex>
>>>>>>>>
>>>>>>>> (to strip the parameter IFLBSERVERID from the URL)
>>>>>>>>
>>>>>>>> The indexed documents still contain the parameter, and imho the
>>>>>>>> RegexURLNormalizer does not work. Is it something to do with
>>>>>>>> https://issues.apache.org/jira/browse/NUTCH-706 ?
>>>>>>>>
>>>>>>>> Thanks and regards
>>>>>>>>
>>>>>>>> Hannes
>>>>>>>>
>>>>>>>> --
>>>>>>>> https://www.xing.com/profile/HannesCarl_Meyer
>>>>>>>> http://de.linkedin.com/in/hannescarlmeyer
>>>>>>>> http://twitter.com/hannescarlmeyer
>>>>>>>
>>>>>>> --
>>>>>>> DigitalPebble Ltd
>>>>>>>
>>>>>>> Open Source Solutions for Text Engineering
>>>>>>> http://www.digitalpebble.com

--
https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer
http://twitter.com/hannescarlmeyer
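
To separate "the rule is wrong" from "the plugin is not loaded", the regex rule quoted in the original message can be tried outside Nutch with plain java.util.regex. This is only a rough sketch: the class name, the apply helper and the sample URLs are made up for illustration, and it assumes the normalizer applies a rule with replaceAll. It does not touch any Nutch classes.

```java
import java.util.regex.Pattern;

public class NormalizeRuleSketch {
    // Pattern and substitution copied from the <regex> rule quoted in the thread.
    private static final Pattern RULE = Pattern.compile(
        "(\\?|&)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\\?|&|#|$)");
    private static final String SUBSTITUTION = "$1$5";

    // Hypothetical helper: applies this one rule via replaceAll,
    // which is assumed (not verified here) to mirror the normalizer.
    static String apply(String url) {
        return RULE.matcher(url).replaceAll(SUBSTITUTION);
    }

    public static void main(String[] args) {
        // Sample URLs are invented; only the parameter name comes from the thread.
        System.out.println(apply("http://intranet.example/page?id=5&IFLBSERVERID=abc123"));
        // prints http://intranet.example/page?id=5&
        System.out.println(apply("http://intranet.example/page?IFLBSERVERID=abc123&id=5"));
        // prints http://intranet.example/page?&id=5
    }
}
```

If the parameter disappears here but not in the crawl, the rule itself is probably fine and the plugin loading is the more likely culprit. Note that this single rule leaves a dangling "&" or "?&" behind, so a second cleanup rule would still be needed for fully tidy URLs.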

