Awesome... Thank you very, very much :-)

On Thu, Jun 24, 2010 at 6:55 PM, reinhard schwab <[email protected]> wrote:

> hi hannes,
>
> i have identified your problem.
> your nutch-site.xml plugin.includes property contains a newline after
> urlnormalizer-(basic|pass|regex), which breaks pattern matching in
> PluginRepository.java.
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(basic|pass|regex)
>   </value>
> </property>
>
> if i remove the newline before </value>, it is ok.
>
> regards
> reinhard
>
> Hannes Carl Meyer wrote:
> > Just tried it in nutch-1.0 with the same kind of behavior:
> >
> > hc.me...@server01:~/nutch-1.0> ./bin/nutch plugin urlnormalizer-regex org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer http://www.myinputurl.com
> > Plugin 'urlnormalizer-regex' not present or inactive.
> >
> > (it is present and it is active through the plugin.includes property in
> > nutch-site.xml)
> >
> > On Thu, Jun 24, 2010 at 5:45 PM, Julien Nioche <[email protected]> wrote:
> >
> >> the clue might be in : /~/apache-nutch-1.1-bin/plugins
> >> regenerate the job then delete this directory. Check where it gets the
> >> plugins from in the log file
> >>
> >> On 24 June 2010 16:11, Hannes Carl Meyer <[email protected]> wrote:
> >>
> >>> Nope, that changes nothing.
> >>> Just checked out my log file:
> >>>
> >>> 2010-06-24 17:13:40,410 INFO plugin.PluginRepository - Plugins: looking in: /~/apache-nutch-1.1-bin/plugins
> >>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
> >>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Registered Plugins:
> >>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
> >>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
> >>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
> >>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
> >>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Site Query Filter (query-site)
> >>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
> >>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
> >>> 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - HTTP Framework (lib-http)
> >>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
> >>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
> >>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
> >>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - XML Response Writer Plug-in (response-xml)
> >>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
> >>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
> >>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
> >>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
> >>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
> >>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - URL Query Filter (query-url)
> >>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
> >>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - JSON Response Writer Plug-in (response-json)
> >>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Registered Extension-Points:
> >>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
> >>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
> >>> 2010-06-24 17:13:41,450 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
> >>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Field Filter (org.apache.nutch.indexer.field.FieldFilter)
> >>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> >>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
> >>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Search Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
> >>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> >>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
> >>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
> >>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> >>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
> >>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> >>> 2010-06-24 17:13:41,451 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
> >>>
> >>> There is no RegexURLNormalizer being loaded...
> >>>
> >>> On Thu, Jun 24, 2010 at 4:37 PM, Julien Nioche <[email protected]> wrote:
> >>>
> >>>> OK. Since you are in distributed mode it should use the content of the
> >>>> job file. Try deleting ./build/plugins to see if this changes anything
> >>>>
> >>>> On 24 June 2010 15:30, Hannes Carl Meyer <[email protected]> wrote:
> >>>>
> >>>>> Yep, did not work, although it displays "URL normalizing: true" in the
> >>>>> crawl process...
> >>>>> Also bin/nutch plugin ... does not work!
> >>>>>
> >>>>> On Thu, Jun 24, 2010 at 3:06 PM, Julien Nioche <[email protected]> wrote:
> >>>>>
> >>>>>> tried ant clean job?
> >>>>>>
> >>>>>>> I'm using Nutch 1.1 and starting an intranet crawl (jobtracker is
> >>>>>>> local).
> >>>>>>>
> >>>>>>> When executing bin/nutch plugin ... I'm getting a "Plugin
> >>>>>>> 'urlnormalizer-regex' not present or inactive.". conf/nutch-site.xml
> >>>>>>> contains the property plugin.includes including urlnormalizer-regex.
> >>>>>>>
> >>>>>>> Starting the RegexURLNormalizer from within Eclipse is fine and it is
> >>>>>>> doing its job.
> >>>>>>>
> >>>>>>> Regards
> >>>>>>>
> >>>>>>> Hannes
> >>>>>>>
> >>>>>>> On Thu, Jun 24, 2010 at 12:46 PM, Julien Nioche <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Have you tried using:
> >>>>>>>> ./nutch plugin urlnormalizer-regex org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer http://www.myinputurl.com
> >>>>>>>> That should help in finding where the problem is coming from.
> >>>>>>>>
> >>>>>>>> Are you running in distributed mode? Did you generate a new job file?
> >>>>>>>>
> >>>>>>>> J.
> >>>>>>>>
> >>>>>>>> On 24 June 2010 11:18, Hannes Carl Meyer <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> I'm trying to strip a parameter from URLs using the
> >>>>>>>>> RegexURLNormalizer. I added this to my nutch-site.xml:
> >>>>>>>>>
> >>>>>>>>> <property>
> >>>>>>>>>   <name>urlnormalizer.order</name>
> >>>>>>>>>   <value>org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
> >>>>>>>>> </property>
> >>>>>>>>>
> >>>>>>>>> <property>
> >>>>>>>>>   <name>urlnormalizer.regex.file</name>
> >>>>>>>>>   <value>regex-normalize.xml</value>
> >>>>>>>>> </property>
> >>>>>>>>>
> >>>>>>>>> And defined this expression rule:
> >>>>>>>>>
> >>>>>>>>> <regex>
> >>>>>>>>>   <pattern>(\?|&)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\?|&|#|$)</pattern>
> >>>>>>>>>   <substitution>$1$5</substitution>
> >>>>>>>>> </regex>
> >>>>>>>>>
> >>>>>>>>> (to strip the parameter IFLBSERVERID from the URL)
> >>>>>>>>>
> >>>>>>>>> The indexed documents still contain the parameter, and imho the
> >>>>>>>>> RegexURLNormalizer does not work. Is it something to do with
> >>>>>>>>> https://issues.apache.org/jira/browse/NUTCH-706 ?
> >>>>>>>>>
> >>>>>>>>> Thanks and regards
> >>>>>>>>>
> >>>>>>>>> Hannes
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>>
> >>>>>>>>> https://www.xing.com/profile/HannesCarl_Meyer
> >>>>>>>>> http://de.linkedin.com/in/hannescarlmeyer
> >>>>>>>>> http://twitter.com/hannescarlmeyer
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> DigitalPebble Ltd
> >>>>>>>>
> >>>>>>>> Open Source Solutions for Text Engineering
> >>>>>>>> http://www.digitalpebble.com
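
For anyone finding this thread later, reinhard's diagnosis can be reproduced in isolation. The following is only a minimal sketch, assuming that the plugin.includes value is compiled as a java.util.regex pattern and that each plugin id has to match it in full. The class PluginIncludesDemo and the helper isIncluded are made up for illustration and are not Nutch code, and the plugin list is shortened.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
final class PluginIncludesDemo {

    // Assumption: a plugin is active only if its id matches the whole
    // plugin.includes expression (full match, not a substring search).
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Shortened value with the stray newline before </value> kept in.
        String broken = "protocol-http|urlfilter-regex|urlnormalizer-(basic|pass|regex)\n";
        // Same value with surrounding whitespace removed.
        String fixed = broken.trim();

        // Earlier alternatives are unaffected by the trailing newline.
        System.out.println(isIncluded(broken, "protocol-http"));       // true
        // The newline became part of the last alternative, so the id alone no longer matches.
        System.out.println(isIncluded(broken, "urlnormalizer-regex")); // false
        // Without the newline the last alternative matches again.
        System.out.println(isIncluded(fixed, "urlnormalizer-regex"));  // true
    }
}
```

Under that assumption, only the last alternative in the list is affected, which would fit the log above: every other plugin is registered and only the URL normalizers are missing. Keeping the closing </value> on the same line as the end of the expression avoids the problem.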

