Awesome... Thank you very very much :-)

On Thu, Jun 24, 2010 at 6:55 PM, reinhard schwab <[email protected]> wrote:

> hi hannes,
>
> i have identified your problem.
> your nutch-site.xml plugin.includes property contains a newline after
> urlnormalizer-(basic|pass|regex), which breaks pattern matching in
> PluginRepository.java.
>
>  <property>
>    <name>plugin.includes</name>
>
>  
> <value>protocol-http|protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(basic|pass|regex)
> </value>
>  </property>
>
> if i remove the newline before </value>, it is ok.
>
> regards
> reinhard
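
A minimal sketch of the effect described above, assuming the plugin.includes value is compiled as a regular expression and each plugin id must match it as a whole (the class name, helper, and shortened value are hypothetical and are not Nutch code). With a trailing newline, the last alternative only matches ids that themselves end in a newline, so urlnormalizer-regex is rejected while the earlier alternatives still match:

```java
import java.util.regex.Pattern;

class PluginIncludesDemo {
    // Hypothetical helper: true when the entire plugin id matches the includes regex.
    static boolean isIncluded(String includesValue, String pluginId) {
        return Pattern.compile(includesValue).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Shortened stand-in for the plugin.includes value quoted above.
        String clean = "urlfilter-regex|scoring-opic|urlnormalizer-(basic|pass|regex)";
        // Same value with the newline that precedes </value> in the config file.
        String withNewline = clean + "\n";

        System.out.println(isIncluded(clean, "urlnormalizer-regex"));              // true
        System.out.println(isIncluded(withNewline, "urlnormalizer-regex"));        // false
        System.out.println(isIncluded(withNewline, "scoring-opic"));               // true
        System.out.println(isIncluded(withNewline.trim(), "urlnormalizer-regex")); // true
    }
}
```

Only the last alternative is affected, which fits the log further down: every other listed plugin is registered, and only the URL normalizers are missing.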
>
> Hannes Carl Meyer wrote:
> > Just tried it in nutch-1.0 with the same kind of behavior:
> >
> > hc.me...@server01:~/nutch-1.0> ./bin/nutch plugin urlnormalizer-regex
> > org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
> > http://www.myinputurl.com
> > Plugin 'urlnormalizer-regex' not present or inactive.
> >
> > (it is present and it is active through the plugin.includes property in
> > nutch-site.xml)
> >
> > On Thu, Jun 24, 2010 at 5:45 PM, Julien Nioche <
> > [email protected]> wrote:
> >
> >
> >> the clue might be in : /~/apache-nutch-1.1-bin/plugins
> >> regenerate the job then delete this directory. Check where it gets the
> >> plugins from in the log file
> >>
> >>
> >> On 24 June 2010 16:11, Hannes Carl Meyer <[email protected]> wrote:
> >>
> >>
> >>> Nope, that changes nothing. Just checked out my log file:
> >>>
> >>> 2010-06-24 17:13:40,410 INFO  plugin.PluginRepository - Plugins: looking in: /~/apache-nutch-1.1-bin/plugins
> >>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
> >>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository - Registered Plugins:
> >>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
> >>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
> >>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
> >>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
> >>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Site Query Filter (query-site)
> >>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Http / Https Protocol Plug-in (protocol-httpclient)
> >>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         Basic Summarizer Plug-in (summary-basic)
> >>> 2010-06-24 17:13:41,439 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
> >>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Text Parse Plug-in (parse-text)
> >>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
> >>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
> >>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         XML Response Writer Plug-in (response-xml)
> >>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
> >>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Tika Parser Plug-in (parse-tika)
> >>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
> >>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Anchor Indexing Filter (index-anchor)
> >>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         JavaScript Parser (parse-js)
> >>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         URL Query Filter (query-url)
> >>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
> >>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         JSON Response Writer Plug-in (response-json)
> >>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository - Registered Extension-Points:
> >>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
> >>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
> >>> 2010-06-24 17:13:41,450 INFO  plugin.PluginRepository -         Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
> >>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Field Filter (org.apache.nutch.indexer.field.FieldFilter)
> >>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> >>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
> >>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Search Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
> >>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> >>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
> >>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
> >>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> >>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
> >>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> >>> 2010-06-24 17:13:41,451 INFO  plugin.PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)
> >>>
> >>> There is no RegexURLNormalizer being loaded...
> >>>
> >>>
> >>> On Thu, Jun 24, 2010 at 4:37 PM, Julien Nioche <
> >>> [email protected]> wrote:
> >>>
> >>>
> >>>> OK. Since you are in distributed mode it should use the content of the
> >>>> job file. Try deleting ./build/plugins to see if this changes anything
> >>>>
> >>>>
> >>>> On 24 June 2010 15:30, Hannes Carl Meyer <[email protected]> wrote:
> >>>>
> >>>>
> >>>>> Jep, did not work, although it displays: "URL normalizing: true" in
> >>>>> the crawl process...
> >>>>> Also bin/nutch plugin ... does not work!
> >>>>>
> >>>>>
> >>>>> On Thu, Jun 24, 2010 at 3:06 PM, Julien Nioche <
> >>>>> [email protected]> wrote:
> >>>>>
> >>>>>
> >>>>>> tried ant clean job?
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> I'm using Nutch 1.1 and starting an intranet crawl (jobtracker is
> >>>>>>> local).
> >>>>>>>
> >>>>>>> When executing bin/nutch plugin ... I'm getting a "Plugin
> >>>>>>> 'urlnormalizer-regex' not present or inactive.". conf/nutch-site.xml
> >>>>>>> contains the property plugin.includes including urlnormalizer-regex.
> >>>>>>>
> >>>>>>>
> >>>>>>> Starting the RegexURLNormalizer from within Eclipse is fine and it
> >>>>>>> is doing its job.
> >>>>>>>
> >>>>>>> Regards
> >>>>>>>
> >>>>>>> Hannes
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Jun 24, 2010 at 12:46 PM, Julien Nioche <
> >>>>>>> [email protected]> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Have you tried using :
> >>>>>>>> *./nutch plugin urlnormalizer-regex
> >>>>>>>> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
> >>>>>>>> http://www.myinputurl.com*
> >>>>>>>> that should help finding where the problem is coming from.
> >>>>>>>>
> >>>>>>>> Are you running in distributed mode? Did you generate a new job
> >>>>>>>> file?
> >>>>>>>>
> >>>>>>>> J.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 24 June 2010 11:18, Hannes Carl Meyer <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> I'm trying to strip a parameter from URLs using the
> >>>>>>>>> RegexURLNormalizer. I
> >>>>>>>>> added this to my nutch-site.xml:
> >>>>>>>>>
> >>>>>>>>>    <property>
> >>>>>>>>>        <name>urlnormalizer.order</name>
> >>>>>>>>>        <value>org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
> >>>>>>>>>    </property>
> >>>>>>>>>
> >>>>>>>>>    <property>
> >>>>>>>>>        <name>urlnormalizer.regex.file</name>
> >>>>>>>>>        <value>regex-normalize.xml</value>
> >>>>>>>>>    </property>
> >>>>>>>>>
> >>>>>>>>> And defined this expression rule:
> >>>>>>>>>
> >>>>>>>>> <regex>
> >>>>>>>>>  <pattern>(\?|&amp;)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\?|&amp;|#|$)</pattern>
> >>>>>>>>>  <substitution>$1$5</substitution>
> >>>>>>>>> </regex>
> >>>>>>>>>
> >>>>>>>>> (to strip the parameter IFLBSERVERID from the URL)
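
A rough sketch of what this rule does to a URL once it is actually applied, assuming the pattern is used with Java's java.util.regex and every match is replaced by the substitution (the class name, method name, and example URLs are hypothetical). The &amp; entities from the XML file appear as plain & here, as they would after XML parsing:

```java
import java.util.regex.Pattern;

class StripSessionParamDemo {
    // Pattern from the rule above, with &amp; unescaped to & as an XML parser delivers it.
    static final Pattern RULE = Pattern.compile(
        "(\\?|&)([;_]?((?i)l|j|bv_|ps_)?((?i)s|sid|IFLBSERVERID)=.*?)(\\?|&|#|$)");

    // Replaces each match with group 1 (leading ? or &) plus group 5 (trailing delimiter).
    static String normalize(String url) {
        return RULE.matcher(url).replaceAll("$1$5");
    }

    public static void main(String[] args) {
        // prints http://www.example.com/page?&x=1
        System.out.println(normalize("http://www.example.com/page?IFLBSERVERID=abc&x=1"));
        // prints http://www.example.com/page?x=1&
        System.out.println(normalize("http://www.example.com/page?x=1&IFLBSERVERID=abc"));
    }
}
```

The rule keeps both delimiters, so a leftover "?&" or trailing "&" remains in the result; removing those would need a separate clean-up rule.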
> >>>>>>>>>
> >>>>>>>>> The indexed documents still contain the parameter, and imho the
> >>>>>>>>> RegexURLNormalizer does not work. Is it something with:
> >>>>>>>>> https://issues.apache.org/jira/browse/NUTCH-706 ?
> >>>>>>>>>
> >>>>>>>>> Thanks and regards
> >>>>>>>>>
> >>>>>>>>> Hannes
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>>
> >>>>>>>>> https://www.xing.com/profile/HannesCarl_Meyer
> >>>>>>>>> http://de.linkedin.com/in/hannescarlmeyer
> >>>>>>>>> http://twitter.com/hannescarlmeyer
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> DigitalPebble Ltd
> >>>>>>>>
> >>>>>>>> Open Source Solutions for Text Engineering
> >>>>>>>> http://www.digitalpebble.com
> >>>>>>>>
> >>>>>>>>
