Re: need your support
Hi Sahar,

Can you post your:
1. crawl-urlfilter
2. nutch-site.xml

Also, how are you running this program below? I'm CC'ing nutch-user@ so the community can benefit from this thread.

Cheers,
Chris

On 1/20/10 1:42 PM, sahar elkazaz saharelka...@hotmail.com wrote:

Dear sir,

I have followed all the steps in your article to run Nutch, and I use this Java program to access the segments:

package nutch;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.searcher.Summary;
import org.apache.nutch.util.NutchConfiguration;

public class nutch {
  /** For debugging. */
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    NutchBean bean = new NutchBean(conf);
    Query query = Query.parse("animal", conf);
    Hits hits = bean.search(query, 10);
    System.out.println("Total hits: " + hits.getTotal());
    int length = (int) Math.min(hits.getTotal(), 10);
    Hit[] show = hits.getHits(0, length);
    HitDetails[] details = bean.getDetails(show);
    Summary[] summaries = bean.getSummary(details, query);
    for (int i = 0; i < summaries.length; i++) {
      System.out.println("hh");
      System.out.println(" " + i + " " + details[i] + "\n" + summaries[i]);
    }
  }
}

I added the path of Nutch to the classpath, but I receive exceptions:

10/01/20 22:29:27 WARN fs.FileSystem: uri=file:///
javax.security.auth.login.LoginException: Login failed:
    at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
    at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
    at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
    at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
    at
org.apache.hadoop.fs.FileSystem$Cache$Key.init(FileSystem.java:1438)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
    at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:89)
    at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:77)
    at nutch.nutch.main(nutch.java:25)
10/01/20 22:29:28 WARN fs.FileSystem: uri=file:///
javax.security.auth.login.LoginException: Login failed:
    at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
    at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
    at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
    at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
    at org.apache.hadoop.fs.FileSystem$Cache$Key.init(FileSystem.java:1438)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
    at org.apache.nutch.searcher.LuceneSearchBean.init(LuceneSearchBean.java:50)
    at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:102)
    at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:77)
    at nutch.nutch.main(nutch.java:25)
10/01/20 22:29:28 INFO searcher.SearchBean: opening indexes in crawl/indexes
10/01/20 22:29:28 WARN fs.FileSystem: uri=file:///
javax.security.auth.login.LoginException: Login failed:
    at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
    at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
    at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
    at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
    at
org.apache.hadoop.fs.FileSystem$Cache$Key.init(FileSystem.java:1438)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
    at org.apache.nutch.searcher.IndexSearcher.init(IndexSearcher.java:59)
    at org.apache.nutch.searcher.LuceneSearchBean.init(LuceneSearchBean.java:77)
    at org.apache.nutch.searcher.LuceneSearchBean.init(LuceneSearchBean.java:51)
    at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:102)
Re: OR support
Nobody? Please, any answer would be good. -- View this message in context: http://old.nabble.com/OR-support-tp26680899p26779229.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: OR support
On 2009-12-14 16:05, BrunoWL wrote:
Nobody? Please, any answer would be good.

Please check this issue: https://issues.apache.org/jira/browse/NUTCH-479

That's the current status, i.e. this functionality is available only as a patch.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web / Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
OR support
Hi! Did anybody add the OR search operator to Nutch 1.0 successfully? I found a patch for the 0.9 version, but it doesn't work. Thanks. -- View this message in context: http://old.nabble.com/OR-support-tp26680899p26680899.html Sent from the Nutch - User mailing list archive at Nabble.com.
support for robot rules that include a wild card
I'm using nutch-1.0 and have noticed after running some tests that the robots rules parser does not support wildcards (a.k.a. globbing) in rules. This means such a rule will not work the way the person who wrote the robots.txt file expected it to. For example:

User-Agent: *
Disallow: /somepath/*/someotherpath

Even Yahoo has such a rule ( http://m.www.yahoo.com/robots.txt ):

User-agent: *
Disallow: /p/
Disallow: /r/
Disallow: /*?

With the popularity of the wildcard (*) in robots.txt files these days, what are the plans/thoughts on adding support for it in Nutch?

Thanks,
Jason
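For readers wondering what wildcard matching involves, here is a minimal, hypothetical sketch (the class name and every detail are my own, not Nutch's RobotRulesParser) of how a Disallow pattern with '*' wildcards and the optional '$' end anchor from the Google/Yahoo extensions could be compiled to a regex and checked against a URL path:

```java
import java.util.regex.Pattern;

// Illustrative sketch only -- not Nutch code. Compiles a robots.txt path
// pattern into a regex and tests whether a URL path is disallowed by it.
public class WildcardRule {

    // Build a regex from the rule: '*' spans any characters; without a
    // trailing '$' anchor, robots rules are prefix matches.
    static Pattern compile(String rule) {
        StringBuilder re = new StringBuilder();
        boolean anchored = rule.endsWith("$");
        if (anchored) {
            rule = rule.substring(0, rule.length() - 1);
        }
        for (char c : rule.toCharArray()) {
            if (c == '*') {
                re.append(".*");                          // wildcard
            } else {
                re.append(Pattern.quote(String.valueOf(c))); // literal char
            }
        }
        if (!anchored) {
            re.append(".*");                              // prefix semantics
        }
        return Pattern.compile(re.toString());
    }

    static boolean disallowed(String rule, String path) {
        return compile(rule).matcher(path).matches();
    }

    public static void main(String[] args) {
        System.out.println(disallowed("/somepath/*/someotherpath", "/somepath/x/someotherpath")); // true
        System.out.println(disallowed("/*?", "/search?q=nutch")); // true
        System.out.println(disallowed("/p/", "/p/article"));      // true
        System.out.println(disallowed("/p/", "/q/article"));      // false
    }
}
```

A plain prefix-only parser would treat the '*' as a literal character, which is exactly why the Yahoo rules above silently fail to match.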
Re: support for robot rules that include a wild card
Hi Jason,

I've been spending some time on an improved robots.txt parser as part of my Bixo project. One aspect is support for Google's wildcard extensions. I think this will be part of the proposed crawler-commons project, where we'll put components that can/should be shared between Nutch, Bixo, Heritrix and Droids.

One thing that would be useful is to collect examples of advanced robots.txt files, in addition to broken ones. It would be great if you could open a Jira issue and attach specific examples of the above that you know about.

Thanks!

-- Ken

On Nov 19, 2009, at 11:31am, J.G.Konrad wrote:

I'm using nutch-1.0 and have noticed after running some tests that the robots rules parser does not support wildcards (a.k.a. globbing) in rules. This means such a rule will not work the way the person who wrote the robots.txt file expected it to. For example:

User-Agent: *
Disallow: /somepath/*/someotherpath

Even Yahoo has such a rule ( http://m.www.yahoo.com/robots.txt ):

User-agent: *
Disallow: /p/
Disallow: /r/
Disallow: /*?

With the popularity of the wildcard (*) in robots.txt files these days, what are the plans/thoughts on adding support for it in Nutch?

Thanks,
Jason

Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
Re: Multilanguage support in Nutch 1.0
On Wed, Sep 30, 2009 at 01:12, BELLINI ADAM mbel...@msn.com wrote:

hi, try to activate the language-identifier plugin. You must add it in the nutch-site.xml file, in the <name>plugin.includes</name> section.

Shame on me! Thanks a lot.

it's something like this:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.</description>
</property>

From: da...@jashi.ge Date: Tue, 29 Sep 2009 18:59:52 +0400 Subject: Multilanguage support in Nutch 1.0 To: nutch-user@lucene.apache.org

Hello, all. I've got a bit of trouble with Nutch 1.0 and multilanguage support: I have a fresh install of Nutch and two analysis plugins I'd like to turn on: analysis-de (German) and analysis-ge (Georgian). Here are the innards of my seed file:

---
http://212.72.133.54/l/test.html
http://212.72.133.54/l/de.html
---

The first is Georgian, the other German.
When I run bin/nutch crawl seed -dir crawl -threads 10 -depth 2 there is not the slightest sign of anything calling the analysis plug-ins, even though hadoop.log clearly states that they are registered and active (full log in the original message of this thread).
Re: Multilanguage support in Nutch 1.0
On Wed, Sep 30, 2009 at 01:12, BELLINI ADAM mbel...@msn.com wrote:

hi, try to activate the language-identifier plugin. You must add it in the nutch-site.xml file, in the <name>plugin.includes</name> section.

Ooops. It IS activated:

2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Language Identification Parser/Filter (language-identifier)

But fetched pages are not passed to it, as I reckon.
RE: Multilanguage support in Nutch 1.0
hi,

do you have some 'lang' metadata on the pages? Because the plugin first tries to get the language from metadata. You can see it in the Java source of the plugin, LanguageIndexingFilter.java:

// check if LANGUAGE found, possibly put there by HTMLLanguageParser
String lang = parse.getData().getParseMeta().get(Metadata.LANGUAGE);

// check if HTTP-header tells us the language
if (lang == null) {
  lang = parse.getData().getContentMeta().get(Response.CONTENT_LANGUAGE);
}

Also try using Luke to check all your metadata in the index.

From: da...@jashi.ge Date: Wed, 30 Sep 2009 17:22:26 +0400 Subject: Re: Multilanguage support in Nutch 1.0 To: nutch-user@lucene.apache.org

On Wed, Sep 30, 2009 at 01:12, BELLINI ADAM mbel...@msn.com wrote:

hi, try to activate the language-identifier plugin. You must add it in the nutch-site.xml file, in the <name>plugin.includes</name> section.

Ooops. It IS activated:

2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Language Identification Parser/Filter (language-identifier)

But fetched pages are not passed to it, as I reckon.
Multilanguage support in Nutch 1.0
Hello, all. I've got a bit of trouble with Nutch 1.0 and multilanguage support: I have a fresh install of Nutch and two analysis plugins I'd like to turn on: analysis-de (German) and analysis-ge (Georgian). Here are the innards of my seed file:

---
http://212.72.133.54/l/test.html
http://212.72.133.54/l/de.html
---

The first is Georgian, the other German. When I run bin/nutch crawl seed -dir crawl -threads 10 -depth 2 there is not the slightest sign of anything calling the analysis plug-ins, even though hadoop.log clearly states that they are on and active:

---
2009-09-29 16:39:13,328 INFO crawl.Crawl - crawl started in: crawl
2009-09-29 16:39:13,328 INFO crawl.Crawl - rootUrlDir = seed
2009-09-29 16:39:13,328 INFO crawl.Crawl - threads = 10
2009-09-29 16:39:13,328 INFO crawl.Crawl - depth = 2
2009-09-29 16:39:13,375 INFO crawl.Injector - Injector: starting
2009-09-29 16:39:13,375 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
2009-09-29 16:39:13,375 INFO crawl.Injector - Injector: urlDir: seed
2009-09-29 16:39:13,390 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2009-09-29 16:39:13,421 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-09-29 16:39:15,546 INFO plugin.PluginRepository - Plugins: looking in: C:\cygwin\opt\nutch\plugins
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Registered Plugins:
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Lucene Analysers (lib-lucene-analyzers)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Language Identification Parser/Filter (language-identifier)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html) !
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Georgian Analysis Plug-in (analysis-ge)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - German Analysis Plug-in (analysis-de) !
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Site Query Filter (query-site)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - More Query Filter (query-more)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - URL Query Filter (query-url)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
---

At the same time:

---
2009-09-29 16:39:54,406 INFO lang.LanguageIdentifier - Language identifier configuration [1-4/2048]
2009-09-29 16:39:54,609 INFO lang.LanguageIdentifier - Language identifier plugin supports: it(1000) is(1000) hu(1000) th(1000) sv(1000) ge(1000) fr(1000) ru(1000) fi(1000) es(1000) en(1000) el(1000) ee(1000) pt(1000) de(1000) da(1000) pl(1000) no(1000) nl(1000)
---

The language identifier itself works like a charm at the same time:

---
$ bin/nutch plugin language-identifier org.apache.nutch.analysis.lang.LanguageIdentifier -identifyurl http://212.72.133.54/l/test.html
text was identified as ge
---
$ bin/nutch plugin language-identifier org.apache.nutch.analysis.lang.LanguageIdentifier -identifyurl http://212.72.133.54/l/de.html
text was identified as de
---

What could have possibly gone wrong?

With respect, David Jashi
RE: Multilanguage support in Nutch 1.0
hi, try to activate the language-identifier plugin. You must add it in the nutch-site.xml file, in the <name>plugin.includes</name> section. It's something like this:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.</description>
</property>

From: da...@jashi.ge Date: Tue, 29 Sep 2009 18:59:52 +0400 Subject: Multilanguage support in Nutch 1.0 To: nutch-user@lucene.apache.org

Hello, all. I've got a bit of trouble with Nutch 1.0 and multilanguage support: I have a fresh install of Nutch and two analysis plugins I'd like to turn on: analysis-de (German) and analysis-ge (Georgian). Here are the innards of my seed file:

---
http://212.72.133.54/l/test.html
http://212.72.133.54/l/de.html
---

The first is Georgian, the other German.
When I run bin/nutch crawl seed -dir crawl -threads 10 -depth 2 there is not the slightest sign of anything calling the analysis plug-ins, even though hadoop.log clearly states that they are registered and active (full log in the original message of this thread).
Development support
Hi, we're looking for a Nutch developer to implement some plugins for us in the next few weeks. Substantial knowledge of Nutch, Java and databases is needed. If you're interested, please contact me (koch at huberverlag dot de). Thanks in advance, Martina
Re: Support needed
As a very old Nutch user, a developer of plugins, who has even implemented Nutch in some products, I could help you. I am based in Houston, Texas -- Skype me on hooduku.

sudhi

--- On Mon, 7/27/09, sf30098 sf30...@yahoo.com wrote:

From: sf30098 sf30...@yahoo.com Subject: Support needed To: nutch-user@lucene.apache.org Date: Monday, July 27, 2009, 4:01 PM

I need someone with substantial knowledge of Nutch, Java and Lucene who has customised the system before. In particular, this should be related to image indexing and, if possible, geo-positioning (either one is good as well). The job role will be providing support and advice on how to go about implementing such a system. This includes: 1. answering questions and providing guidance on implementation 2. reviewing code and providing suggestions on how to improve it. Please let me know if you're interested. -- View this message in context: http://www.nabble.com/Support-needed-tp24688172p24688172.html Sent from the Nutch - User mailing list archive at Nabble.com.
Support needed
I need someone with substantial knowledge of Nutch, Java and Lucene who has customised the system before. In particular, this should be related to image indexing and, if possible, geo-positioning (either one is good as well). The job role will be providing support and advice on how to go about implementing such a system. This includes: 1. answering questions and providing guidance on implementation 2. reviewing code and providing suggestions on how to improve it. Please let me know if you're interested. -- View this message in context: http://www.nabble.com/Support-needed-tp24688172p24688172.html Sent from the Nutch - User mailing list archive at Nabble.com.
Multi-Lingual Support in Nutch
Hello, I am using Nutch 0.9. I would like to enable multi-lingual support in our existing system. I read the article on Multi-Lingual Support in Nutch by Jérôme Charron, but it is about previous versions of Nutch. I included the plugin in nutch-site.xml as analysis-es. What other steps should be followed to enable multi-lingual support? Thanks. Regards, Kunal
Professional Nutch Support and Distribution
Wanted to gauge community interest in having a certified Nutch distribution with support? Similar to what Lucid Imagination is doing for Solr and Lucene and what Cloudera is providing for Hadoop. Anybody interested? Dennis
Re: Professional Nutch Support and Distribution
This sounds interesting. I might be interested in this. Marc Boucher http://hyperix.com On Tue, Mar 17, 2009 at 12:31 PM, Dennis Kubes ku...@apache.org wrote: Wanted to gauge community interest in having a certified Nutch distribution with support? Similar to what Lucid Imagination is doing for Solr and Lucene and what Cloudera is providing for Hadoop. Anybody interested? Dennis
Does Nutch support the boolean OR operator in a search query?
Hi, Does Nutch support the boolean OR operator (or something similar) in a search query? I mean, is there any class already available to do this? The Nutch search interface doesn't seem to have this option. Expected functionality: if I ask it to search for (Post Graduate) OR (Masters), it should fetch the pages which contain at least one of {Post Graduate, Masters}. Thank you, Ram.
Re: Does Nutch support the boolean OR operator in a search query?
Hi,

On Mon, Jan 19, 2009 at 4:02 PM, M S Ram ms...@cse.iitk.ac.in wrote:

Hi, Does Nutch support the boolean OR operator (or something similar) in a search query? I mean, is there any class already available to do this? The Nutch search interface doesn't seem to have this option. Expected functionality: if I ask it to search for (Post Graduate) OR (Masters), it should fetch the pages which contain at least one of {Post Graduate, Masters}.

Unfortunately no. There is an issue with a patch, https://issues.apache.org/jira/browse/NUTCH-479 but nothing has happened for a while.

Thank you, Ram.

-- Doğacan Güney
Re: Does Nutch support the boolean OR operator in a search query?
Oh! That's sad! :( What is the best approach to provide an OR search now? Should I go down to Lucene? Does Lucene understand HDFS? Please help me with the appropriate guidelines. Thank you, Ram

Doğacan Güney wrote:

Hi, On Mon, Jan 19, 2009 at 4:02 PM, M S Ram ms...@cse.iitk.ac.in wrote: Hi, Does Nutch support the boolean OR operator (or something similar) in a search query? I mean, is there any class already available to do this? The Nutch search interface doesn't seem to have this option. Expected functionality: if I ask it to search for (Post Graduate) OR (Masters), it should fetch the pages which contain at least one of {Post Graduate, Masters}. Unfortunately no. There is an issue with a patch, https://issues.apache.org/jira/browse/NUTCH-479 but nothing has happened for a while. Thank you, Ram.
Re: Does Nutch support the boolean OR operator in a search query?
Lucene has support for OR queries, so it should be possible to do it, but support for this in Nutch isn't available as far as I know. I'd also be interested if anyone has managed to implement this.

On Tue, Jan 20, 2009 at 1:50 AM, M S Ram ms...@cse.iitk.ac.in wrote:

Oh! That's sad! :( What is the best approach to provide an OR search now? Should I go down to Lucene? Does Lucene understand HDFS? Please help me with the appropriate guidelines. Thank you, Ram

Doğacan Güney wrote: Hi, On Mon, Jan 19, 2009 at 4:02 PM, M S Ram ms...@cse.iitk.ac.in wrote: Hi, Does Nutch support the boolean OR operator (or something similar) in a search query? I mean, is there any class already available to do this? The Nutch search interface doesn't seem to have this option. Expected functionality: if I ask it to search for (Post Graduate) OR (Masters), it should fetch the pages which contain at least one of {Post Graduate, Masters}. Unfortunately no. There is an issue with a patch, https://issues.apache.org/jira/browse/NUTCH-479 but nothing has happened for a while. Thank you, Ram.
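For anyone following along, the semantic difference under discussion can be sketched with a toy inverted index. This is purely an illustration of implicit-AND (intersection) versus OR (union) over posting lists, under my own made-up class and method names -- it is not Nutch or Lucene code:

```java
import java.util.*;

// Toy inverted index illustrating implicit-AND versus OR query semantics.
public class OrQuerySketch {

    static Map<String, Set<Integer>> index = new HashMap<>();

    static void add(int doc, String... terms) {
        for (String t : terms) {
            index.computeIfAbsent(t, k -> new TreeSet<>()).add(doc);
        }
    }

    // Implicit AND (Nutch's default): intersect the terms' posting lists.
    static Set<Integer> and(String... terms) {
        Set<Integer> result = null;
        for (String t : terms) {
            Set<Integer> postings = index.getOrDefault(t, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(postings);
            } else {
                result.retainAll(postings);
            }
        }
        return result == null ? Collections.emptySet() : result;
    }

    // OR (what NUTCH-479 asks for): union the terms' posting lists.
    static Set<Integer> or(String... terms) {
        Set<Integer> result = new TreeSet<>();
        for (String t : terms) {
            result.addAll(index.getOrDefault(t, Collections.emptySet()));
        }
        return result;
    }

    public static void main(String[] args) {
        add(1, "post", "graduate");
        add(2, "masters");
        add(3, "post", "graduate", "masters");
        System.out.println(and("post", "graduate"));   // [1, 3]
        System.out.println(or("graduate", "masters")); // [1, 2, 3]
    }
}
```

The union is what the "(Post Graduate) OR (Masters)" query from earlier in the thread expects; with implicit AND, document 2 would never be returned.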
does nutch support crawling cold fusion pages?
Hi, Does anyone know if there is a plugin for ColdFusion pages, or whether they are supported? I'm trying to crawl http://www.knowitall.org/naturalstate Thanks in advance, Alex
What kind of searches does Nutch support?
What kind of searches does Nutch support?
Missing zh.ngp for zh locate support for language Identifier
Hi all, I found that zh.ngp for the zh locale is missing. I have seen this file in a screenshot, but googling the filename returned nothing for me. Can anyone provide this file for me? Thank you -- View this message in context: http://www.nabble.com/Missing-zh.ngp-for-zh-locate-support-for-language-Identifier-tp16068532p16068532.html Sent from the Nutch - User mailing list archive at Nabble.com.
Support Hardware and OS for nutch and hadoop
Hello friends, I am gathering information on supported hardware and OS for Nutch and Hadoop. I did not find any conclusive information by going through the Nutch wiki. If I want to build a cluster of nodes using Nutch/Hadoop for crawling, then what are my options for hardware and OS?
Prefix Query in Nutch and Wildcard support.
Hello friends, Is there any way to do a prefix query in Nutch? E.g. query the content field for the occurrence of abc*? I could do it in Lucene, but I want to do it in Nutch. Going through the mailing list it appeared that Nutch does not support such queries. Is that true? Thanks!
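For context on what a prefix query costs, here is a self-contained sketch of how a query like content:abc* is typically answered by enumerating a sorted term dictionary -- roughly the idea behind Lucene's PrefixQuery, though the class and method names here are hypothetical, not any real API:

```java
import java.util.*;

// Sketch: answer a prefix query by walking a sorted term dictionary.
public class PrefixSketch {

    static List<String> termsWithPrefix(NavigableSet<String> dictionary, String prefix) {
        List<String> matches = new ArrayList<>();
        // tailSet jumps straight to the first term >= prefix in sorted order...
        for (String term : dictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // ...and we stop as soon as terms no longer share the prefix
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        NavigableSet<String> dict = new TreeSet<>(
            Arrays.asList("abacus", "abc", "abcdef", "abcz", "abd", "zebra"));
        System.out.println(termsWithPrefix(dict, "abc")); // [abc, abcdef, abcz]
    }
}
```

In a full engine, each matched term would then be expanded into its posting list and the results merged, which is exactly why large-scale engines (and Nutch's restricted query syntax) tend to avoid wildcard/prefix queries: a short prefix can expand to a huge number of terms.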
Re: NUTCH-479 Support for OR queries - what is this about
Thanks for the answer. That was helpful. I was sooo wrong.

On 7/7/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:

Briggs wrote: Please keep this thread going as I am also curious to know why this has been 'forked'. I am sure that most of this lies within the original OPIC filter, but I still can't understand why straightforward Lucene queries have not been used within the application.

No, this has actually almost nothing to do with the scoring filters (which were added much later). The decision to use a different query syntax than the one from Lucene was motivated by a few reasons:

* to avoid the need to support low-level index and searcher operations, which the Lucene API would require us to implement.

* to keep the Nutch core largely independent of Lucene, so that it's possible to use Nutch with different back-end searcher implementations. This started to materialize only now, with the ongoing effort to use Solr as a possible backend.

* to limit the query syntax to those queries that provide the best tradeoff between functionality and performance in a large-scale search engine.

On 7/6/07, Kai_testing Middleton [EMAIL PROTECTED] wrote: Ok, so I guess what I don't understand is: what is the Nutch query syntax?

Query syntax is defined in an informal way on the Help page in nutch.war, or here: http://wiki.apache.org/nutch/Features Formal syntax definition can be gleaned from org.apache.nutch.analysis.NutchAnalysis.jj.

The main discussion I found on nutch-user is this: http://osdir.com/ml/search.nutch.devel/2004-02/msg7.html I was wondering why the query syntax is so limited. There are no OR queries, there are no fielded queries, or fuzzy, or approximate... Why? The underlying index supports all these operations.

Actually, it's possible to configure Nutch to allow raw field queries - you need to add a raw field query plugin for this. Please see the RawFieldQueryFilter class, and the existing plugins that use fielded queries: query-site and query-more.
Query-more / DateQueryFilter is especially interesting, because it shows how to use raw token values from a parsed query to build complex Lucene queries.

I notice by looking at the or.patch file (https://issues.apache.org/jira/secure/attachment/12360659/or.patch) that one of the programs under consideration is: nutch/searcher/Query.java The code for this is distinct from lucene/search/Query.java

See above - they are completely different classes, with completely different purposes. The use of the same class name is unfortunate and misleading. The Nutch Query class is intended to express queries entered by search engine users, in a tokenized and parsed way, so that the rest of Nutch may deal with Clauses, Terms and Phrases instead of plain String-s. On the other hand, the Lucene Query class is intended to express arbitrarily complex Lucene queries - many of these queries would be prohibitively expensive for a large search engine (e.g. wildcard queries).

It looks like this is an architecture issue that I don't understand. If nutch is an extension of lucene, why does it define a different Query class?

Nutch is NOT an extension of Lucene. It's an application that uses Lucene as a library.

Why don't we just use the Lucene code to query the indexes? Does this have something to do with the nutch webapp (nutch.war)? What is the historical genesis of this issue (or is that even relevant)?

The Nutch webapp doesn't have anything to do with it. The limitations in the query syntax have different roots (see above).

-- Best regards, Andrzej Bialecki, Information Retrieval, Semantic Web, Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com

-- Conscious decisions by conscious minds are what make reality real
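To make the contrast concrete, here is a small self-contained sketch of the Nutch-style query model Andrzej describes - a flat, tokenized list of clauses with implicit AND and NOT, rather than an arbitrary Lucene query tree. The class and method names (SimpleQuery, Clause, matches) are invented for illustration and are not the actual Nutch classes:

```java
import java.util.*;

// A Nutch-style query: a flat list of clauses produced by a tokenizing
// parser, evaluated with implicit AND over required terms and NOT over
// prohibited ("-term") ones. No OR, no fields, no wildcards - by design.
final class SimpleQuery {
    record Clause(String term, boolean prohibited) {}
    private final List<Clause> clauses = new ArrayList<>();

    static SimpleQuery parse(String input) {
        SimpleQuery q = new SimpleQuery();
        for (String tok : input.trim().toLowerCase().split("\\s+")) {
            if (tok.startsWith("-")) q.clauses.add(new Clause(tok.substring(1), true));
            else q.clauses.add(new Clause(tok, false));
        }
        return q;
    }

    boolean matches(Set<String> docTerms) {
        for (Clause c : clauses) {
            boolean present = docTerms.contains(c.term());
            // fail on a missing required term or a present prohibited term
            if (c.prohibited() == present) return false;
        }
        return true;
    }
}
```

Restricting the model to this shape is what lets the rest of the engine reason about Clauses and Terms instead of opaque query strings.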
Re: NUTCH-479 Support for OR queries - what is this about
Please keep this thread going as I am also curious to know why this has been 'forked'. I am sure that most of this lies within the original OPIC filter but I still can't understand why straightforward Lucene queries have not been used within the application.

On 7/6/07, Kai_testing Middleton [EMAIL PROTECTED] wrote: I've been reading up on NUTCH-479 Support for OR queries but I must be missing something obvious because I don't understand what the JIRA is about: https://issues.apache.org/jira/browse/NUTCH-479 Description: There have been many requests from users to extend Nutch query syntax to add support for OR queries, in addition to the implicit AND and NOT queries supported now. Ok, so I guess what I don't understand is what is the Nutch query syntax? The main discussion I found on nutch-user is this: http://osdir.com/ml/search.nutch.devel/2004-02/msg7.html I was wondering why the query syntax is so limited. There are no OR queries, there are no fielded queries, or fuzzy, or approximate... Why? The underlying index supports all these operations. I notice by looking at the or.patch file (https://issues.apache.org/jira/secure/attachment/12360659/or.patch) that one of the programs under consideration is: nutch/searcher/Query.java The code for this is distinct from lucene/search/Query.java It looks like this is an architecture issue that I don't understand. If nutch is an extension of lucene, why does it define a different Query class? Why don't we just use the Lucene code to query the indexes? Does this have something to do with the nutch webapp (nutch.war)? What is the historical genesis of this issue (or is that even relevant)?

-- Conscious decisions by conscious minds are what make reality real
Re: NUTCH-479 Support for OR queries - what is this about
Briggs wrote: Please keep this thread going as I am also curious to know why this has been 'forked'. I am sure that most of this lies within the original OPIC filter but I still can't understand why straightforward Lucene queries have not been used within the application.

No, this has actually almost nothing to do with the scoring filters (which were added much later). The decision to use a different query syntax than the one from Lucene was motivated by a few reasons:

* to avoid the need to support low-level index and searcher operations, which the Lucene API would require us to implement.
* to keep the Nutch core largely independent of Lucene, so that it's possible to use Nutch with different back-end searcher implementations. This started to materialize only now, with the ongoing effort to use Solr as a possible backend.
* to limit the query syntax to those queries that provide the best tradeoff between functionality and performance in a large-scale search engine.

On 7/6/07, Kai_testing Middleton [EMAIL PROTECTED] wrote: Ok, so I guess what I don't understand is what is the Nutch query syntax?

Query syntax is defined in an informal way on the Help page in nutch.war, or here: http://wiki.apache.org/nutch/Features Formal syntax definition can be gleaned from org.apache.nutch.analysis.NutchAnalysis.jj.

The main discussion I found on nutch-user is this: http://osdir.com/ml/search.nutch.devel/2004-02/msg7.html I was wondering why the query syntax is so limited. There are no OR queries, there are no fielded queries, or fuzzy, or approximate... Why? The underlying index supports all these operations.

Actually, it's possible to configure Nutch to allow raw field queries - you need to add a raw field query plugin for this. Please see the RawFieldQueryFilter class, and existing plugins that use fielded queries: query-site and query-more.
Query-more / DateQueryFilter is especially interesting, because it shows how to use raw token values from a parsed query to build complex Lucene queries.

I notice by looking at the or.patch file (https://issues.apache.org/jira/secure/attachment/12360659/or.patch) that one of the programs under consideration is: nutch/searcher/Query.java The code for this is distinct from lucene/search/Query.java

See above - they are completely different classes, with completely different purposes. The use of the same class name is unfortunate and misleading. The Nutch Query class is intended to express queries entered by search engine users, in a tokenized and parsed way, so that the rest of Nutch may deal with Clauses, Terms and Phrases instead of plain String-s. On the other hand, the Lucene Query class is intended to express arbitrarily complex Lucene queries - many of these queries would be prohibitively expensive for a large search engine (e.g. wildcard queries).

It looks like this is an architecture issue that I don't understand. If nutch is an extension of lucene, why does it define a different Query class?

Nutch is NOT an extension of Lucene. It's an application that uses Lucene as a library.

Why don't we just use the Lucene code to query the indexes? Does this have something to do with the nutch webapp (nutch.war)? What is the historical genesis of this issue (or is that even relevant)?

The Nutch webapp doesn't have anything to do with it. The limitations in the query syntax have different roots (see above).

-- Best regards, Andrzej Bialecki, Information Retrieval, Semantic Web, Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
NUTCH-479 Support for OR queries - what is this about
I've been reading up on NUTCH-479 Support for OR queries but I must be missing something obvious because I don't understand what the JIRA is about: https://issues.apache.org/jira/browse/NUTCH-479 Description: There have been many requests from users to extend Nutch query syntax to add support for OR queries, in addition to the implicit AND and NOT queries supported now. Ok, so I guess what I don't understand is what is the Nutch query syntax? The main discussion I found on nutch-user is this: http://osdir.com/ml/search.nutch.devel/2004-02/msg7.html I was wondering why the query syntax is so limited. There are no OR queries, there are no fielded queries, or fuzzy, or approximate... Why? The underlying index supports all these operations. I notice by looking at the or.patch file (https://issues.apache.org/jira/secure/attachment/12360659/or.patch) that one of the programs under consideration is: nutch/searcher/Query.java The code for this is distinct from lucene/search/Query.java It looks like this is an architecture issue that I don't understand. If nutch is an extension of lucene, why does it define a different Query class? Why don't we just use the Lucene code to query the indexes? Does this have something to do with the nutch webapp (nutch.war)? What is the historical genesis of this issue (or is that even relevant)?
How best to add sponsored link support..??
Hi all, I've been tasked with looking into this and am not a coder - that said, Nutch is doing great and the bean counters have asked me to look into adding sponsored link results, and I'm wondering how best to add this. It would be nice to utilize the Nutch engine to come up with the pages versus just doing a lookup on words and results in a flat file, but the keyword data could change daily (hourly) and would need to be able to be hand entered (or automated) as people sign up (re-indexing is not really an option).

I'm not sure this would fly within the main Nutch segments and index, but I could see maybe a separate index, or possibly adding a flag to the existing data, but I've not seen any easy-to-use tools to change/update/insert records into what is already there (yes, Luke on the index, but that does not touch the segment data, right?). I don't want to change existing searched data, and I don't see an issue with having duplicate results (sponsored up top and the existing entry down below somewhere), but it would be more elegant to not have that occur. I also see issues in a simple flat-file lookup, as a multiple-word search is best handled inside Nutch to score the results, versus having to do something similar in the sponsored results. I can see the need to control the summary text displayed, and also to pass through any codes in the URL which are currently being stripped during the main crawl/index cycle. I also see issues with seriously customizing the internals, as they would have to be maintained as Nutch itself is updated.

If anyone has looked at this and has at least some ideas on how best to do this, let me know. I need to come up with a preliminary estimate before I can engage and pay the coders to make this happen, so if there are any easy or best-practices ways of doing this, any help/pointers would be appreciated. -- rp
Re: How best to add sponsored link support..??
Let me qualify this - ad banner rotation is dealt with - I'm looking for something that will use our Nutch engine to serve up relevant links from people who pay for that privilege. We do not want to serve up ads from someone else's system, i.e. the big G or Y, but use our own Nutch search results to serve up relevant paying links that we have sold and maintain. In a simple relational SQL world we would add a flag and another table with the links and scores, and look that up and pass it back when needed. The problem with that is that we lose the whole multi-word scoring capability in Nutch; i.e. pizza beer Chicago should serve up a Chicago pizza ad first and beer ads further down, just like our search results have relevancy (not a great example but you get the idea). Re-writing a scoring engine to do that in SQL seems like a waste when Nutch already does it just fine. So in a nutshell - we need to do what the big G and Y and others do when serving up keyword-based sponsor links.

My thought - automate the build of a dummy page with the keywords bought, which would be indexed and served up just like regular crawled and indexed pages, using the scoring to rank them in terms of relevancy and placement - I have not seen any snippets of code to do simple insert/update/delete operations on a Nutch segment or index, however. This is the idea-gathering phase - think like a school/college search engine with local paying advertisers - we want to serve those links up to the searchers to help offset the cost of the service, and serve up or flag links that rank first because of payment, followed by normal search link results. rp

Sean Dean wrote: I might be totally off base with what you're asking to do, but take a look at this open source project: http://phpadsnew.com/two/. It's basically an advertising engine, built on PHP. Integration within any application is a breeze, and it supports external advertising such as Google Ads.
Sean

- Original Message From: RP [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Tuesday, December 19, 2006 10:52:56 AM Subject: How best to add sponsored link support..??
Re: How best to add sponsored link support..??
Are you looking for something like the Google keymatch as described in [1], which was then more or less mimicked in the Nutch web2 module [2], and since also, at least as a lookalike, released on Google Code [3]? -- Sami Siren [1] http://www.google.com/enterprise/mini/end_user_features.html [2] http://svn.apache.org/viewvc/lucene/nutch/trunk/contrib/web2/plugins/web-keymatch/ [3] http://custom-keymatch-onebox.googlecode.com/svn/trunk/Keymatch.java

2006/12/19, RP [EMAIL PROTECTED]: Let me qualify this - ad banner rotation is dealt with - I'm looking for something that will use our Nutch engine to serve up relevant links from people who pay for that privilege. We do not want to serve up ads from someone else's system, i.e. the big G or Y, but use our own Nutch search results to serve up relevant paying links that we have sold and maintain. In a simple relational SQL world we would add a flag and another table with the links and scores, and look that up and pass it back when needed. The problem with that is that we lose the whole multi-word scoring capability in Nutch; i.e. pizza beer Chicago should serve up a Chicago pizza ad first and beer ads further down, just like our search results have relevancy (not a great example but you get the idea). Re-writing a scoring engine to do that in SQL seems like a waste when Nutch already does it just fine. So in a nutshell - we need to do what the big G and Y and others do when serving up keyword-based sponsor links.
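The keymatch idea Sami points to boils down to an editable keyword-to-sponsored-link table consulted at query time, separate from the crawled index - which also answers RP's worry about re-indexing, since the table can change hourly. A hedged, self-contained sketch (the names Keymatch, Link and match are invented here and do not reflect the web-keymatch plugin's actual API):

```java
import java.util.*;

// An editable keyword -> sponsored-link table consulted at query time,
// independent of the crawled index, so entries can change at any moment
// without a re-crawl or re-index.
final class Keymatch {
    record Link(String title, String url) {}
    private final Map<String, Link> table = new HashMap<>();

    void add(String keyword, String title, String url) {
        table.put(keyword.toLowerCase(), new Link(title, url));
    }

    // Return sponsored links whose keyword appears anywhere in the query.
    List<Link> match(String query) {
        List<Link> out = new ArrayList<>();
        for (String tok : query.toLowerCase().split("\\s+")) {
            Link l = table.get(tok);
            if (l != null) out.add(l);
        }
        return out;
    }
}
```

The matched links would then be rendered above (or flagged within) the normal search results, leaving the segments and index untouched.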
Re: How best to add sponsored link support..??
Thanks Sami, This is closer from an initial look - does this do anything on the backend (i.e. defining the data flags so we can get a match) as well, or do we need to build that..??

Sami Siren wrote: Are you looking for something like the Google keymatch as described in [1], which was then more or less mimicked in the Nutch web2 module [2], and since also, at least as a lookalike, released on Google Code [3]? -- Sami Siren [1] http://www.google.com/enterprise/mini/end_user_features.html [2] http://svn.apache.org/viewvc/lucene/nutch/trunk/contrib/web2/plugins/web-keymatch/ [3] http://custom-keymatch-onebox.googlecode.com/svn/trunk/Keymatch.java

2006/12/19, RP [EMAIL PROTECTED]: Let me qualify this - ad banner rotation is dealt with - I'm looking for something that will use our Nutch engine to serve up relevant links from people who pay for that privilege. We do not want to serve up ads from someone else's system, i.e. the big G or Y, but use our own Nutch search results to serve up relevant paying links that we have sold and maintain. In a simple relational SQL world we would add a flag and another table with the links and scores, and look that up and pass it back when needed. The problem with that is that we lose the whole multi-word scoring capability in Nutch; i.e. pizza beer Chicago should serve up a Chicago pizza ad first and beer ads further down, just like our search results have relevancy (not a great example but you get the idea). Re-writing a scoring engine to do that in SQL seems like a waste when Nutch already does it just fine. So in a nutshell - we need to do what the big G and Y and others do when serving up keyword-based sponsor links.
Re: Lucene query support in Nutch
Cristina Belderrain wrote: On 10/9/06, Tomi NA [EMAIL PROTECTED] wrote: This is *exactly* what I was thinking. Like Stefan, I believe the nutch analyzer is a good foundation and should therefore be extended to support the or operator, and possibly additional capabilities when the need arises. t.n.a. Tomi, why would you extend Nutch's analyzer when Lucene's analyzer, which does exactly what you want, is already there?

From what I understood so far in this thread, the Nutch analyser/query-whatever seems to be more targeted and provides additional features regarding distributed search, as well as maybe speed improvements due to its nature, etc. (Correct me if I'm wrong.) One idea that has come up was to offer both as alternatives, so you could use Lucene-based queries if you need their features on the one hand, but can live with restrictions on the other. However, from what has been mentioned so far, it seems that Lucene queries by default can only be on document content (is that right?), not e.g. site:www.example.org. Hmm ...

PS: Thank you all for the help offered so far in this thread on how to get Lucene queries going. Unfortunately I couldn't make much use of just simply extend it here and there ... :-(

Regards, Stefan
Re: Lucene query support in Nutch
2006/10/10, Cristina Belderrain [EMAIL PROTECTED]: On 10/9/06, Tomi NA [EMAIL PROTECTED] wrote: This is *exactly* what I was thinking. Like Stefan, I believe the nutch analyzer is a good foundation and should therefore be extended to support the or operator, and possibly additional capabilities when the need arises. t.n.a. Tomi, why would you extend Nutch's analyzer when Lucene's analyzer, which does exactly what you want, is already there?

Stefan basically answered that question; my opinion is that Nutch's analyzer does its job well, but lacks one obvious query capability: the or search. The fact that several users here need this kind of functionality suggests it's not the beginning of a landslide of new required capabilities. Lucene's analyzer, on the other hand, is completely inadequate in this respect if search is necessarily bound to a single (content) field. In conclusion, my position is pragmatic: I welcome the simplest solution to implement the or search. I just believe that it'd be easiest to do that by extending the nutch Analyzer. t.n.a.
Re: Lucene query support in Nutch
Tomi said: In conclusion, my position is pragmatic: I welcome the simplest solution to implement the or search. I just believe that it'd be easiest to do that extending the nutch Analyzer.

This seems like a very reasonable approach. I too would very much like OR. It would also be nice if it worked in 0.7.2 and I could drop it in, but that may be asking for too much. - Bill

-- Bill Goffe [EMAIL PROTECTED], Department of Economics, SUNY Oswego, 416 Mahar Hall, Oswego, NY 13126; voice: (315) 312-3444, fax: (315) 312-5444, http://cook.rfe.org

Been there. Done that. -- Ed Viesturs as he looked up Mount Everest. He climbed it five times, twice without oxygen. He now plans to be the first American to scale all of the world's 8,000 meter mountains. Climber for the Ages Has Next Peak in View, New York Times, 2/13/00.
Re: Lucene query support in Nutch
2006/10/8, Stefan Neufeind [EMAIL PROTECTED]: if it's not the full feature-set, maybe most people could live with it. But basic boolean queries I think were the root for this topic. Is there an easier way to allow this in Nutch as well instead of throwing quite a bit away and using the Lucene-syntax? As has just been pointed out: It seems quite a few things need to be changed to use Lucene-search instead of a Nutch-search.

This is *exactly* what I was thinking. Like Stefan, I believe the nutch analyzer is a good foundation and should therefore be extended to support the or operator, and possibly additional capabilities when the need arises. t.n.a.
Re: Lucene query support in Nutch
On 10/9/06, Tomi NA [EMAIL PROTECTED] wrote: This is *exactly* what I was thinking. Like Stefan, I believe the nutch analyzer is a good foundation and should therefore be extended to support the or operator, and possibly additional capabilities when the need arises. t.n.a. Tomi, why would you extend Nutch's analyzer when Lucene's analyzer, which does exactly what you want, is already there? Regards, Cristina
Re: Lucene query support in Nutch
Hello, I just would like to confirm that the version of the search() method shown in the previous post works fine, at least regarding boolean queries. Anyway, I see no reason why it wouldn't work with any other Lucene query (fuzzy, proximity, etc.).

Now, please be warned that the inclusion of this new method in IndexSearcher has quite an impact on some other classes: besides NutchBean, where you'll need to add the wrapper methods that will allow its use there, you'll also need to add the new method signature to the Searcher interface, which is implemented by IndexSearcher. Since DistributedSearch implements the Searcher interface as well, you'll need to provide a method with the new signature there. Besides, depending on your needs, Summarizer and Query will demand some changes in order to preserve phrases (composite search terms) when they are highlighted in the summary.

Let me remind you that all this must be done just to provide something that's already there: Nutch is built on top of Lucene, after all. If it's hard to understand why Lucene's capabilities were simply neutralized in Nutch, it's even harder to figure out why no choice was left to users by means of some configuration file. Regards, Cristina
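The ripple effect Cristina describes - a new overload added to a searcher interface must be implemented by every searcher, ideally by delegating to the existing method - can be sketched generically. These are invented stand-in classes, not the actual Nutch Searcher/IndexSearcher/DistributedSearch:

```java
import java.util.*;
import java.util.stream.*;

// The interface gains a second search() signature; every implementor
// (local searcher, distributed searcher, ...) must then supply it.
interface MiniSearcher {
    List<String> search(List<String> terms);
    // New overload: accept a raw query string instead of pre-parsed terms.
    List<String> search(String rawQuery);
}

final class LocalSearcher implements MiniSearcher {
    private final List<String> docs;
    LocalSearcher(List<String> docs) { this.docs = docs; }

    // Return documents containing every term (implicit AND).
    public List<String> search(List<String> terms) {
        return docs.stream()
                   .filter(d -> terms.stream().allMatch(d::contains))
                   .collect(Collectors.toList());
    }

    // The new method delegates to the old one after trivial tokenizing,
    // so only the parsing step differs between the two entry points.
    public List<String> search(String rawQuery) {
        return search(List.of(rawQuery.trim().split("\\s+")));
    }
}
```

Delegation keeps the change cheap in each implementor, but as Cristina notes, the signature still has to be threaded through every class on the call path.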
Re: Lucene query support in Nutch
Hi,

On 07.10.2006 at 17:40, Cristina Belderrain wrote: Let me remind you that all this must be done just to provide something that's already there: Nutch is built on top of Lucene, after all. If it's hard to understand why Lucene's capabilities were simply neutralized in Nutch, it's even harder to figure out why no choice was left to users by means of some configuration file.

I think this issue is rooted in the underlying philosophy of Nutch: Nutch was designed with the idea of a possible Google (and the likes)-sized crawler and indexer in mind. Regular expressions and wildcard queries do not seem to fit into this philosophy, as such queries would be way less efficient on a huge data set than simple boolean queries. Nevertheless, I agree that there should be an option to choose the Lucene query engine instead of the Nutch-flavour one, because Nutch has been proven to be equally suitable for areas which do not require as efficient queries (like intranet crawling, for instance) as an all-out web indexing application.

-- Best regards, Björn Wilmsmann
Re: Lucene query support in Nutch
Nevertheless, I agree that there should be an option to choose the Lucene query engine instead of the Nutch-flavour one, because Nutch has been proven to be equally suitable for areas which do not require as efficient queries (like intranet crawling, for instance) as an all-out web indexing application.

I agree also. Different query parsers could perhaps be made pluggable, or at least configurable. The current (or similar) implementation could be the default one offered, and by configuration one could switch it to intranet mode. Contributions anyone? -- Sami Siren
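A configurable parser switch of the kind Sami suggests might look roughly like this. This is only a sketch under assumed names (QueryParserRegistry and the searcher.query.parser property are invented); Nutch's real plugin mechanism works differently:

```java
import java.util.*;
import java.util.function.Function;

// The parser is chosen by a configuration key, with the restricted
// Nutch-style behaviour as the default. Each "parser" here simply
// reduces a raw query string to a list of terms.
final class QueryParserRegistry {
    private static final Map<String, Function<String, List<String>>> PARSERS = new HashMap<>();
    static {
        // Restricted parsing: lowercase, drop syntax characters, implicit AND.
        PARSERS.put("nutch", q -> Arrays.asList(
            q.toLowerCase().replaceAll("[^a-z0-9\\s]", " ").trim().split("\\s+")));
        // Permissive parsing: keep tokens as typed (a stand-in for full
        // Lucene syntax, which would build a real query tree instead).
        PARSERS.put("lucene", q -> Arrays.asList(q.trim().split("\\s+")));
    }

    static List<String> parse(Properties conf, String query) {
        String mode = conf.getProperty("searcher.query.parser", "nutch");
        Function<String, List<String>> p = PARSERS.get(mode);
        if (p == null) throw new IllegalArgumentException("unknown parser: " + mode);
        return p.apply(query);
    }
}
```

A single configuration property would then flip a deployment between web-scale mode and intranet mode without touching code.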
Re: Lucene query support in Nutch
Björn Wilmsmann wrote: On 07.10.2006 at 17:40, Cristina Belderrain wrote: Let me remind you that all this must be done just to provide something that's already there: Nutch is built on top of Lucene, after all. If it's hard to understand why Lucene's capabilities were simply neutralized in Nutch, it's even harder to figure out why no choice was left to users by means of some configuration file. I think this issue is rooted in the underlying philosophy of Nutch: Nutch was designed with the idea of a possible Google (and the likes)-sized crawler and indexer in mind. Regular expressions and wildcard queries do not seem to fit into this philosophy, as such queries would be way less efficient on a huge data set than simple boolean queries. Nevertheless, I agree that there should be an option to choose the Lucene query engine instead of the Nutch-flavour one, because Nutch has been proven to be equally suitable for areas which do not require as efficient queries (like intranet crawling, for instance) as an all-out web indexing application.

Hi, if it's not the full feature-set, maybe most people could live with it. But basic boolean queries, I think, were the root of this topic. Is there an easier way to allow this in Nutch as well, instead of throwing quite a bit away and using the Lucene syntax? As has just been pointed out, it seems quite a few things need to be changed to use Lucene search instead of Nutch search. I don't think that it's needed in most cases. But I see several reasons where a boolean query would make sense. (Currently I fetch up to 10,000 or so results using opensearch and filter them in a script myself, since no AND (site:... or site:...) is yet possible.) Regards, Stefan
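Stefan's script-side workaround - over-fetching and then filtering by site outside of Nutch - can be sketched as follows. SiteOrFilter is a hypothetical helper, not Nutch code, and matching by URL substring is a simplification of proper host matching:

```java
import java.util.*;
import java.util.stream.*;

// Keep only hits whose URL mentions one of the wanted sites - an OR over
// sites that the restricted query syntax itself cannot express, applied
// client-side to an over-fetched result list.
final class SiteOrFilter {
    static List<String> filterBySites(List<String> urls, Set<String> sites) {
        return urls.stream()
                   .filter(u -> sites.stream().anyMatch(u::contains))
                   .collect(Collectors.toList());
    }
}
```

The obvious cost is that thousands of results must be transferred and scanned to emulate a clause the index could evaluate directly.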
Re: Lucene query support in Nutch
Hi, yes, I guess having the full strength of Lucene-based queries would be nice. That would also solve the boolean-queries question I had a few days ago :-) Ravi, doesn't Lucene also allow querying of other fields? Is there any possibility to add that feature to your proposal? In general: what is the advantage of the current Nutch parser over the Lucene-based one? Regards, Stefan Ravi Chintakunta wrote: Hi Cristina, You can achieve this by modifying the IndexSearcher to take the query String as an argument and then use org.apache.lucene.queryParser.QueryParser's parse(String) method to parse the query string. The modified method in IndexSearcher would look as below:

public Hits search(String queryString, int numHits, String dedupField,
    String sortField, boolean reverse) throws IOException {
  org.apache.lucene.queryParser.QueryParser parser =
      new org.apache.lucene.queryParser.QueryParser("content",
          new org.apache.lucene.analysis.standard.StandardAnalyzer());
  org.apache.lucene.search.Query luceneQuery = parser.parse(queryString);
  return translateHits(
      optimizer.optimize(luceneQuery, luceneSearcher, numHits, sortField, reverse),
      dedupField, sortField);
}

For this you have to modify the code in search.jsp and NutchBean too, so that you are passing on the raw query string to IndexSearcher. Note that with this approach, you are limiting the search to the content field. - Ravi Chintakunta On 10/4/06, Cristina Belderrain [EMAIL PROTECTED] wrote: Hello, we all know that Lucene supports, among others, boolean queries. Even though Nutch is built on Lucene, boolean clauses are removed by Nutch filters, so boolean queries end up as flat queries where terms are implicitly connected by an OR operator, as far as I can see. Is there any simple way to turn off the filtering so a boolean query remains as such after it is submitted to Nutch?
Just in case a simple way doesn't exist, Ravi Chintakunta suggests the following workaround: We have to modify the analyzer and add more plugins to Nutch to use the Lucene's query syntax. Or we have to directly use Lucene's Query Parser. I tried the second approach by modifying org.apache.nutch.searcher.IndexSearcher and that seems to work. Can anyone please elaborate on what Ravi actually means by modifying org.apache.nutch.searcher.IndexSearcher? Which methods are supposed to be modified and how? It would be really nice to know how to do this. I believe many other Nutch users would also benefit from an answer to this question. Thanks so much, Cristina
Re: Lucene query support in Nutch
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hi everybody, On 05/10/2006 05:44 Ravi Chintakunta wrote:

public Hits search(String queryString, int numHits, String dedupField,
    String sortField, boolean reverse) throws IOException {
  org.apache.lucene.queryParser.QueryParser parser =
      new org.apache.lucene.queryParser.QueryParser("content",
          new org.apache.lucene.analysis.standard.StandardAnalyzer());
  org.apache.lucene.search.Query luceneQuery = parser.parse(queryString);
  return translateHits(
      optimizer.optimize(luceneQuery, luceneSearcher, numHits, sortField, reverse),
      dedupField, sortField);
}

This seems to be a good approach. I have not yet tried it out in detail; however, the method optimize() in LuceneQueryOptimizer only takes a BooleanQuery as an argument, so the line 'return translateHits...' would cause a compile error, wouldn't it? - -- Best regards, Björn Wilmsmann -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.1 (Darwin) iD8DBQFFJV9Fgz0R1bg11MERAt3sAJ4pKJ8voEhWSo+94SI6bam4iVPYgACbBQmm sFAZIcCv3CoIBJC5g8FbOyo= =vzdw -END PGP SIGNATURE-
Re: Lucene query support in Nutch
Hi Björn, yes, the error you point out will happen indeed... A possible workaround would be:

public Hits search(String queryString, int numHits, String dedupField,
    String sortField, boolean reverse) throws IOException {
  org.apache.lucene.queryParser.QueryParser parser =
      new org.apache.lucene.queryParser.QueryParser("content",
          new org.apache.lucene.analysis.standard.StandardAnalyzer());
  org.apache.lucene.search.Query luceneQuery = null;
  try {
    luceneQuery = parser.parse(queryString);
  } catch (Exception ex) {
    // don't swallow the parse failure, or the add() below throws NPE
    throw new IOException("Cannot parse query: " + ex.getMessage());
  }
  org.apache.lucene.search.BooleanQuery boolQuery =
      new org.apache.lucene.search.BooleanQuery();
  boolQuery.add(luceneQuery,
      org.apache.lucene.search.BooleanClause.Occur.MUST);
  return translateHits(
      optimizer.optimize(boolQuery, luceneSearcher, numHits, sortField, reverse),
      dedupField, sortField);
}

Please notice that I'm not sure this will work as it should: right now, it just compiles... I still need to modify the NutchBean class so it can pass on the raw query, as Ravi says. Regards, Cristina On 10/5/06, Björn Wilmsmann [EMAIL PROTECTED] wrote: This seems to be a good approach. I have not yet tried it out in detail; however, the method optimize() in LuceneQueryOptimizer only takes a BooleanQuery as an argument, so the line 'return translateHits...' would cause a compile error, wouldn't it?
OpenOffice Support?
Just wondering, has anyone done any work on a plugin (or is anyone aware of a plugin) that supports indexing OpenOffice documents? Thanks. Matt
Re: OpenOffice Support?
Taking advantage of your question: does anyone know whether version 0.7.2 of Nutch supports the zip plugin? If so, where can I find it? Lourival Junior On 7/11/06, Matthew Holt [EMAIL PROTECTED] wrote: Just wondering, has anyone done any work on a plugin (or aware of a plugin) that supports the indexing of open office documents? Thanks. Matt -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
Re: Add Wyona to the wiki support page?
Renaud Richardet wrote: Hello Nutch, My name is Renaud Richardet and I am the COO of Wyona LLC. We are offering Nutch and Lucene support (http://wyona.com/lucene.html), and I was wondering if I could add our company to http://wiki.apache.org/nutch/Support. That would be great. Certainly, you can add a short note about your company on the support page. It's a Wiki, so you can just create an account, log in, and edit this page (please use the preview button to check the changes before saving). -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Add Wyona to the wiki support page?
The funny thing about that wiki page (and some others in that area) is that they apparently use the nofollow tags. Given the topic of that wiki, isn't that a bit odd? I personally dislike the nofollow tag and think it should be used only in extreme circumstances (i.e. here's a link to a site you absolutely don't want to visit). I believe in this case however it's simply being used so that sites that are listed don't get any pagerank/weight/whatever passed to them from an authority site. A really bizarre policy for a search related site IMO. Swinging back on topic, does nutch obey the nofollow tags? g. Andrzej Bialecki wrote: Renaud Richardet wrote: Hello Nutch, My name is Renaud Richardet and I am the COO of Wyona LLC. We are offering Nutch and Lucene support (http://wyona.com/lucene.html), and I was wondering if I could add our company to http://wiki.apache.org/nutch/Support. That would be great. Certainly, you can add a short note about your company on the support page. It's a Wiki, so you can just create an account, log in, and edit this page (please use the preview button to check the changes before saving).
Re: Add Wyona to the wiki support page?
Insurance Squared Inc. wrote: The funny thing about that wiki page (and some others in that area) is that they apparently use the nofollow tags. Given the topic of that wiki, isn't that a bit odd? I personally dislike the nofollow tag and think it should be used only in extreme circumstances (i.e. here's a link to a site you absolutely don't want to visit). I believe in this case however it's simply being used so that sites that are listed don't get any pagerank/weight/whatever passed to them from an authority site. A really bizarre policy for a search related site IMO. I think it's a default setting for the Wiki, which nobody bothered to change... Swinging back on topic, does nutch obey the nofollow tags? Yes. Please see the HtmlParser and HTMLMetaTags classes for details. -- Best regards, Andrzej Bialecki
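For readers curious what obeying nofollow involves, a rough standalone sketch (not Nutch's actual HtmlParser/HTMLMetaTags code; the class name, the fixed attribute order, and the simplified quoting are assumptions of mine) of detecting a robots meta tag whose content list includes nofollow:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NoFollowDemo {
    // Match <meta name="robots" content="..."> and capture the
    // comma-separated directive list. Simplification: attributes are
    // assumed to appear in this order and to be quoted.
    static final Pattern ROBOTS_META = Pattern.compile(
        "<meta\\s+name=[\"']robots[\"']\\s+content=[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    static boolean hasNoFollow(String html) {
        Matcher m = ROBOTS_META.matcher(html);
        while (m.find()) {
            for (String directive : m.group(1).split(",")) {
                if (directive.trim().equalsIgnoreCase("nofollow")) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasNoFollow(
            "<meta name=\"robots\" content=\"noindex,nofollow\">")); // true
        System.out.println(hasNoFollow(
            "<meta name=\"robots\" content=\"index,follow\">"));     // false
    }
}
```

Note that the wiki page in question uses the per-link rel="nofollow" attribute rather than the page-level meta tag; the meta form shown here is the one HTMLMetaTags is concerned with.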
Re: Add Wyona to the wiki support page?
Well, so much for knee-jerk suspicions as to intent. No need to look for conspiracy theories when default settings are more likely to be the cause. That should probably be a corollary to Occam's razor or something :). Andrzej Bialecki wrote: I think it's a default setting for the Wiki, which nobody bothered to change... Swinging back on topic, does nutch obey the nofollow tags? Yes. Please see HtmlParser and HTMLMetaTags classes for details.
Re: Full fledged Lucene Query Syntax support in Nutch
Performance might be a reason, but only the queries that include wildcards or fuzzy characters would be slowed down, not all queries, right? The performance of regular plain-text searches shouldn't be affected. Any thoughts? Thanks, Ravi Chintakunta On 5/3/06, Ravish Bhagdev [EMAIL PROTECTED] wrote: The reason is performance. Allowing the above means a more complex query, which causes more delay in getting results. If you need these features, you know how to get them, but it's a tradeoff with performance. Maybe not if the number of pages is small, but it will be on a large scale. -- Ravish. On 5/2/06, Ravi Chintakunta [EMAIL PROTECTED] wrote: Lucene supports fuzzy, wildcard, range, proximity searches as listed here: http://lucene.apache.org/java/docs/queryparsersyntax.html But Nutch does not use all these capabilities. It is limited by query parsing in org.apache.nutch.analysis.NutchAnalysis and the query filters hosted in plugins. We have to modify the analyzer and add more plugins to Nutch to use the Lucene's query syntax. Or we have to directly use Lucene's Query Parser. I tried the second approach by modifying org.apache.nutch.searcher.IndexSearcher and that seems to work. Is there a reason that Nutch does not support the entire Lucene query syntax by default? Thanks in advance, Ravi Chintakunta
Full fledged Lucene Query Syntax support in Nutch
Lucene supports fuzzy, wildcard, range, proximity searches as listed here: http://lucene.apache.org/java/docs/queryparsersyntax.html But Nutch does not use all these capabilities. It is limited by query parsing in org.apache.nutch.analysis.NutchAnalysis and the query filters hosted in plugins. We have to modify the analyzer and add more plugins to Nutch to use the Lucene's query syntax. Or we have to directly use Lucene's Query Parser. I tried the second approach by modifying org.apache.nutch.searcher.IndexSearcher and that seems to work. Is there a reason that Nutch does not support the entire Lucene query syntax by default? Thanks in advance, Ravi Chintakunta
HTTPS support?
Hi, Does Nutch 0.8 support https fetches? If not, are there any active efforts to support it? TIA, David Odmark
Re: HTTPS support?
David Odmark wrote: Hi, Does Nutch 0.8 support https fetches? If not, are there any active efforts to support it? It does, using the protocol-httpclient plugin. -- Best regards, Andrzej Bialecki
Nutch doesn't support Korean?
I was browsing NutchAnalysis.jj and found that Hangul Syllables (U+AC00 ... U+D7AF; U+ denotes a Unicode code point given in hex) are not part of the LETTER or CJK class. This suggests that Nutch cannot handle Korean documents at all. Is anybody successfully using Nutch for Korean? -kuro
Re: Nutch doesn't support Korean?
Hello, There was a similar issue with Lucene's StandardTokenizer.jj. See http://issues.apache.org/jira/browse/LUCENE-444 and http://issues.apache.org/jira/browse/LUCENE-461 I have almost no experience with Nutch, but you can handle it like those issues above. On 3/4/06, Teruhiko Kurosaka [EMAIL PROTECTED] wrote: I was browsing NutchAnalysis.jj and found that Hangul Syllables (U+AC00 ... U+D7AF; U+ denotes a Unicode code point given in hex) are not part of the LETTER or CJK class. This suggests that Nutch cannot handle Korean documents at all. Is anybody successfully using Nutch for Korean? -kuro -- Cheolgoo
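The range check itself is trivial; here is a minimal standalone sketch (class and method names are my own, not from NutchAnalysis.jj) of testing whether a character falls in the Hangul Syllables block (U+AC00..U+D7AF) that the grammar reportedly omits:

```java
public class HangulCheck {
    // True if c lies in the Hangul Syllables block (U+AC00..U+D7AF),
    // the range the tokenizer grammar reportedly leaves out of its
    // LETTER and CJK character classes.
    static boolean isHangulSyllable(char c) {
        return c >= '\uAC00' && c <= '\uD7AF';
    }

    public static void main(String[] args) {
        System.out.println(isHangulSyllable('\uD55C')); // U+D55C, a Hangul syllable: true
        System.out.println(isHangulSyllable('a'));      // false
    }
}
```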
xquery support for nutch
Hi, It would be great if we provided XQuery support in Nutch, where expressions like 3 + 4 = 7 could be evaluated. See http://www.xml.com/pub/a/2002/10/16/xquery.html It is just an idea, but it would probably make Nutch a more universal tool. Rgds Prabhu
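As a toy illustration only of the kind of evaluation described (this is not XQuery and not part of Nutch; the class name and the fixed "int + int = int" form are assumptions for the sketch):

```java
public class ExprCheck {
    // Evaluate a tiny fixed-form expression "a + b = c" and report
    // whether the equality holds. A real XQuery engine would parse
    // a full grammar; this only handles one addition and one equals.
    static boolean holds(String expr) {
        String[] sides = expr.split("=");
        String[] terms = sides[0].split("\\+");
        int sum = Integer.parseInt(terms[0].trim())
                + Integer.parseInt(terms[1].trim());
        return sum == Integer.parseInt(sides[1].trim());
    }

    public static void main(String[] args) {
        System.out.println(holds("3 + 4 = 7")); // true
    }
}
```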
Single NutchBean and multiple indices support
Hi there. I am facing the same question and looking for the same solution. Your solution seems easy:) My question is what file system the application runs on? LocalFileSystem or DistributedFileSystem? Thanks /Jack On 2/9/06, Ravi Chintakunta [EMAIL PROTECTED] wrote: Hi David, Thanks for your reply. After posting the question, I have done this in a more optimal way.
- I used only a single NutchBean and modified it so that the search method takes the indices being searched as an argument. This single NutchBean creates separate IndexReaders on the merged indices in the directories and keeps them in a map.
- Based on the indexes that are searched, NutchBean creates an IndexSearcher using the appropriate IndexReaders. I have added a constructor to IndexSearcher that takes an array of IndexReaders and uses a MultiReader to initialize itself.
- The NutchBean creates a single FetchedSegments with the combination of the segments directories in all the directories.
The advantages with this are:
- A single IndexReader for an index - so no additional file handles are created.
- No opening / closing of readers or segments - this improves performance.
- Ravi Chintakunta
This is almost exactly what I've done. I create a new NutchBean for each search, and point it at whichever of 9 subdirectories the user has selected, because I really don't want 511 (2^9-1) beans hanging around. The reason for the 'Too many open files' error is that the NutchBean doesn't clean up after itself - I guess because for most people, the NutchBean is going to be reused. I added a close() method to FetchedSegments.Segment in my installation, to close all the readers. I added a closeSegments() method to NutchBean, to call close() on each segment that's been opened. Then I call closeSegments() after each search. I realise that NutchBean really wasn't designed to support being instantiated once per search, but I don't care. It works well, and performance is not an issue. Regards, David.
Date: Mon, 6 Feb 2006 20:59:34 -0500 From: Ravi Chintakunta [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Subject: [Nutch-general] Dynamic merging of indices Reply-To: [EMAIL PROTECTED] I have multiple indices for the crawls across various intranet sites stored in separate folders. My search application should support searching across one or more of these indices dynamically - by way of checkboxes on the web page. For this, I have modified NutchBean to create the IndexSearcher and FetchedSegments from the segments directory (not the merged index directory) in these folders. Based on the selected intranet sites, a NutchBean is instantiated for the indices of the selected sites and the results are displayed. With this I had the 'Too many open files' error and have increased the open files limit. This seems to work well now. But if I have 5 such sites, then I am opening 2^5 = 32 times more files than I would have opened. My question is: Is there a better way of doing this? Like:
- Can I open an IndexReader on each of the merged index directories and dynamically create an IndexSearcher by merging these readers using MultiReader?
- Is an IndexReader thread safe, and can it be used simultaneously in different IndexSearchers?
- Can I create the IndexReader on the merged index directory and create the corresponding FetchedSegments on the corresponding non-merged segments directory?
Thanks Ravi Chintakunta This email may contain legally privileged information and is intended only for the addressee. It is not necessarily the official view or communication of the New Zealand Qualifications Authority. If you are not the intended recipient you must not use, disclose, copy or distribute this email or information in it. If you have received this email in error, please contact the sender immediately. NZQA does not accept any liability for changes made to this email or attachments after sending by NZQA. All emails have been scanned for viruses and content by MailMarshal.
NZQA reserves the right to monitor all email communications through its network. -- Keep Discovering ... ... http://www.jroller.com/page/jmars
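The arithmetic behind the file-handle blow-up discussed in this thread, as a standalone sketch (illustrative only; the class name is mine): with n independently selectable indices there are 2^n - 1 non-empty selections, so pre-building one bean per combination grows exponentially, which is why a single bean over per-index IndexReaders combined via MultiReader scales better.

```java
public class BeanCombinations {
    // Number of non-empty subsets of n checkbox-selectable indices:
    // 2^n - 1. One NutchBean per combination therefore grows
    // exponentially with the number of sites.
    static long combinations(int n) {
        return (1L << n) - 1;
    }

    public static void main(String[] args) {
        System.out.println(combinations(5)); // 31 combinations for 5 sites
        System.out.println(combinations(9)); // the 511 beans mentioned above
    }
}
```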
Re: Single NutchBean and multiple indices support
Hi Jack, It runs on a local file system. - Ravi Chintakunta On 2/15/06, Jack Tang [EMAIL PROTECTED] wrote: Hi there. I am facing the same question and looking for the same solution. Your solution seems easy:) My question is what file system the application runs on? LocalFileSystem or DistributedFileSystem? Thanks /Jack
Re: Which version of rss does parse-rss plugin support?
Hi, the contentTitle will be a concatenation of the titles of the RSS Channels that we've parsed. So the titles of the RSS Channels are what is delivered for indexing, right? They're certainly part of it, but not the only part. The concatenation of the titles of the RSS Channels is what is delivered for the title portion of indexing. If I want the indexer to include more information about an RSS file (such as item descriptions), can I just concatenate them to the contentTitle? They're already there. There is a variable called index text: ultimately that variable includes the item descriptions, along with the channel descriptions. That, along with the title portion of indexing, is the full set of textual data delivered by the parser for indexing. So, it already includes that information. Check out lines 137 and 161 in the parser to see what I mean. Also, check out lines 204-207, which are:

ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS,
    contentTitle.toString(), outlinks, content.getMetadata());
parseData.setConf(this.conf);
return new ParseImpl(indexText.toString(), parseData);

You can see that the return from the Parser, i.e., the ParseImpl, includes both the indexText and the parse data (which contains the title text). Now, if you wanted to add any other metadata gleaned from the RSS to the title text, or the content text, you can always modify the code to do that in your own environment. The RSS Parser plugin returns a full channel model and item model that can be extended and used for those purposes. Hope that helps! Cheers, Chris On 06-2-6, Chris Mattmann [EMAIL PROTECTED] wrote: Hi there, That should work: however, the biggest problem will be making sure that text/xml is actually the content type of the RSS that you are parsing, which you'll have little or no control over. Check out this previous post of mine on the list to get a better idea of what the real issue is: http://www.nabble.com/Re:-Crawling-blogs-and-RSS-p1153844.html G'luck!
Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 Phone: 818-354-8810 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology. -Original Message- From: 盖世豪侠 [mailto:[EMAIL PROTECTED] Sent: Saturday, February 04, 2006 11:40 PM To: nutch-user@lucene.apache.org Subject: Re: Which version of rss does parse-rss plugin support? Hi Chris How do I change the plugin.xml? For example, if I want to crawl RSS files ending with xml, do I just add a new element?

<implementation id="org.apache.nutch.parse.rss.RSSParser"
                class="org.apache.nutch.parse.rss.RSSParser"
                contentType="application/rss+xml"
                pathSuffix="rss"/>
<implementation id="org.apache.nutch.parse.rss.RSSParser"
                class="org.apache.nutch.parse.rss.RSSParser"
                contentType="application/rss+xml"
                pathSuffix="xml"/>

Am I right? On 06-2-3, Chris Mattmann [EMAIL PROTECTED] wrote: Hi there, Sure it will, you just have to configure it to do that. Pop over to $NUTCH_HOME/src/plugin/parse-rss/ and open up plugin.xml. In there there is an attribute called pathSuffix. Change that to handle whatever type of rss file you want to crawl. That will work locally. For web-based crawls, you need to make sure that the content type being returned for your RSS content matches the content type specified in the plugin.xml file that parse-rss claims to support. Note that you might not have * a lot * of success with being able to control the content type for rss files returned by web servers. I've seen a LOT of inconsistency among the way that they're configured by the administrators, etc. However, just to let you know, there are some people in the group that are working on a solution to addressing this. Hope that helps.
Cheers, Chris On 2/3/06 7:16 AM, 盖世豪侠 [EMAIL PROTECTED] wrote: Hi Chris, The files of RSS 1.0 have a suffix of rdf. So will the parser automatically recognize them as RSS files? On 06-2-3, Chris Mattmann [EMAIL PROTECTED] wrote: Hi there, parse-rss is based on commons-feedparser (http://jakarta.apache.org/commons/sandbox/feedparser). From the feedparser website: ...commons-feedparser supports all versions of RSS (0.9, 0.91, 0.92, 1.0, and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc extension and RSS 1.0 modules capability... Hope that helps
Re: Which version of rss does parse-rss plugin support?
According to the code:

theOutlinks.add(new Outlink(r.getLink(), r.getDescription()));

I can see that the item description is also included. However, when I tried with this feed: http://kgrimm.bravejournal.com/feed.rss I can only get the title and description for the channel, and failed to search the words in the item description. From the above code, the item description is combined with the outlink URL; is it used as the contentTitle for that URL? When the outlink is fetched and parsed, I think new data about that URL will be generated. On 06-2-11, Chris Mattmann [EMAIL PROTECTED] wrote: Hi, the contentTitle will be a concatenation of the titles of the RSS Channels that we've parsed.
opensearch support
Is OpenSearch being developed? I am using Nutch 0.7 and it seems to have some OpenSearch support. However, I failed to get either a Python or Perl OpenSearch client library to work (admittedly these are also in early development). The Perl library seemed to choke on not finding the OpenSearchDescription; I didn't have enough time to investigate. I can, of course, just post the query and parse the XML search results manually. Thanks, Geraint
RE: Which version of rss does parse-rss plugin support?
Hi there, That should work; however, the biggest problem will be making sure that text/xml is actually the content type of the RSS that you are parsing, which you'll have little or no control over. Check out this previous post of mine on the list to get a better idea of what the real issue is: http://www.nabble.com/Re:-Crawling-blogs-and-RSS-p1153844.html G'luck! Cheers, Chris

Chris A. Mattmann, Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
Jet Propulsion Laboratory, Pasadena, CA
Office: 171-266B, Mailstop: 171-246, Phone: 818-354-8810
Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Which version of rss does parse-rss plugin support?
Hi Chris, Thank you for your post; I've read it through. So, you mean I should also add these lines to the plugin.xml in most cases:

<implementation id="org.apache.nutch.parse.rss.RSSParser"
    class="org.apache.nutch.parse.rss.RSSParser"
    contentType="application/rss+xml"
    pathSuffix="rss"/>
...
<implementation id="org.apache.nutch.parse.rss.RSSParser"
    class="org.apache.nutch.parse.rss.RSSParser"
    contentType="text/xml"
    pathSuffix="xml"/>
<implementation id="org.apache.nutch.parse.rss.RSSParser"
    class="org.apache.nutch.parse.rss.RSSParser"
    contentType="text/xml"
    pathSuffix="rss"/>
Re: Which version of rss does parse-rss plugin support?
Hi Chris, How do I change the plugin.xml? For example, if I want to crawl RSS files ending with xml, do I just add a new element?

<implementation id="org.apache.nutch.parse.rss.RSSParser"
    class="org.apache.nutch.parse.rss.RSSParser"
    contentType="application/rss+xml"
    pathSuffix="rss"/>
<implementation id="org.apache.nutch.parse.rss.RSSParser"
    class="org.apache.nutch.parse.rss.RSSParser"
    contentType="application/rss+xml"
    pathSuffix="xml"/>

Am I right? On 06-2-3, Chris Mattmann wrote: Hi there, Sure it will, you just have to configure it to do that. Pop over to $NUTCH_HOME/src/plugin/parse-rss/ and open up plugin.xml. In there, there is an attribute called pathSuffix. Change that to handle whatever type of rss file you want to crawl. That will work locally. For web-based crawls, you need to make sure that the content type being returned for your RSS content matches the content type specified in the plugin.xml file that parse-rss claims to support. Note that you might not have *a lot* of success with being able to control the content type for rss files returned by web servers. I've seen a LOT of inconsistency in the way that they're configured by the administrators, etc. However, just to let you know, there are some people in the group that are working on a solution to addressing this. Hope that helps. Cheers, Chris
Which version of rss does parse-rss plugin support?
I see the test file is of version 0.91. Does the plugin support higher versions like 1.0 or 2.0?
Re: Which version of rss does parse-rss plugin support?
Hi there, parse-rss is based on commons-feedparser (http://jakarta.apache.org/commons/sandbox/feedparser). From the feedparser website: "...commons-feedparser supports all versions of RSS (0.9, 0.91, 0.92, 1.0, and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc extension and RSS 1.0 modules capability..." Hope that helps. Thanks, Chris On 2/3/06 6:46 AM, 盖世豪侠 wrote: I see the test file is of version 0.91. Does the plugin support higher versions like 1.0 or 2.0?
Re: Which version of rss does parse-rss plugin support?
Hi Chris, The files of RSS 1.0 have a postfix of rdf. So will the parser recognize it automatically as an RSS file?
Multi CPU support
Can I use MapReduce to run Nutch on a multi-CPU system? I want to run the index job on two (or four) CPUs on a single system; I'm not trying to distribute the job over multiple systems. If MapReduce is the way to go, do I just specify config parameters like these:

mapred.tasktracker.tasks.maximum=2
mapred.job.tracker=localhost:9001
mapred.reduce.tasks=2 (or 1?)

and run bin/start-all.sh? Must I use NDFS for MapReduce? Do I need to do anything else to make sure that the two processes run on different CPUs? Is this the only way to take advantage of a multi-CPU system? -kuro
Re: Multi CPU support
Teruhiko Kurosaka wrote: Can I use MapReduce to run Nutch on a multi-CPU system? Yes. I want to run the index job on two (or four) CPUs on a single system. I'm not trying to distribute the job over multiple systems. If MapReduce is the way to go, do I just specify config parameters like these: mapred.tasktracker.tasks.maximum=2, mapred.job.tracker=localhost:9001, mapred.reduce.tasks=2 (or 1?) and bin/start-all.sh? That should work. You'd probably want to set the default number of map tasks to be a multiple of the number of CPUs, and the number of reduce tasks to be exactly the number of CPUs. Don't use start-all.sh, but rather just: bin/nutch-daemon.sh start tasktracker and bin/nutch-daemon.sh start jobtracker. Must I use NDFS for MapReduce? No. Doug
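Putting Doug's advice into a config sketch for a 2-CPU box, using the property names quoted in the question (verify them against the nutch-default.xml that ships with your version, since the MapReduce property names were still in flux at the time):

```xml
<nutch-conf>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <!-- at most one task per CPU on this tasktracker -->
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <!-- a small multiple of the CPU count -->
    <name>mapred.map.tasks</name>
    <value>4</value>
  </property>
  <property>
    <!-- exactly the CPU count -->
    <name>mapred.reduce.tasks</name>
    <value>2</value>
  </property>
</nutch-conf>
```

Then start the daemons individually with bin/nutch-daemon.sh start jobtracker and bin/nutch-daemon.sh start tasktracker rather than start-all.sh.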
multibyte character support status
What is the current state and plan for multibyte character support in Nutch? As far as I can tell... The PDF plugin uses PDFBox (www.pdfbox.org), which does not work with Japanese and probably other multibyte characters and code sets. The Word plugin uses POI (http://jakarta.apache.org/poi/), which doesn't seem to support Japanese. Some patches to make it possible to support Japanese (and hopefully other code sets) have been submitted to the POI project, but they have not been integrated because the project currently has no committer. The RTF document plugin and PowerPoint plugin use home-grown parsers. What is the status of multibyte code set (and single-byte code sets other than ISO-8859-1) support in these plugins? -Kuro
Re: PDF indexing support?
Thanks, it worked. Jérôme Charron wrote: The value you specified is bigger than the maximum int value, so it throws an exception, and then the default value is used. As mentioned in the property's description, use a negative value (-1) for no truncation at all (or a value less than java.lang.Integer.MAX_VALUE). Regards, Jérôme On 11/16/05, Håvard W. Kongsgård wrote: Have now added conf/nutch-site.xml but still the same problem. Related to the problem? http://sourceforge.net/forum/message.php?msg_id=3391668 http://sourceforge.net/forum/message.php?msg_id=3398773

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<nutch-conf>
  <property>
    <name>http.content.limit</name>
    <value>45451515565536</value>
    <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all.</description>
  </property>
</nutch-conf>

Håvard W. Kongsgård wrote: HTTP. Sébastien LE CALLONNEC wrote: Hej Håvard, That's because you have to create one yourself. The values you set in there will override the default values. Here are a few more questions to try to solve your problem: where is your PDF located? What protocol is used to fetch it (HTTP, FTP, etc.)? Regards, /sebastien --- Håvard W. Kongsgård wrote: Don't have a conf/nutch-site.xml. Jérôme Charron wrote: conf/nutch-default. Check that they are not overridden in conf/nutch-site. If not, sorry, no more ideas for now :-( Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
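Per Jérôme's explanation, the failing value simply overflows a Java int. A minimal nutch-site.xml sketch that disables truncation instead (the <nutch-conf> root element matches the 0.7-era config file quoted above):

```xml
<?xml version="1.0"?>
<nutch-conf>
  <property>
    <name>http.content.limit</name>
    <!-- any negative value disables truncation; a nonnegative value
         must fit in a Java int (at most 2147483647) -->
    <value>-1</value>
  </property>
</nutch-conf>
```

The same reasoning applies to file.content.limit for local files.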
Re: PDF indexing support?
On Nov 15, 2005, at 2:46 PM, Håvard W. Kongsgård wrote: Don't have a conf/nutch-site.xml. Create it and put the overrides in there, per the Nutch tutorial. Cheers, Hasan Diwan
Re: PDF indexing support?
Verify that you have the very latest PDFBox from their website; a lot of people find PDFBox a little bit buggy. Stefan On 15.11.2005 at 22:27, Håvard W. Kongsgård wrote: Nutch won't index some of my PDF files. I get this error: reason: failed(2,202): Content truncated at 66608 bytes. Parser can't handle incomplete pdf file. Is there a bug in the pdf plugin (PDFBox)? I am using Nutch 0.7.1. I know from experience that some pdf-to-text programs like xpdf have some problems with PDF v1.6 (Adobe Acrobat 7/CS). Jérôme Charron wrote: Hello, I'm new with Nutch; how do I enable PDF indexing support? Simply by activating the parse-pdf plugin in nutch-default.xml or nutch-site.xml (take a look at the plugin.includes property). Jérôme --- company: http://www.media-style.com forum: http://www.text-mining.org blog: http://www.find23.net
Re: PDF indexing support?
conf/nutch-default. Jérôme Charron wrote: http.content.limit=542256565536 and file.content.limit=4541165536, still the same error: where do you specify these values? In nutch-default or nutch-site? Jérôme
Re: PDF indexing support?
conf/nutch-default. Check that they are not overridden in conf/nutch-site. If not, sorry, no more ideas for now :-( Jérôme
Re: PDF indexing support?
Don't have a conf/nutch-site.xml. Jérôme Charron wrote: conf/nutch-default. Check that they are not overridden in conf/nutch-site. If not, sorry, no more ideas for now :-( Jérôme
PDF indexing support?
Hello, I'm new with Nutch; how do I enable PDF indexing support?
PDF support? Does crawl parse p
Does Nutch have a way to parse pdf files, that is, application/pdf content type files? I noticed a plugin variable setting in default.properties: plugin.pdf=org.apache.nutch.parse.pdf* I never changed this file. Is that the right value? I am using Nutch 0.7. What do I have to do to make it parse pdf files? When I do the crawl, I get this error with application/pdf files: 050831 145126 fetch okay, but can't parse mainurl/research/126900/126969/126969.pdf, reason: failed(2,203): Content-Type not text/html: application/pdf If it's not possible, what future version of Nutch do the developers expect to support application/pdf types and have such parsing of pdf files available? Diane Palla, Web Services Developer, Seton Hall University, 973 313-6199 Bryan Woliner wrote on 08/23/2005 05:22 PM, subject: Adding small batches of fetched URLs to a larger aggregate segment/index: Hi, I have a number of sites that I want to crawl, then merge their segments and create a single index. One of the main reasons I want to do this is that I want some of the sites in my index to be crawled on a daily basis, others on a weekly basis, etc. Each time I re-crawl a site, I want to add the fetched URLs to a single aggregate segment/index. I have a couple of questions about doing this: 1. Is it possible to use a different regex.urlfilter.txt file for each site that I am crawling? If so, how would I do this? 2. If I have a very large segment that is indexed (my aggregate index) and I want to add another (much smaller) set of fetched URLs to this index, what is the best way to do this? It seems like merging the small and large segments and then re-indexing the whole thing would be very time consuming, especially if I wanted to add new small sets of fetched URLs frequently. Thanks for any suggestions you have to offer, Bryan
Re: PDF support? Does crawl parse p
Hello Diane, There is a plugin to parse pdf files. You have to enable it in nutch-site.xml (just copy the entry from nutch-default.xml). You have to change the plugin.includes property to include the parse-pdf plugin, from [...] parse-(text|html) [...] to [...] parse-(text|html|pdf) [...] Regards, Piotr
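Written out as a full property override in nutch-site.xml, Piotr's change might look like the sketch below. The value shown is only an abbreviation of the 0.7-era default list, so copy the exact default from your own nutch-default.xml and just add pdf to the parse-() group:

```xml
<property>
  <name>plugin.includes</name>
  <!-- default plugin list with parse-pdf added;
       verify the full value against nutch-default.xml -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
</property>
```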
Re: metadata support in WebDB (Stefan's NUTCH-59 patch)
Hi Otis, http://issues.apache.org/jira/browse/NUTCH-59 This patch looks interesting for my Nutch needs. So please vote for the patch if you like it. :-) I can't look at the code, but looking at your diff, it looks like this metadata would be stored somewhere inside Nutch's WebDB, and that one would have to provide this metadata to Nutch during URL injection; is this correct? Yes, metadata is part of the page object and stored in the WebDB. You can add metadata in any situation where you maintain this page object. So you can have a custom injector, as you describe, to set metadata, but more interestingly you can set it at fetch time as well, or in any situation where you have access to the page object (e.g. segment generation, db update, etc.). I currently have this little wrapper method around a few of Nutch's tool classes (below). I first generate a plain-text file with all URLs I want to fetch, then I call the method below, and then I just call Fetcher.main(...). If I want to associate some metadata with each URL to be fetched, where would I insert it into the system? Would I need my own injector class with my own addPage method that pulls metadata in (from some external storage) for each URL it gets, and calls dbWriter.addPageIfNotPresent(page) like WebDBInjector does with DMOZ data? Yes. I personally suggest creating an extension point for the injector. Other people may find that interesting as well, and you could contribute this extension point. :) Write a small plugin that looks up the metadata you plan to add from a MySQL db or so and adds it to the page object. That's it. You can do very interesting things during the life-cycle of the page object, for example generating metadata from HTML content or fetch time (more intelligent fetching), or handing over metadata from pages to links, etc. Keep in mind that you can have storage types other than the existing map for storing metadata; for example, you could implement a StringArray or other datatypes. In general, I would love to see more interest in this patch and some votes, since I think such metadata can bring a lot of new possible features to Nutch. The very interesting part is that if you do not use metadata, the WebDB is not bloated, and the patch does not slow down WebDB processing speed. Greetings, Stefan