Re: Please, unsubscribe me
You are doomed to read about Nutch to the very ends of your existence, people.

2009/10/29 Le Manh Cuong cuong...@gmail.com:
Sorry, but the last time I tried to unsubscribe it didn't work, and it doesn't work now either. :)

-----Original Message-----
From: SunGod [mailto:sun...@cheemer.org]
Sent: Thursday, October 29, 2009 10:09 AM
To: nutch-user@lucene.apache.org
Subject: Re: Please, unsubscribe me

List-Help: mailto:nutch-user-h...@lucene.apache.org
List-Unsubscribe: mailto:nutch-user-unsubscr...@lucene.apache.org
List-Post: mailto:nutch-user@lucene.apache.org
List-Id: nutch-user.lucene.apache.org

2009/10/29 Le Manh Cuong cuong...@gmail.com:
Me too. Could you please help to remove me (cuong09m@gmail.com) from the Nutch and Hadoop mailing lists?

-----Original Message-----
From: caoyuzhong [mailto:caoyuzh...@hotmail.com]
Sent: Thursday, October 29, 2009 9:49 AM
To: nutch-user@lucene.apache.org
Subject: RE: Please, unsubscribe me

The unsubscription message does not work for me either. Could you please help to remove me (caoyuzh...@hotmail.com) from the Nutch and Hadoop mailing lists?

Subject: Please, unsubscribe me
From: nsa...@officinedigitali.it
To: nutch-user@lucene.apache.org
Date: Wed, 28 Oct 2009 16:43:05 +0100

Hi, the unsubscription message doesn't work. Please remove me from the list. Thanks.
Re: Authenticity of URLs from DMOZ
Gaurang,

About those AVG alerts: you are fetching web pages together with any viruses they may be infected with, so of course antivirus software will scream about it. I wouldn't run that kind of software on a crawling machine at all.

Respectfully, David Jashi

On Tue, Oct 6, 2009 at 12:36, Gaurang Patel gaurangtpa...@gmail.com wrote:
Hey,

Can anyone tell me the reason for the following, which happened while fetching data using bin/nutch fetch: my AVG antivirus is detecting virus threats while Nutch fetches pages from the available URLs in the crawldb. I injected the DMOZ Open Directory URLs into the crawldb, and the antivirus detected 4 threats within only half an hour of the start of fetching. Is there any other way (any source other than DMOZ) to get a list of whole-web URLs? Or is there an automatic way to avoid fetching such harmful URLs? Let me know ASAP.

Regards,
Gaurang
Re: Multilanguage support in Nutch 1.0
On Wed, Sep 30, 2009 at 01:12, BELLINI ADAM mbel...@msn.com wrote:
hi, try to activate the language-identifier plugin. You must add it in the nutch-site.xml file, in the <name>plugin.includes</name> section.

Shame on me! Thanks a lot.

It's something like this:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.</description>
</property>

From: da...@jashi.ge
Date: Tue, 29 Sep 2009 18:59:52 +0400
Subject: Multilanguage support in Nutch 1.0
To: nutch-user@lucene.apache.org

Hello, all. I've got a bit of trouble with Nutch 1.0 and multilanguage support. I have a fresh install of Nutch and two analysis plugins I'd like to turn on: analysis-de (German) and analysis-ge (Georgian). Here are the innards of my seed file:

---
http://212.72.133.54/l/test.html
http://212.72.133.54/l/de.html
---

The first is Georgian, the other German.
When I run bin/nutch crawl seed -dir crawl -threads 10 -depth 2, there is not the slightest sign of anything calling the analysis plug-ins, even though hadoop.log clearly states that they are on and active:

---
2009-09-29 16:39:13,328 INFO crawl.Crawl - crawl started in: crawl
2009-09-29 16:39:13,328 INFO crawl.Crawl - rootUrlDir = seed
2009-09-29 16:39:13,328 INFO crawl.Crawl - threads = 10
2009-09-29 16:39:13,328 INFO crawl.Crawl - depth = 2
2009-09-29 16:39:13,375 INFO crawl.Injector - Injector: starting
2009-09-29 16:39:13,375 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
2009-09-29 16:39:13,375 INFO crawl.Injector - Injector: urlDir: seed
2009-09-29 16:39:13,390 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2009-09-29 16:39:13,421 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-09-29 16:39:15,546 INFO plugin.PluginRepository - Plugins: looking in: C:\cygwin\opt\nutch\plugins
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Registered Plugins:
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Lucene Analysers (lib-lucene-analyzers)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Language Identification Parser/Filter (language-identifier)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
! 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Georgian Analysis Plug-in (analysis-ge)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - German Analysis Plug-in (analysis-de) !
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Site Query Filter (query-site)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - More Query Filter (query-more)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
---
Re: Multilanguage support in Nutch 1.0
On Wed, Sep 30, 2009 at 01:12, BELLINI ADAM mbel...@msn.com wrote:
hi, try to activate the language-identifier plugin. You must add it in the nutch-site.xml file, in the <name>plugin.includes</name> section.

Oops. It IS activated:

2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Language Identification Parser/Filter (language-identifier)

But fetched pages are not passed to it, as far as I can reckon.
Re: graphical user interface v0.2 for nutch
Any documentation on how to add this GUI to an existing Nutch instance?

Respectfully, David Jashi

2009/9/30 Bartosz Gadzimski bartek...@o2.pl:
Hello,

First - great job, it looks and works very nicely. I have a question about urlfilters: is it possible to get a regex-urlfilter per instance (a different one for each instance)? Also, what is the nutch-gui/conf/regex-urlfilter.txt file for?

Feature request - an option to merge segments, or maybe to remove old ones?

Thanks, Bartosz
Re: graphical user interface v0.2 for nutch
Thanks. Sorry for my bad English, I'll rephrase: can I add this GUI to an existing Nutch installation? I've made some modifications to mine, so starting from scratch would be quite time-consuming.

Respectfully, David Jashi

On Wed, Sep 30, 2009 at 18:19, Marko Bauhardt m...@101tec.com wrote:
Hi David. Sorry, I don't understand your question. Documentation about the Nutch GUI can be found here: http://wiki.github.com/101tec/nutch
marko

On Sep 30, 2009, at 4:02 PM, David Jashi wrote:
Any documentation on how to add this GUI to an existing Nutch instance?
Re: graphical user interface v0.2 for nutch
That's 1.0. Thanks a lot, I'll give it a try.

Respectfully, David Jashi

On Wed, Sep 30, 2009 at 18:37, Marko Bauhardt m...@101tec.com wrote:
Sorry for my bad English, I'll rephrase:
:) No problem.
Can I add this GUI to an existing Nutch installation? I've made some modifications to mine, so starting from scratch would be quite time-consuming.
Ah, ok, understood. Hm. The GUI is forked from the release-1.0 tag. Which Nutch version have you patched? You can try to make a diff against release-1.0 to create a patch file; after that, check out or download the GUI and try to apply your patch. Maybe that could work.
marko
Multilanguage support in Nutch 1.0
Hello, all. I've got a bit of trouble with Nutch 1.0 and multilanguage support. I have a fresh install of Nutch and two analysis plugins I'd like to turn on: analysis-de (German) and analysis-ge (Georgian). Here are the innards of my seed file:

---
http://212.72.133.54/l/test.html
http://212.72.133.54/l/de.html
---

The first is Georgian, the other German. When I run bin/nutch crawl seed -dir crawl -threads 10 -depth 2, there is not the slightest sign of anything calling the analysis plug-ins, even though hadoop.log clearly states that they are on and active:

---
2009-09-29 16:39:13,328 INFO crawl.Crawl - crawl started in: crawl
2009-09-29 16:39:13,328 INFO crawl.Crawl - rootUrlDir = seed
2009-09-29 16:39:13,328 INFO crawl.Crawl - threads = 10
2009-09-29 16:39:13,328 INFO crawl.Crawl - depth = 2
2009-09-29 16:39:13,375 INFO crawl.Injector - Injector: starting
2009-09-29 16:39:13,375 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
2009-09-29 16:39:13,375 INFO crawl.Injector - Injector: urlDir: seed
2009-09-29 16:39:13,390 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2009-09-29 16:39:13,421 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-09-29 16:39:15,546 INFO plugin.PluginRepository - Plugins: looking in: C:\cygwin\opt\nutch\plugins
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Registered Plugins:
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Lucene Analysers (lib-lucene-analyzers)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Language Identification Parser/Filter (language-identifier)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
! 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Georgian Analysis Plug-in (analysis-ge)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - German Analysis Plug-in (analysis-de) !
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Site Query Filter (query-site)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - More Query Filter (query-more)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - URL Query Filter (query-url)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
---

At the same time:

---
2009-09-29 16:39:54,406 INFO lang.LanguageIdentifier - Language identifier configuration
2009-09-29 16:39:54,609 INFO lang.LanguageIdentifier - Language identifier plugin supports: it(1000) is(1000) hu(1000) th(1000) sv(1000) ge(1000) fr(1000) ru(1000) fi(1000) es(1000) en(1000) el(1000) ee(1000) pt(1000) de(1000) da(1000) pl(1000) no(1000) nl(1000)
---

And the language identifier itself works like a charm:

---
$ bin/nutch plugin language-identifier org.apache.nutch.analysis.lang.LanguageIdentifier -identifyurl http://212.72.133.54/l/test.html
text was identified as ge
---
$ bin/nutch plugin language-identifier org.apache.nutch.analysis.lang.LanguageIdentifier -identifyurl http://212.72.133.54/l/de.html
text was identified as de
---

What could have possibly gone wrong?

Respectfully, David Jashi
Fwd: Release 1.0?
David Jashi wrote:
Wow. Does it mean we'll have live indexing out of the box?

If by "live" you mean that you can index a fetched, parsed segment and have it appear immediately in live search after you commit, then yes. Other than that, Nutch still uses segments as a unit of work, so segment generation / fetching / parsing / updatedb etc. are still batch operations that take time.

Yes, that's what I meant. Very nice. By the way, is there any chance to modify stemming to process several word forms (tokens) at once, and not one by one? That would really increase the speed of my external stemming.

You can implement your own analyzer, which first caches all tokens from the TokenStream and then passes them all at once to the external process.

Thanks for the hint, I'll dig in that direction.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com

--
with best regards,
David Jashi
Web development EO, Caucasus Online
+995(32)970368
da...@jashi.ge
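Andrzej's suggestion - cache the whole token stream first, then make a single batched call to the external stemmer - can be sketched without tying it to any particular Lucene version. A minimal sketch follows; the class name and the suffix-stripping stand-in for the external HTTP stemmer are hypothetical illustrations, not Nutch's actual API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Collects all tokens first, then resolves their stems in one
 *  batched call, instead of one external request per token. */
public class BatchStemmer {

    /** Stand-in for the external stemming service: one round trip for
     *  the whole batch. A real implementation would send all terms to
     *  the HTTP stemmer at once and parse the response. */
    protected Map<String, String> stemBatch(List<String> terms) {
        Map<String, String> stems = new HashMap<>();
        for (String t : terms) {
            // trivial suffix-stripping stand-in, for illustration only
            stems.put(t, t.endsWith("y") ? t.substring(0, t.length() - 1) : t);
        }
        return stems;
    }

    /** Caches the token sequence, issues one batched lookup, then
     *  re-emits the tokens with their stems substituted in order. */
    public List<String> stemAll(List<String> tokens) {
        Map<String, String> stems = stemBatch(new ArrayList<>(tokens));
        List<String> out = new ArrayList<>(tokens.size());
        for (String t : tokens) {
            out.add(stems.getOrDefault(t, t));
        }
        return out;
    }
}
```

The point of the design is that the network cost becomes one round trip per document rather than one per token, which is exactly what the external-stemmer setup discussed in this thread needs.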
Re: Release 1.0?
Wow. Does it mean we'll have live indexing out of the box?

By the way, is there any chance to modify stemming to process several word forms (tokens) at once, and not one by one? That would really increase the speed of my external stemming.

2009/2/2 Tony Wang ivyt...@gmail.com:
I definitely like the Nutch/Solr integration the best! Thanks guys!

--
with best regards,
David Jashi
Stemmer
Hello, everyone. Is there any chance to make Nutch call the stemmer in batch? That is, to pass it not a single word (token) but an array of words. My stemmer has external parts called by HTTP request, so you can imagine what performance overhead I have.

--
with best regards,
David Jashi
Web development EO, Caucasus Online
+995(32)970368
da...@jashi.ge
Re: Nutch Training Seminar
If it's over the web - I'm in.

On Mon, Dec 1, 2008 at 3:23 AM, Windflying [EMAIL PROTECTED] wrote:
I'm interested. Cheers,

-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED]]
Sent: Monday, 1 December 2008 2:51 AM
To: nutch-user@lucene.apache.org
Subject: Re: Nutch Training Seminar

Ok. It seems like a lot of people are interested. I will put something together and keep everyone up to date. Thanks to everyone who responded.
Dennis

Dennis Kubes wrote:
Would anybody be interested in a Nutch training seminar that goes over the following:
1) Installing and configuring Nutch
2) Crawling the web, the CrawlDb, and URL filters
3) Parsing and parse filters
4) Nutch plugins and the plugin architecture
5) Analysis, link analysis, and scoring
6) Indexing and custom fields
7) Deployment, shard architecture
8) Writing custom tools for Nutch
9) Hadoop architecture
Are there other things people would want to go over?
Dennis

--
with best regards,
David Jashi
Web development EO, Caucasus Online
+995(32)970368
[EMAIL PROTECTED]
Re: Language Analysis Plugins
That would be nice.

On Wed, Nov 26, 2008 at 7:31 PM, Dennis Kubes [EMAIL PROTECTED] wrote:
For the 1.0 release, would everyone like the new language analysis plugins for different languages activated by default? Currently the language analysis plugins are not activated by default. We are adding 8 new languages and 7 new plugins.
Dennis

--
with best regards,
David Jashi
Web development EO, Caucasus Online
+995(32)970368
[EMAIL PROTECTED]
Re: Lost regarding Stemming in Nutch
I managed to connect Nutch 0.9 to my stemming machine. I don't know if my approach would work on 0.8.1.

On Wed, Oct 29, 2008 at 10:56 PM, jcze [EMAIL PROTECTED] wrote:
Hi, I'm using Nutch 0.8.1 and I'm lost about stemming in Nutch. I tried the wiki on MultiLingual Support, because it said it could stem the words, but it said I need to modify the IndexSegment class, which I couldn't find. Anyway, I tried the stemming for Nutch 0.8, but it didn't work either. I need some guidance and help - really, really lost.
--
View this message in context: http://www.nabble.com/Lost-regrading-Stemming-in-nutch-tp20233602p20233602.html
Sent from the Nutch - User mailing list archive at Nabble.com.

--
with best regards,
David Jashi
Web development EO, Caucasus Online
+995(32)970368
[EMAIL PROTECTED]
Re: encoding
Everything is in order: for some strange reason Nutch returns some 16-bit representation instead of UTF-8, but the text is displayed normally anyhow.

On Mon, Sep 29, 2008 at 1:04 PM, daut [EMAIL PROTECTED] wrote:
hello, I've installed nutch-0.9, made my first crawl and then made a search on the search page. Everything seems OK - I can see all result characters correctly (non-ASCII characters, Georgian language). But when I view the page source, instead of Georgian letters, for example პოლ, there are symbols such as &#4318;&#4317;&#4314;. Why does this happen? Is it normal?
Best Rgds, daut.
--
View this message in context: http://www.nabble.com/encoding-tp19720443p19720443.html
Sent from the Nutch - User mailing list archive at Nabble.com.

--
with best regards,
David Jashi
Web development EO, Caucasus Online
+995(32)970368
[EMAIL PROTECTED]
Re: encoding
It's definitely a Tomcat issue. I just browsed through the segments/*/content/part-*/data files with a hex viewer, and it looks like Nutch uses some sort of compression.

2008/9/29 daut [EMAIL PROTECTED]:
I want to use utf-8. How can I force nutch to use utf-8? Or is it a tomcat issue?

David Jashi wrote:
Everything is in order: for some strange reason Nutch returns some 16-bit representation instead of UTF-8, but the text is displayed normally anyhow.
--
View this message in context: http://www.nabble.com/encoding-tp19720443p19721356.html
Sent from the Nutch - User mailing list archive at Nabble.com.

--
with best regards,
David Jashi
Web development EO, Caucasus Online
+995(32)970368
[EMAIL PROTECTED]
Re: pages with duplicate content in search results
Sorry for the off-topic, but how do you make Nutch 0.9 search multiple indexes?

On Thu, Sep 25, 2008 at 4:42 PM, Dennis Kubes [EMAIL PROTECTED] wrote:
If you are using more than one index, then dedup will not work across indexes. A single index should dedup correctly unless the pages are not exact duplicates but near duplicates. The dedup process works on URL and byte hash; if the content is even 1 byte different, it doesn't work. Near-duplicate detection is another set of algorithms that hasn't been implemented in Nutch yet. On the query side you can set hitsPerSite to 1, and it should limit your search results.
Dennis

Edward Quick wrote:
Hi, even though I ran nutch dedup on my index, I still have pages with different URLs but exactly the same content (see the search result example below). From what I read about dedup this shouldn't happen, as it deletes the URL with the lowest score. Is there anything else I can try to get rid of these?
Thanks, Ed.

Item Document :- Client - TeraTerm Pro ... Intranet - Technical Standards Online Employee Self Service ESS Home ... Description Document Technology Category: Client Name of item: TeraTerm Pro Related policy: Unix Access Tool Vendor: Current Technical Status ... standard Telnet tool. Where printing or keymapping is an issue, TeraTerm ...
http://www.somedomain.com/im/tech/technica.nsf/8918e269a19be23f802563ef004e8e7a/441cdf92bbe06a9e80256c87003d81d9?OpenDocument (cached) (explain) (anchors)

Item Document :- Client - TeraTerm Pro ... Intranet - Technical Standards Online Employee Self Service ESS Home ... Description Document Technology Category: Client Name of item: TeraTerm Pro Related policy: Unix Access Tool Vendor: Current Technical Status ... standard Telnet tool. Where printing or keymapping is an issue, TeraTerm ...
http://www.somedomain.com/im/tech/technica.nsf/dacff06c3e1dbc9780257273004e1e3b/441cdf92bbe06a9e80256c87003d81d9?OpenDocument (cached) (explain) (anchors)

--
with best regards,
David Jashi
Web development EO, Caucasus Online
+995(32)970368
[EMAIL PROTECTED]
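Dennis's point - that byte-hash dedup only catches exact duplicates - can be illustrated with a small standalone sketch. This is not Nutch's actual DeleteDuplicates job, just a hypothetical class showing the property he describes: two pages count as duplicates only if their content digests are identical, so a single differing byte defeats it.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrates exact-duplicate detection by content hash: identical
 *  bytes give identical signatures; one changed byte gives a
 *  completely different signature. */
public class ContentSignature {

    /** MD5 digest of the page content, hex-encoded. */
    public static String signature(String content) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(content.getBytes(StandardCharsets.UTF_8));
            // zero-padded 32-character lowercase hex string
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
    }
}
```

The two near-identical TeraTerm pages above would dedup only if their fetched bytes matched exactly; any dynamic element (a timestamp, a session id) in the page makes the signatures diverge, which is why near-duplicate detection needs a different class of algorithms.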
Re: where to find the location of rss feed
Hello, Arun. The easiest way to get the RSS file is to right-click on the RSS link in the lower right corner of the Nutch search results and choose "Save Target As..." in the pop-up menu. The address should be something like: http://hostname/nutch-0.9/search.jsp?query=test&hitsPerPage=10&lang=en

On Sat, Sep 20, 2008 at 7:37 AM, Arun Kamal [EMAIL PROTECTED] wrote:
hi all, I'm a newbie in Nutch trying to see how the RSS is stored. I couldn't find a way of getting to the RSS feed file, and I want to see it to understand how the RSS is stored. Please help me. Thanks in advance, Arun Kamal
--
View this message in context: http://www.nabble.com/where-to-find-the-location-of-rss-feed-tp19582613p19582613.html
Sent from the Nutch - User mailing list archive at Nabble.com.

--
with best regards,
David Jashi
Web development EO, Caucasus Online
+995(32)970368
[EMAIL PROTECTED]
Re: Dedup
Thanks, Andrzej. In fact I meant DD/MM/YYYY. Anyway, knowing that dedup keeps the latest version of a file makes my life a bit easier.

On Fri, Sep 19, 2008 at 12:35 AM, Andrzej Bialecki [EMAIL PROTECTED] wrote:
[EMAIL PROTECTED] wrote:
Isn't he in fact NOT using the US date notation? AFAIK, the US date notation is mm/dd/yyyy.
Hehe, you are right - the joke is on me - the use of slashes misled me. Still, my answer holds: dedup will keep just the latest version of the page.
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com

--
with best regards,
David Jashi
Web development EO, Caucasus Online
+995(32)970368
[EMAIL PROTECTED]
Dedup
Hello, colleagues. I have a theoretical question. Let's say:

on 01/01/2008 we crawled the page http://www.site.com/page.html
on 10/01/2008 the page changed
on 01/02/2008 we crawled it once again and merged the old and new indexes

Which version of this page will Nutch dedup leave in the index?

--
with best regards,
David Jashi
Web development EO, Caucasus Online
+995(32)970368
[EMAIL PROTECTED]
Problems with highlighter
Hello, I've implemented a GeorgianStemmer for Nutch, and even modified the source of BasicQueryFilter.java to use stems for search. Here is the modified part of the source code:

private org.apache.lucene.search.Query exactPhrase(Phrase nutchPhrase, String field, float boost) {
  Term[] terms = nutchPhrase.getTerms();
  PhraseQuery exactPhrase = new PhraseQuery();
  for (int i = 0; i < terms.length; i++) {
    exactPhrase.add(luceneTerm(field, terms[i]));
  }
  exactPhrase.setBoost(boost);
  return exactPhrase;
}

Everything works; there is only one little cloud on the blue sky of my happiness: the damn highlighter never works when I return results that contain the searched words in a different form. To make it clear (translating it to English): if I look for "watery", it finds strings containing "water" and "watery", but it highlights only those that literally match the search criteria, i.e. "watery". Any ideas?
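The mismatch described here happens because the highlighter compares surface forms while the index matches on stems. One way out, sketched below, is to stem both the query term and each document token and highlight on stem equality. This is a hypothetical, framework-free illustration of the principle, not Lucene's Highlighter API; the class name and the pluggable stemmer function are assumptions:

```java
import java.util.Arrays;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

/** Stem-aware highlighting sketch: a token is marked when its stem
 *  equals the stem of the query term, so inflected forms of the
 *  query word get highlighted too. */
public class StemHighlighter {

    private final UnaryOperator<String> stemmer;

    public StemHighlighter(UnaryOperator<String> stemmer) {
        this.stemmer = stemmer;
    }

    /** Wraps every whitespace-separated token whose stem equals the
     *  query's stem in <b>...</b> markers. */
    public String highlight(String text, String queryTerm) {
        String queryStem = stemmer.apply(queryTerm);
        return Arrays.stream(text.split(" "))
                .map(tok -> stemmer.apply(tok).equals(queryStem)
                        ? "<b>" + tok + "</b>" : tok)
                .collect(Collectors.joining(" "));
    }
}
```

With a toy stemmer that strips a trailing "y", a search for "watery" would then highlight both "water" and "watery" in the snippet, which is the behavior the message above is missing.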
Re: Problems with highlighter
.replaceAll("%D9", "\u10E6")
.replaceAll("%DA", "\u10E7")
.replaceAll("%DB", "\u10E8")
.replaceAll("%DC", "\u10E9")
.replaceAll("%DD", "\u10EA")
.replaceAll("%DE", "\u10EB")
.replaceAll("%DF", "\u10EC")
.replaceAll("%E0", "\u10ED")
.replaceAll("%E1", "\u10EE")
.replaceAll("%E3", "\u10EF")
.replaceAll("%E4", "\u10F0");
    return b;
  }

  private static String recodeEscIKELat(String term) {
    String b = term
      .replaceAll("%C0", "a")
      .replaceAll("%C1", "b")
      .replaceAll("%C2", "g")
      .replaceAll("%C3", "d")
      .replaceAll("%C4", "e")
      .replaceAll("%C5", "v")
      .replaceAll("%C6", "z")
      .replaceAll("%C8", "T")
      .replaceAll("%C9", "i")
      .replaceAll("%CA", "k")
      .replaceAll("%CB", "l")
      .replaceAll("%CC", "m")
      .replaceAll("%CD", "n")
      .replaceAll("%CF", "o")
      .replaceAll("%D0", "p")
      .replaceAll("%D1", "J")
      .replaceAll("%D2", "r")
      .replaceAll("%D3", "s")
      .replaceAll("%D4", "t")
      .replaceAll("%D6", "u")
      .replaceAll("%D7", "f")
      .replaceAll("%D8", "q")
      .replaceAll("%D9", "R")
      .replaceAll("%DA", "y")
      .replaceAll("%DB", "S")
      .replaceAll("%DC", "C")
      .replaceAll("%DD", "c")
      .replaceAll("%DE", "Z")
      .replaceAll("%DF", "w")
      .replaceAll("%E0", "W")
      .replaceAll("%E1", "x")
      .replaceAll("%E3", "j")
      .replaceAll("%E4", "h");
    return b;
  }

  public GeorgianStemmer() {
  }

  /**
   * Stems the given term to a unique <tt>discriminator</tt>.
   *
   * @param term The term that should be stemmed.
   * @return Discriminator for <tt>term</tt>
   */
  protected String stem(String term) {
    String stem = term;
    String instring;
    try {
      // Create the URL of the external stemming service
      URL url = new URL("http://127.0.0.1:8042/?" + encodeIKE(term));
      URLConnection yc = url.openConnection();
      BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
      while ((instring = in.readLine()) != null) {
        stem = recodeEscIKE(instring);
      }
      in.close();
    } catch (MalformedURLException e) {
      return term;
    } catch (IOException e) {
      return term;
    }
    if ((stem != null) && (stem.charAt(0) != '_') && (stem.charAt(0) != 'U')) {
      return stem;
    } else {
      return term;
    }
  }

  public static void main(String[] args) {
    GeorgianStemmer stemmer = new GeorgianStemmer();
    System.out.println(stemmer.stem(args[0]));
  }
}

On Fri, Sep 12, 2008 at 1:34 PM, Lyndon Maydwell [EMAIL PROTECTED] wrote:
I'd be happy to try to trawl through the code for you :) I've been looking for stemming code that will run on 1.0 for ages now!

--
with best regards,
David Jashi
Web development EO, Caucasus Online
+995(32)970368
[EMAIL PROTECTED]
Re: intranet crawling
It may be a crude solution to that problem, but when I wanted ALL of my video hosting site indexed, I simply generated a list like

http://tvali.ge/index.php?action=watch&v=495
http://tvali.ge/index.php?action=watch&v=496
http://tvali.ge/index.php?action=watch&v=497

from the MySQL table containing the list of posts, and put it into the urls dir.

On Thu, Sep 4, 2008 at 6:56 PM, Edward Quick [EMAIL PROTECTED] wrote:
Hi, I want to do an exhaustive scan of our intranet, but running bin/nutch crawl urls -dir crawl -depth 9 -topN 50 doesn't get everything. I've increased this now to bin/nutch crawl urls -dir crawl -depth 30 -topN 1000, and it's certainly running longer, but I'm not sure whether it will still miss pages. Is there any way of doing this so I get an index of the whole intranet?
Thanks, Ed.

--
with best regards,
David Jashi
Web development EO, Caucasus Online
+995(32)970368
[EMAIL PROTECTED]
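The "dump every post id into a seed file" trick above can be sketched in a few lines. In practice the ids would come from a MySQL query; here a plain id range stands in so the sketch is self-contained, and the class name is a hypothetical illustration:

```java
import java.util.ArrayList;
import java.util.List;

/** Generates a Nutch seed list with one URL per post id, mirroring
 *  the "generate the list from the posts table" approach. */
public class SeedListGenerator {

    /** One fully-qualified URL per id, ready to be written line by
     *  line into a file under the urls/ seed directory. */
    public static List<String> seedUrls(String base, int firstId, int lastId) {
        List<String> urls = new ArrayList<>();
        for (int id = firstId; id <= lastId; id++) {
            urls.add(base + "/index.php?action=watch&v=" + id);
        }
        return urls;
    }
}
```

Writing the returned lines to urls/seed.txt and injecting them guarantees every post is visited regardless of crawl depth, which is exactly why the trick works for an exhaustive site index.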
Re: problems: crawling specific domain
Ever tried to use this one: http://wiki.apache.org/nutch/Nutch_0%2e9_Crawl_Script_Tutorial ?

About single-site crawls: http://peterpuwang.googlepages.com/NutchGuideForDummies.htm , part 4.

On Wed, Sep 3, 2008 at 8:53 AM, Mohammad Monirul Hoque [EMAIL PROTECTED] wrote:
Hi, how can I crawl a specific domain only (like www.yellowpages.co.za)? What do I have to change to make things work correctly? I tried changing crawl-urlfilter.txt, and after some time Nutch started crawling outside my domain. I am using Nutch 0.9 in standalone mode (without Hadoop). Can anyone give me some idea how to merge indexes from different crawls into a single index?
Regards, mohammad monirul hoque

--
with best regards,
David Jashi
Web development EO, Caucasus Online
+995(32)970368
[EMAIL PROTECTED]
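For reference, conf/crawl-urlfilter.txt in Nutch 0.9 is meant to be edited for exactly this case. A minimal sketch, assuming the domain from the question (adjust the host pattern to taste) - filters are applied top to bottom, so the final catch-all "-." line is what keeps the crawl from escaping the domain:

```
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# accept only hosts under the target domain
+^http://([a-z0-9]*\.)*yellowpages.co.za/
# reject everything else
-.
```

A common cause of the "crawling outside my domain" symptom is a missing or misplaced final "-." line: without it, URLs that match none of the patterns slip through.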