Re: AW: Null Indexing
Hi, I am getting the following error while running the crawler with:

    bin/nutch crawl urls -dir crawl_NEW1 -depth 3 -topN 50

    Dedup: adding indexes in: crawl_NEW1/indexes
    Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

Can anyone help me clear this problem?

-N.Mehala

On Wed, Sep 23, 2009 at 10:14 AM, Cisek faust...@mailinator.com wrote:

I had the same "little big" problem - everything seemed OK:

- bin/nutch org.apache.nutch.searcher.NutchBean <search query> (in my case the search query was "apache") in Cygwin returned "62 Total hits" on a crawl of +^http://([a-z0-9]*\.)*apache.org/
- The Nutch webapp in Tomcat seemed fine after deployment (no errors).
- Unlike Ramadhany, I had NOT created a new XML file named nutch-0.9.xml containing <Context path="/nutch-0.9/" debug="5" privileged="true" docBase="C:\nutch-0.9" /> and had NOT put it in C:\Tomcat6.0\conf\Catalina\localhost - but I still got "Hits 0-0 (out of about 0 total matching pages):" in the Tomcat-Nutch web interface.

... but I have solved it in my case: I had forgotten to configure searcher.dir in the nutch-site.xml at C:\Tomcat6.0\webapps\nutch-0.9\WEB-INF\classes, as described in http://wiki.apache.org/nutch/GettingNutchRunningWithWindows ("Set Your Searcher Directory"). Now it works fine - the Tomcat-Nutch interface returns 62 matching pages :)

Imam Nur Ramadhany wrote:

Hello again everyone. My configuration is exactly what http://wiki.apache.org/nutch/GettingNutchRunningWithWindows describes; I'm new to Tomcat and Java, so I just followed the instructions. I extracted the release at C:\nutch-0.9, made a directory named urls with a file also named urls (without extension), then added the URLs to crawl-urlfilter.txt (C:\nutch-0.9\conf\crawl-urlfilter.txt). I have also crawled a site (http://localhost/). For the web search interface I deployed the Nutch WAR file and created a new XML file named nutch-0.9.xml, containing <Context path="/nutch-0.9/" debug="5" privileged="true" docBase="C:\nutch-0.9" />, and put it in C:\Tomcat6.0\conf\Catalina\localhost. I think that is where my problem is - are the path and docBase correct? When I open http://localhost:8080/nutch-0.9/ there is a welcome page, but when I enter a query and click search, it doesn't return any hits ("Hits 0-0 (out of about 0 total matching pages):"). I have also configured searcher.dir in the nutch-site.xml at C:\Tomcat6.0\webapps\nutch-0.9\WEB-INF\classes anyway. Then, following Koch Martina's suggestion, I tried searching directly from the command line in Cygwin with: bin/nutch org.apache.nutch.searcher.NutchBean <search query> - and it works. I'm still experimenting with the nutch-0.9.xml to make the webapp work, trying different path and docBase values, but it would be helpful if you have any other suggestions.

Thanks in advance,
Ramadhany

From: Imam Nur Ramadhany ramadhanyov...@yahoo.com
To: nutch-user@lucene.apache.org
Sent: Tuesday, January 13, 2009 7:27:21 AM
Subject: Re: AW: Null Indexing

Thanks for your info Martina, it works with the command line but it doesn't when using the webapp (localhost:8080/nutch-0.9). Is deploying the WAR file via the Tomcat manager enough, or should we add some other file to CATALINA_HOME?
From: Koch Martina k...@huberverlag.de
To: nutch-user@lucene.apache.org
Sent: Friday, January 9, 2009 2:57:24 PM
Subject: AW: Null Indexing

Hi Ramadhany,

the warnings and fatals you see in the log have nothing to do with getting 0 results when searching. The fatal message can be eliminated by setting the property http.robots.agents in nutch-site.xml to "Imam Spider,*". The urlnormalizer warnings just inform you that you have not specified a dedicated URL normalizer for a certain scope, so the default URL normalizer is used. If you need more information on this, look at URLNormalizers.java (package org.apache.nutch.net).

To narrow down your search problem, please provide some more details on your configuration. Did you check the content of your index using Luke (http://www.getopt.org/luke/) to make sure that the pages and content you are expecting in the index are really in there? Did you try a search directly from the command line in Cygwin with the command: bin/nutch org.apache.nutch.searcher.NutchBean <search query>

Kind regards,
Martina

-----Original message-----
From: Imam Nur Ramadhany [mailto:ramadhanyov...@yahoo.com]
Sent: 09 January 2009 01:39
To: nutch-user@lucene.apache.org
Subject: Null Indexing

I'm new to Nutch. I'm trying to deploy nutch-0.9 but am still having some problems. When I try to
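[For reference, a sketch of what Martina's first suggestion would look like in nutch-site.xml. The value comes from her message; the description paraphrases the one shipped in nutch-default.xml, so treat the exact wording as an assumption:]

    <property>
      <name>http.robots.agents</name>
      <value>Imam Spider,*</value>
      <description>The agent strings to look for in robots.txt files,
      comma-separated, in decreasing order of precedence.</description>
    </property>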
Re: R: Using Nutch for only retrieving HTML
Actually it's quite easy to modify the parse-html filter to do this, i.e. to save the HTML to a file or to some database; you could then configure Nutch to skip all unnecessary plugins. Whether using Nutch for this task is the right way to go depends a lot on your other requirements. If you can get by with wget -r, then Nutch is probably overkill.

Best regards,
Magnus

On Tue, Sep 29, 2009 at 10:25 PM, Susam Pal susam@gmail.com wrote:

> On Wed, Sep 30, 2009 at 1:39 AM, O. Olson olson_...@yahoo.it wrote:
>
>> Sorry for pushing this topic, but I would like to know if Nutch would help me get the raw HTML in my situation described below. I am sure the answer is simple to those who know Nutch; if not, then I guess Nutch is the wrong tool for the job. Thanks, O. O.
>>
>> --- On Thu, 24/9/09, O. Olson olson_...@yahoo.it wrote:
>> From: O. Olson olson_...@yahoo.it
>> Subject: Using Nutch for only retrieving HTML
>> To: nutch-user@lucene.apache.org
>> Date: Thursday, 24 September 2009, 20:54
>>
>> Hi, I am new to Nutch. I would like to completely crawl an internal website and retrieve all the HTML content. I don't intend to do further processing using Nutch. The website/content is rather large. By "crawl" I mean that I would go to a page, download/archive the HTML, get the links from that page, then download/archive those pages, and keep doing this until there are no new links.
>
> I don't think it is possible to retrieve pages and store them as separate files, one per page, without modifications to Nutch. I am not sure, though; someone will correct me if I am wrong here. However, it is easy to retrieve the HTML contents from the crawl DB using the Nutch API. But from your post, it seems, you don't want to do this.
>
>> Is this possible? Is this the right tool for this job, or are there other tools out there that would be better suited to my purpose?
>
> I guess 'wget' is the tool you are looking for. You can use it with the -r option to recursively download pages and store them as separate files on the hard disk, which is exactly what you need. You might want to use the -np option too. It is available for Windows as well as Linux.
>
> Regards,
> Susam Pal
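[To make Magnus's suggestion concrete, here is a rough sketch of an HtmlParseFilter extension that archives each page's raw HTML to a local directory. The class name, output directory, and file-naming scheme are invented for illustration; the filter signature is the HtmlParseFilter extension point as of Nutch 1.0:]

    // Hypothetical example - not part of Nutch. A minimal HtmlParseFilter
    // that archives the raw HTML of every parsed page to a local directory.
    package org.example.nutch;

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.ParseResult;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;

    public class HtmlArchiverFilter implements HtmlParseFilter {

      private Configuration conf;

      public ParseResult filter(Content content, ParseResult parseResult,
                                HTMLMetaTags metaTags, DocumentFragment doc) {
        try {
          // Derive a crude file name from the URL; a real implementation
          // would want something collision-safe (e.g. a hash of the URL).
          String name = content.getUrl().replaceAll("[^A-Za-z0-9.-]", "_") + ".html";
          File out = new File("/tmp/html-archive", name);   // assumed output dir
          out.getParentFile().mkdirs();
          FileOutputStream fos = new FileOutputStream(out);
          fos.write(content.getContent());                  // the raw fetched bytes
          fos.close();
        } catch (IOException e) {
          // Archiving is best-effort; don't fail the parse because of it.
        }
        return parseResult;  // pass the parse through unchanged
      }

      public void setConf(Configuration conf) { this.conf = conf; }

      public Configuration getConf() { return conf; }
    }

[Like any Nutch extension, this would still need a plugin.xml descriptor and an entry in plugin.includes before the parser invokes it.]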
Re: Multilanguage support in Nutch 1.0
On Wed, Sep 30, 2009 at 01:12, BELLINI ADAM mbel...@msn.com wrote:
> hi, try to activate the language-identifier plugin. you must add it in the nutch-site.xml file, in the plugin.includes section.

Shame on me! Thanks a lot.

> it's something like this:
>
>     <property>
>       <name>plugin.includes</name>
>       <value>protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
>       <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.</description>
>     </property>

From: da...@jashi.ge
Date: Tue, 29 Sep 2009 18:59:52 +0400
Subject: Multilanguage support in Nutch 1.0
To: nutch-user@lucene.apache.org

Hello, all. I've got a bit of trouble with Nutch 1.0 and multilanguage support. I have a fresh install of Nutch and two analysis plugins I'd like to turn on: analysis-de (German) and analysis-ge (Georgian). Here are the innards of my seed file:

---
http://212.72.133.54/l/test.html
http://212.72.133.54/l/de.html
---

The first is Georgian, the other German. When I run

    bin/nutch crawl seed -dir crawl -threads 10 -depth 2

there is not the slightest sign of anything calling the analysis plugins, even though hadoop.log clearly states that they are on and active:

---
2009-09-29 16:39:13,328 INFO crawl.Crawl - crawl started in: crawl
2009-09-29 16:39:13,328 INFO crawl.Crawl - rootUrlDir = seed
2009-09-29 16:39:13,328 INFO crawl.Crawl - threads = 10
2009-09-29 16:39:13,328 INFO crawl.Crawl - depth = 2
2009-09-29 16:39:13,375 INFO crawl.Injector - Injector: starting
2009-09-29 16:39:13,375 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
2009-09-29 16:39:13,375 INFO crawl.Injector - Injector: urlDir: seed
2009-09-29 16:39:13,390 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2009-09-29 16:39:13,421 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-09-29 16:39:15,546 INFO plugin.PluginRepository - Plugins: looking in: C:\cygwin\opt\nutch\plugins
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Registered Plugins:
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Lucene Analysers (lib-lucene-analyzers)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Language Identification Parser/Filter (language-identifier)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
!
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Georgian Analysis Plug-in (analysis-ge)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - German Analysis Plug-in (analysis-de)
!
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Site Query Filter (query-site)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - More Query Filter (query-more)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2009-09-29 16:39:15,671 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2009-09-29
Re: Multilanguage support in Nutch 1.0
On Wed, Sep 30, 2009 at 01:12, BELLINI ADAM mbel...@msn.com wrote:
> hi, try to activate the language-identifier plugin. you must add it in the nutch-site.xml file, in the plugin.includes section.

Ooops. It IS activated:

2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Language Identification Parser/Filter (language-identifier)

But fetched pages are not passed to it, as I reckon.
Re: graphical user interface v0.2 for nutch
Hello,

First - great job, it looks and works very nicely. I have a question about urlfilters: is it possible to get a regex-urlfilter per instance (different for each instance)? Also, what is the nutch-gui/conf/regex-urlfilter.txt file for?

Feature request - an option to merge segments, or maybe to remove old ones?

Thanks,
Bartosz
Re: graphical user interface v0.2 for nutch
On Sep 30, 2009, at 3:47 PM, Bartosz Gadzimski wrote:
> Hello,

Hi Bartosz

> First - great job, it looks and works very nicely.

:) Thanks!

> I have a question about urlfilters. Is it possible to get a regex-urlfilter per instance (different for each instance)?

Good idea. I think you could configure the property urlfilter.regex.file via the configuration tab, per instance. For example, an instance "fast-crawl" would use the urlfilter file named fast-regex-urlfilter.txt, and another instance would use another name. Can you test this?

> Feature request - an option to merge segments, or maybe to remove old ones?

OK, sure. You can create a feature request in the GUI issue tracker: http://oss.101tec.com/jira/browse/NUTCHGUI

Thanks for testing the GUI,
marko

~~~
101tec GmbH Halle (Saale), Saxony-Anhalt, Germany
http://www.101tec.com
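[A sketch of the per-instance override Marko describes, assuming the configuration tab ends up setting an ordinary Nutch property. The file name fast-regex-urlfilter.txt is his example and untested:]

    <property>
      <name>urlfilter.regex.file</name>
      <value>fast-regex-urlfilter.txt</value>
      <description>Name of the file containing regex URL filter rules
      for this instance.</description>
    </property>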
Re: graphical user interface v0.2 for nutch
Is there any documentation on how to add this GUI to an existing Nutch instance?

Respectfully,
David Jashi

2009/9/30 Bartosz Gadzimski bartek...@o2.pl:
> First - great job, it looks and works very nicely. I have a question about urlfilters: is it possible to get a regex-urlfilter per instance? ...
Re: graphical user interface v0.2 for nutch
Hi David. Sorry, I don't understand your question. Documentation about the Nutch GUI can be found here: http://wiki.github.com/101tec/nutch

marko

On Sep 30, 2009, at 4:02 PM, David Jashi wrote:
> Is there any documentation on how to add this GUI to an existing Nutch instance? ...
Re: graphical user interface v0.2 for nutch
Thanks. Sorry for my bad English, I'll rephrase: can I add this GUI to an existing Nutch installation? I've made some modifications to mine, so starting from scratch would be quite time-consuming.

Respectfully,
David Jashi

On Wed, Sep 30, 2009 at 18:19, Marko Bauhardt m...@101tec.com wrote:
> Hi David. Sorry, I don't understand your question. Documentation about the Nutch GUI can be found here: http://wiki.github.com/101tec/nutch ...
Re: graphical user interface v0.2 for nutch
> Sorry for my bad English, I'll rephrase:

:) No problem.

> Can I add this GUI to an existing Nutch installation? I've made some modifications to mine, so starting from scratch would be quite time-consuming.

Ah, OK, understood. Hm. The GUI is forked from the release-1.0 tag. Which Nutch version have you patched? You could diff your tree against the release-1.0 tag to create a patch file; after that, check out or download the GUI and try to apply your patch. Maybe this could work.

marko
Re: Specify at least one source--a file or resource collection error
I've solved this problem by using Ant 1.6.5 instead of 1.7.

On 29 September 2009 12:18, Jaime Martín james...@gmail.com wrote:

> Hi again: I just want to be able to build Nutch in Eclipse. What version do you use? Is the latest official release, 1.0, not advisable? Is any plugin or particular SVN revision required? Thank you very much.

On 23 September 2009 15:40, Jaime Martín james...@gmail.com wrote:

> Hi: I'm following the steps for running the Nutch 1.0 release with Eclipse and Windows described at http://wiki.apache.org/nutch/RunNutchInEclipse1.0. I'm trying to build it, but when I launch the war target I get this error:
>
>     C:\ECLIPSE321\workspace\nutch-1.0\build.xml:62: Specify at least one source--a file or resource collection.
>
> Any tip? Thank you!
Re: graphical user interface v0.2 for nutch
That's 1.0. Thanks a lot, I'll give it a try.

Respectfully,
David Jashi

On Wed, Sep 30, 2009 at 18:37, Marko Bauhardt m...@101tec.com wrote:
> Ah, OK, understood. Hm. The GUI is forked from the release-1.0 tag. Which Nutch version have you patched? ...
RE: Multilanguage support in Nutch 1.0
hi,
do you have some 'lang' metadata on the pages? Because the plugin first tries to get the language from the metadata. If you look in the java source of the plugin, LanguageIndexingFilter.java:

    // check if LANGUAGE found, possibly put there by HTMLLanguageParser
    String lang = parse.getData().getParseMeta().get(Metadata.LANGUAGE);

    // check if HTTP-header tells us the language
    if (lang == null) {
      lang = parse.getData().getContentMeta().get(Response.CONTENT_LANGUAGE);
    }

Try also using Luke to check all the metadata in your index.

From: da...@jashi.ge
Date: Wed, 30 Sep 2009 17:22:26 +0400
Subject: Re: Multilanguage support in Nutch 1.0
To: nutch-user@lucene.apache.org

> Ooops. It IS activated:
> 2009-09-29 16:39:15,671 INFO plugin.PluginRepository - Language Identification Parser/Filter (language-identifier)
> But fetched pages are not passed to it, as I reckon.
Re: R: Using Nutch for only retrieving HTML
Thanks Magnús and Susam for your responses and for pointing me in the right direction. I think I will spend some time over the next few weeks trying out Nutch. I only need the HTML - I don't care whether it is in a database or in separate files.

Thanks guys,
O.O.

--- On Wed, 30/9/09, Magnús Skúlason magg...@gmail.com wrote:
> Actually it's quite easy to modify the parse-html filter to do this, i.e. to save the HTML to a file or to some database... If you can get by with wget -r, then Nutch is probably overkill. ...
RE: R: Using Nutch for only retrieving HTML
hi, maybe you can run a crawl (don't forget to filter the pages so you keep just html or htm files; you do that in conf/crawl-urlfilter.txt). After that, go to the hadoop.log file and grep for the phrase 'fetcher.Fetcher - fetching http' to get all the fetched URLs:

    grep 'fetcher.Fetcher - fetching http' hadoop.log

Don't forget to sort the result and deduplicate it (sort -u, or sort | uniq), because sometimes the crawler tries to fetch pages several times when they don't answer the first time. When you have all your URLs, you can run wget on the file and archive the downloaded pages.

Hope it could help.

Date: Wed, 30 Sep 2009 20:46:50 +0000
From: olson_...@yahoo.it
Subject: Re: R: Using Nutch for only retrieving HTML
To: nutch-user@lucene.apache.org

> Thanks Magnús and Susam for your responses and for pointing me in the right direction. I think I will spend some time over the next few weeks trying out Nutch. I only need the HTML - I don't care whether it is in a database or in separate files. ...
RE: R: Using Nutch for only retrieving HTML
me again, i forgot to tell you the easiest way... once the crawl is finished you can dump the whole db (it contains all the links to your html pages) into a text file:

    ./bin/nutch readdb crawl_folder/crawldb/ -dump DBtextFile

and you can then run wget over this dump and archive the files.

From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: RE: R: Using Nutch for only retrieving HTML
Date: Wed, 30 Sep 2009 21:04:03 +0000

> hi, maybe you can run a crawl (don't forget to filter the pages so you keep just html or htm files; you do that in conf/crawl-urlfilter.txt). After that, go to the hadoop.log file and grep for 'fetcher.Fetcher - fetching http' to get all the fetched URLs... ...
Re: R: Using Nutch for only retrieving HTML
BELLINI ADAM wrote:
> me again, i forgot to tell you the easiest way... once the crawl is finished you can dump the whole db (it contains all the links to your html pages) into a text file: ./bin/nutch readdb crawl_folder/crawldb/ -dump DBtextFile - and you can then run wget over this dump and archive the files.

I'd argue with this advice. The goal here is to obtain the HTML pages. If you have crawled them, then why fetch them again? You already have their content locally. However, page content is NOT stored in the crawldb; it's stored in the segments. So you need to dump the content from the segments, not the content of the crawldb. The command 'bin/nutch readseg -dump segmentName output' should do the trick.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
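[As an alternative to dumping with readseg, and in the spirit of Susam's earlier remark that the fetched HTML can be read back through the Nutch API, here is a rough sketch that walks a segment's content directory directly. The class is invented for illustration, and the single part-00000 reducer output is an assumption (a distributed crawl produces several part-NNNNN directories); treat it as a starting point, not tested code:]

    // Hypothetical example - reads the fetched Content records straight
    // out of a segment, instead of re-fetching the pages with wget.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.protocol.Content;
    import org.apache.nutch.util.NutchConfiguration;

    public class SegmentContentReader {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        // A segment's fetched content lives in a MapFile under
        // <segment>/content/part-00000; its "data" file is a
        // SequenceFile of <Text url, Content content> pairs.
        Path data = new Path(args[0], Content.DIR_NAME + "/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();
        Content content = new Content();
        while (reader.next(url, content)) {
          byte[] html = content.getContent();  // raw fetched bytes
          System.out.println(url + " : " + html.length + " bytes");
          // ...write html to a file of your choosing here...
        }
        reader.close();
      }
    }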
Re: graphical user interface v0.2 for nutch
There is a Nutch developer in my neighborhood. Yes sir. So let's stay in touch.

Mario

2009/9/24, Marko Bauhardt m...@101tec.com:
> Hi list. We have pushed the second Nutch GUI release, version 0.2. You can download the binary or the sources at http://github.com/101tec/nutch/downloads
>
> Two main features are implemented in this version:
> + Security. You can start the admin GUI with a login feature; usernames and passwords can be configured in a separate file (see http://wiki.github.com/101tec/nutch/security).
> + If you push a newly crawled index to search, the searcher will reload the index automatically.
>
> marko
>
> ~~~
> 101tec GmbH Halle (Saale), Saxony-Anhalt, Germany
> http://www.101tec.com

--
Sent from my mobile device
http://www.ironschroedi.com/de/