Re: Usage of nutch:
Hi Julien, any update on the MongoDB plugin for Nutch? Using https://github.com/ctjmorgan/nutch-mongodb-indexer is a problem for me, as I don't know how to create a new package and I can't find the ivy folders. It's way too complex for a non-Java developer. Currently I have installed Nutch 1.6 on my Windows machine and I need to integrate it with MongoDB.

Julien Nioche-4 wrote:
On 16 November 2011 20:27, ctjmorgan <cmorgan@...> wrote:
Recently went through a similar situation. Check out the two projects below that I posted up to GitHub. Hope they help...
https://github.com/ctjmorgan/nutch-mongdb-parser
https://github.com/ctjmorgan/nutch-mongodb-indexer
The first allows you to prepare a set of URLs contained in MongoDB for Nutch to crawl.

Not sure "parser" is the right name for it; it sounds more like a variant of the injector (haven't looked at the code, though).

The second indexes the information from Nutch into MongoDB in the same way the SolrIndexer works.

There are plans for a pluggable indexing backend so that we can send the documents to [SOLR|ElasticSearch|...]. This would allow us to expose the MongoDB indexer as a plugin instead of piggybacking on the SOLR code. Thanks for sharing these links; it's always interesting to know what people do with/around Nutch.

Julien
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

--
View this message in context: http://lucene.472066.n3.nabble.com/Usage-of-nutch-tp1894986p3513843.html
Sent from the Nutch - User mailing list archive at Nabble.com.

--
View this message in context: http://lucene.472066.n3.nabble.com/Usage-of-nutch-tp1894986p4036407.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Installation of NUTCH on windows7
Hi,
Changing the Hadoop jar file to a lower version solved the issue. I removed hadoop-core-1.0.3.jar from the lib folder and replaced it with the hadoop-core-0.20.2.jar file.

Sebastian Nagel wrote:
Hi, that's a known problem with Hadoop on Windows / Cygwin: https://issues.apache.org/jira/browse/HADOOP-7682
I don't know whether there is a reliable fix or a workaround, but you should search for the error - you are not alone ;-)
Sebastian

On 01/25/2013 12:49 PM, Revathi R wrote:
Hello, I am trying to install Nutch on Windows 7 and I got an error like this:

D:\Nutch-1\apache-nutch-1.6-bin\NUTCH TEST\Nutch 1.6\win32\bin> nutch crawl D:\Nutch-1\apache-nutch-1.6-bin\NUTCH TEST\URLs -dir D:\Nutch-1\apache-nutch-1.6-bin\NUTCH TEST\Nutch 1.6\win32\bin
File Not Found
The system cannot find the file specified.
solrUrl is not set, indexing will be skipped...
crawl started in: D:/Nutch-1/apache-nutch-1.6-bin/NUTCH TEST/Nutch 1.6/win32/bin
rootUrlDir = D:/Nutch-1/apache-nutch-1.6-bin/NUTCH TEST/URLs
threads = 10
depth = 5
solrUrl=null
Injector: starting at 2013-01-25 15:46:14
Injector: crawlDb: D:/Nutch-1/apache-nutch-1.6-bin/NUTCH TEST/Nutch 1.6/win32/bin/crawldb
Injector: urlDir: D:/Nutch-1/apache-nutch-1.6-bin/NUTCH TEST/URLs
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-revathi_ramanadham\mapred\staging\revathi_ramanadham818841982\.staging to 0700
    at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689)
    at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
    at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
    at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

Regards,
Revathi R.

--
View this message in context: http://lucene.472066.n3.nabble.com/Installation-of-NUTCH-on-windows7-tp4036210.html
Sent from the Nutch - User mailing list archive at Nabble.com.

--
View this message in context: http://lucene.472066.n3.nabble.com/Installation-of-NUTCH-on-windows7-tp4036210p4036404.html
Sent from the Nutch - User mailing list archive at Nabble.com.
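For anyone wanting to try the same jar swap, the file operations look roughly like this. The sketch below runs against throwaway directories so it is safe to try anywhere; on a real install, LIB would be the lib folder of your apache-nutch-1.6 directory, and you would first download hadoop-core-0.20.2.jar yourself (e.g. from a Maven repository) - the temp dirs and touch calls here are only stand-ins.

```shell
# Stand-in directories so this sketch runs without a Nutch install.
LIB=$(mktemp -d)   # plays the role of apache-nutch-1.6/lib
DL=$(mktemp -d)    # plays the role of your download folder
touch "$LIB/hadoop-core-1.0.3.jar"   # the jar shipped with Nutch 1.6
touch "$DL/hadoop-core-0.20.2.jar"   # the older jar that avoids the 0700 error

# The swap itself: remove the 1.0.x jar, drop in the 0.20.2 one.
rm "$LIB/hadoop-core-1.0.3.jar"
cp "$DL/hadoop-core-0.20.2.jar" "$LIB/"
ls "$LIB"
```

After the swap, rerun the crawl command; the setPermission code path that fails on Windows (HADOOP-7682) is not present in 0.20.2.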
Re: bin/nutch
I get a similar error for Nutch 2.1; how do I fix it?

Buildfile: C:\apache-nutch-2.1\build.xml
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-probe-antlib:
ivy-download:
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-download-unchecked:
ivy-init-antlib:
ivy-init:
init:
clean-lib:
[delete] Deleting directory C:\apache-nutch-2.1\build\lib
resolve-default:
[ivy:resolve] :: Ivy 2.2.0 - 20100923230623 :: http://ant.apache.org/ivy/ ::
[ivy:resolve] :: loading settings :: file = C:\apache-nutch-2.1\ivy\ivysettings.xml
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
copy-libs:
compile-core:
[javac] C:\apache-nutch-2.1\build.xml:97: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 181 source files to C:\apache-nutch-2.1\build\classes
[javac] warning: [path] bad path element C:\apache-nutch-2.1\build\lib\activation.jar: no such file or directory
[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\APIInfoResource.java:23: error: package org.restlet.resource does not exist
[javac] import org.restlet.resource.Get;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\APIInfoResource.java:24: error: package org.restlet.resource does not exist
[javac] import org.restlet.resource.ServerResource;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\APIInfoResource.java:26: error: cannot find symbol
[javac] public class APIInfoResource extends ServerResource {
[javac]                                      ^
[javac]   symbol: class ServerResource
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\AdminResource.java:23: error: package org.restlet.resource does not exist
[javac] import org.restlet.resource.Get;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\AdminResource.java:24: error: package org.restlet.resource does not exist
[javac] import org.restlet.resource.ServerResource;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\AdminResource.java:28: error: cannot find symbol
[javac] public class AdminResource extends ServerResource {
[javac]                                    ^
[javac]   symbol: class ServerResource
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\ConfResource.java:22: error: package org.restlet.data does not exist
[javac] import org.restlet.data.Form;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\ConfResource.java:23: error: package org.restlet.resource does not exist
[javac] import org.restlet.resource.Delete;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\ConfResource.java:24: error: package org.restlet.resource does not exist
[javac] import org.restlet.resource.Get;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\ConfResource.java:25: error: package org.restlet.resource does not exist
[javac] import org.restlet.resource.Post;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\ConfResource.java:26: error: package org.restlet.resource does not exist
[javac] import org.restlet.resource.Put;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\ConfResource.java:27: error: package org.restlet.resource does not exist
[javac] import org.restlet.resource.ServerResource;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\ConfResource.java:29: error: cannot find symbol
[javac] public class ConfResource extends ServerResource {
[javac]                                   ^
[javac]   symbol: class ServerResource
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\DbReader.java:29: error: package org.apache.avro.util does not exist
[javac] import org.apache.avro.util.Utf8;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\DbReader.java:30: error: package org.apache.gora.query does not exist
[javac] import org.apache.gora.query.Query;
[javac]        ^
[javac] C:\apache-nutch-2.1\src\java\org\apache\nutch\api\DbReader.java:31: error: package org.apache.gora.query does not exist
[javac] import org.apache.gora.query.Result;
[javac]        ^
[javac]
Re: increase the number of fetches at a given time on nutch 1.6 or 2.1
I tried increasing the number of threads to 50 but the speed is not affected. I tried changing the partition.url.mode value to byDomain and fetcher.queue.mode to byDomain, but it still does not help the speed. It seems to get URLs from 2 domains now and the other domains are not getting crawled. Is this due to the URL score? If so, how do I crawl URLs from all the domains?

lewis john mcgibbney wrote:
Increase the number of threads when fetching. Also please see nutch-default.xml for partitioning of URLs; if you know your target domains you may wish to adapt the policy.
Lewis

On Sunday, January 27, 2013, peterbarretto <peterbarretto08@...> wrote:
I want to increase the number of URLs fetched at a time in Nutch. I have around 10 websites to crawl, so how can I crawl all the sites at a time? Right now I am fetching 1 site with a fetch delay of 2 seconds, but it is too slow. How do I concurrently fetch from different domains?

--
View this message in context: http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499.html
Sent from the Nutch - User mailing list archive at Nabble.com.

--
*Lewis*

--
View this message in context: http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4036630.html
Sent from the Nutch - User mailing list archive at Nabble.com.
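For reference, the settings mentioned above live in conf/nutch-site.xml (overriding nutch-default.xml). A sketch of the by-domain configuration under discussion - the property names are the ones named in this thread, and the thread count of 50 is just the value tried above, not a recommendation:

```xml
<!-- nutch-site.xml (sketch): partition and queue by domain instead of by host -->
<property>
  <name>partition.url.mode</name>
  <value>byDomain</value>
</property>
<property>
  <name>fetcher.queue.mode</name>
  <value>byDomain</value>
</property>
<!-- global fetcher thread count -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>50</value>
</property>
```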
Re: JAVA_HOME is not set
Tried escaping the whitespace but it still did not work, so I installed Java in another folder and now the installation works just fine.

Stefan Scheffler wrote:
Hi,
Try to escape the whitespace in "Program Files". I think it should look like "Program\ Files", but I am not sure.
Regards, Stefan

On 25.01.2013 19:51, Gora Mohanty wrote:
On 25 January 2013 16:05, peterbarretto <peterbarretto08@...> wrote:
I still get the below error after setting the JAVA_HOME variable:
<http://lucene.472066.n3.nabble.com/file/n4036204/nutch_java_home_error.png>

Not sure how much experience you have had with Unix-style shell quoting, but this would have been amenable to a simple Google search for "cygwin export variable space". Here is a helpful link: http://cygwin.com/ml/cygwin/2005-08/msg01278.html - but I do not have a Cygwin installation to actually try this out.
Regards, Gora

--
View this message in context: http://lucene.472066.n3.nabble.com/JAVA-HOME-is-not-set-tp617447p4036999.html
Sent from the Nutch - User mailing list archive at Nabble.com.
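For the record, the two quoting approaches that usually work under Cygwin are shown below: quote the whole value, or use the 8.3 short name for "Program Files", which contains no space at all. The JDK path is an assumed stock install location - adjust it to your own.

```shell
# Option 1: quote the value so the space survives.
export JAVA_HOME="/cygdrive/c/Program Files/Java/jdk1.6.0"
echo "$JAVA_HOME"   # expand it quoted everywhere you use it, too

# Option 2: the 8.3 short name for "Program Files" avoids the space entirely.
export JAVA_HOME="/cygdrive/c/PROGRA~1/Java/jdk1.6.0"
echo "$JAVA_HOME"
```

Reinstalling Java into a path without spaces, as done above, sidesteps the problem the same way option 2 does.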
Re: increase the number of fetches at a given time on nutch 1.6 or 2.1
Hi Tejas,
I changed generate.count.mode to domain and generate.max.count to 100, but it still shows the queue mode as byHost and not byDomain.

peterbarretto wrote:
Hi Tejas,
The fetcher.threads.per.host property has been deprecated and replaced with fetcher.threads.per.queue. I am not sure if fetcher.threads.per.queue will help the fetching, as the generator only generates the fetchlist from 2 or 3 domains. How can I tell the generator to create a fetchlist with an equal number of URLs from all domains? I am sure there are URLs from the other domains, but I guess since their URL score is lower it fetches from only 2 domains. I will try increasing fetcher.threads.per.queue to 5, see if the fetch speed is increased, and let you know.

Tejas Patil wrote:
Hey Peter,
I am guessing that you have just increased the global thread count. Have you also increased fetcher.threads.per.host? This will improve the crawl rate, as multiple threads can hit the same site. Don't make it too high or else the system will get overloaded. The Nutch wiki has an article [0] about the potential reasons for slow crawls and some good suggestions.
[0]: https://wiki.apache.org/nutch/OptimizingCrawls
Thanks, Tejas Patil

On Sun, Jan 27, 2013 at 8:08 PM, peterbarretto <peterbarretto08@...> wrote:
I tried increasing the number of threads to 50 but the speed is not affected. [...]

--
View this message in context: http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4036976.html
Sent from the Nutch - User mailing list archive at Nabble.com.
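For reference, the generator settings being discussed would go into conf/nutch-site.xml like this. These property names are the ones from the thread; 100 is the value Peter tried, and Tejas later suggests -1 (unlimited) for generate.max.count:

```xml
<!-- group the fetchlist by domain rather than by host -->
<property>
  <name>generate.count.mode</name>
  <value>domain</value>
</property>
<!-- cap on URLs per domain in one fetchlist -->
<property>
  <name>generate.max.count</name>
  <value>100</value>
</property>
```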
Re: How to get page content of crawled pages
I have tried the repo https://github.com/ctjmorgan/nutch-mongodb-indexer and it does not work. I guess this is because it is stated to be for Nutch 1.3 and I am using 1.6. I get the below output when I try to rebuild:

Buildfile: C:\nutch-16\build.xml
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-probe-antlib:
ivy-download:
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-download-unchecked:
ivy-init-antlib:
ivy-init:
init:
clean-lib:
[delete] Deleting directory C:\nutch-16\build\lib
resolve-default:
[ivy:resolve] :: Ivy 2.2.0 - 20100923230623 :: http://ant.apache.org/ivy/ ::
[ivy:resolve] :: loading settings :: file = C:\nutch-16\ivy\ivysettings.xml
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
copy-libs:
compile-core:
[javac] C:\nutch-16\build.xml:96: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 1 source file to C:\nutch-16\build\classes
[javac] warning: [path] bad path element C:\nutch-16\build\lib\activation.jar: no such file or directory
[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:7: warning: [deprecation] JobConf in org.apache.hadoop.mapred has been deprecated
[javac] import org.apache.hadoop.mapred.JobConf;
[javac]        ^
[javac] C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18: error: MongodbWriter is not abstract and does not override abstract method delete(String) in NutchIndexWriter
[javac] public class MongodbWriter implements NutchIndexWriter{
[javac]        ^
[javac] C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:23: warning: [deprecation] JobConf in org.apache.hadoop.mapred has been deprecated
[javac] public void open(JobConf job, String name) throws IOException {
[javac]                  ^
[javac] 1 error
[javac] 4 warnings

I have already crawled some URLs now and I need to move those to MongoDB. Is there an easy-to-use code to do that? I am new to Java, so I will require all the steps of how to add the code.

Jorge Luis Betancourt Gonzalez wrote:
I suppose you can write a custom indexer to store the data in MongoDB instead of Solr; I think there is an open repo on GitHub about this.

- Original message -
From: peterbarretto <peterbarretto08@...>
To: user@.apache
Sent: Tuesday, 29 January 2013 8:46:04
Subject: Re: How to get page content of crawled pages

Hi,
Is there a way I can dump the URL and URL content into MongoDB?

Klemens Muthmann wrote:
Hi,
Super, that works. Thank you. I thereby also found the class that shows how to achieve this within Java code, which is org.apache.nutch.segment.SegmentReader. Thanks again and bye,
Klemens

On 22.11.2010 10:49, Hannes Carl Meyer wrote:
Hi Klemens,
You should run ./bin/nutch readseg! For example:
./bin/nutch readseg -dump crawl/segments/XXX/ dump_folder -nofetch -nogenerate -noparse -noparsedata -noparsetext
Kind regards from Hannover,
Hannes

On Mon, Nov 22, 2010 at 9:23 AM, Klemens Muthmann <klemens.muthmann@...> wrote:
Hi,
I did a small crawl of some pages on the web and want to get the raw HTML content of these pages now. Reading the documentation in the wiki, I guess this content might be somewhere under crawl/segments/20101122071139/content/part-0. I also guess I can access this content using the Hadoop API as described here: http://wiki.apache.org/nutch/Getting_Started
However, I have absolutely no idea how to configure:
MapFile.Reader reader = new MapFile.Reader(fs, seqFile, conf);
The Hadoop documentation is not very helpful either. May someone please point me in the right direction to get the page content?

Thank you and regards,
Klemens Muthmann
--
Dipl.-Medieninf. Klemens Muthmann
Wissenschaftlicher Mitarbeiter
Technische Universität Dresden
Fakultät Informatik
Institut für Systemarchitektur
Lehrstuhl Rechnernetze
01062 Dresden
Tel.: +49 (351) 463-38214
Fax: +49 (351) 463-38251
E-Mail: klemens.muthmann@...

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037023.html
Sent from the Nutch - User mailing list archive at Nabble.com.

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037283.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: increase the number of fetches at a given time on nutch 1.6 or 2.1
08:49:34,476 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2013-01-29 08:49:34,476 INFO plugin.PluginRepository - Registered Extension-Points:
2013-01-29 08:49:34,476 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2013-01-29 08:49:34,477 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2013-01-29 08:49:34,477 INFO plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2013-01-29 08:49:34,477 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2013-01-29 08:49:34,477 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2013-01-29 08:49:34,477 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2013-01-29 08:49:34,477 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2013-01-29 08:49:34,477 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2013-01-29 08:49:34,546 INFO fetcher.Fetcher - Using queue mode : byHost
2013-01-29 08:49:34,548 INFO fetcher.Fetcher - Using queue mode : byHost
2013-01-29 08:49:34,548 INFO fetcher.Fetcher - fetching http://www.example.com
2013-01-29 08:49:34,549 INFO fetcher.Fetcher - Using queue mode : byHost

Tejas Patil wrote:
Hey Peter, give a bigger value for the topN parameter. Also, use:

<property>
  <name>generate.max.count</name>
  <value>-1</value>
</property>
<property>
  <name>generate.count.mode</name>
  <value>domain</value>
</property>

Not sure why you see the queue mode as byHost and not byDomain. Did it print that in the logs? I should have asked you this before: are you using Nutch 1.x or 2.x?
thanks, Tejas Patil

On Tue, Jan 29, 2013 at 12:08 AM, peterbarretto <peterbarretto08@...> wrote:
Hi Tejas,
I changed generate.count.mode to domain and generate.max.count to 100, but it still shows the queue mode as byHost and not byDomain. [...]

--
View this message in context: http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2
Re: increase the number of fetches at a given time on nutch 1.6 or 2.1
Hi Lewis,
Regarding "You are not getting very many URLs!" - should I increase fetcher.server.delay from 2 to 5 seconds? I did not quite get what you meant by that. I want roughly an equal number of URLs in the fetchlist from every domain, so that I can fetch more URLs at a time.

lewis john mcgibbney wrote:
You are not getting very many URLs!

On Tue, Jan 29, 2013 at 8:29 PM, peterbarretto <peterbarretto08@...> wrote:
2013-01-29 08:44:35,014 INFO crawl.CrawlDbReader - TOTAL urls: 96404
2013-01-29 08:44:35,018 INFO crawl.CrawlDbReader - status 1 (db_unfetched): 85672

--
View this message in context: http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4037612.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: How to get page content of crawled pages
Hi Lewis,
I am new to Java and I don't know how to inherit all the public methods from NutchIndexWriter. Can you help me with that? Then I can rebuild and check if it works.

lewis john mcgibbney wrote:
As you will see, the code has not been amended in a year or so. The positive side is that you only seem to be getting one issue with javac.

On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <peterbarretto08@...> wrote:
C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18: error: MongodbWriter is not abstract and does not override abstract method delete(String) in NutchIndexWriter
[javac] public class MongodbWriter implements NutchIndexWriter{

Sort this error out by inheriting all public methods from NutchIndexWriter for starters. I take it you are not developing from within Eclipse? As this would have been flagged up immediately. This should at least enable you to compile the code.

"I have already crawled some URLs now and I need to move those to MongoDB. Is there an easy-to-use code to do that?"

Not apart from hacking the code as you are already doing. The code you are pulling is not part of the official Nutch codebase and, to be honest, a few of us didn't even know about it until you brought it to our attention :0) There is no silver bullet here; just take your time and we will get it working.
Lewis

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037621.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: increase the number of fetches at a given time on nutch 1.6 or 2.1
Hi Tejas,
I am currently running Nutch 1.6 on Windows 7 (Pentium dual core 2.8 GHz, 2 GB RAM). I will be using Amazon EC2 servers later for crawling. What was your hardware when you ran 4 million URLs with 80 GB of data? Will Nutch 2.1 give a faster crawl speed than 1.6?

Tejas Patil wrote:
I had run crawls with topN as large as 4 million while having a crawldb of ~80 GB. It worked fine without any such issue. Maybe the hardware / cluster you have is not capable of handling a load above 500. Note that if topN is low, then no matter how many fetcher threads you create, you won't be able to increase the number of crawls. Also, as there is a considerable amount of time spent in the generate and update phases, the overall crawl rate will be low. If you are planning to use the same machine, you will have to work with lower values (and thus expect a lower crawl rate).
thanks, Tejas Patil

On Wed, Jan 30, 2013 at 8:06 PM, Lewis John Mcgibbney <lewis.mcgibbney@...> wrote:
You are not getting very many URLs!

On Tue, Jan 29, 2013 at 8:29 PM, peterbarretto <peterbarretto08@...> wrote:
2013-01-29 08:44:35,014 INFO crawl.CrawlDbReader - TOTAL urls: 96404
2013-01-29 08:44:35,018 INFO crawl.CrawlDbReader - status 1 (db_unfetched): 85672

--
View this message in context: http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4037637.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: How to get page content of crawled pages
Hi Lewis,
I managed to get the code working by adding the below function to MongodbWriter.java, inside the public class MongodbWriter implements NutchIndexWriter:

public void delete(String key) throws IOException {
    return;
}

And the crawled data is getting stored in MongoDB. The only issue is that it stores only the text of the page and not the full HTML content. How do I store the full HTML content of the page as well? Hope to see the patches soon.
Thanks

lewis john mcgibbney wrote:
Certainly. I am currently reviewing the code and will hopefully have patches for Nutch trunk cooked up for tomorrow. I'll update this thread likewise.
Thanks, Lewis

On Wed, Jan 30, 2013 at 10:02 PM, peterbarretto <peterbarretto08@...> wrote:
Hi Lewis,
I am new to Java and I don't know how to inherit all the public methods from NutchIndexWriter. Can you help me with that? Then I can rebuild and check if it works. [...]

--
Lewis

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4039401.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: How to get page content of crawled pages
Hi Lewis, I downloaded the Nutch copy from http://apache.techartifact.com/mirror/nutch/1.6/

lewis john mcgibbney wrote: Hi, Once I get access to my office I am going to build the patches from trunk. Is it trunk that you are using? Thanks Lewis

On Fri, Feb 8, 2013 at 9:00 PM, peterbarretto <peterbarretto08@> wrote: Hi Lewis, I managed to get the code working by adding the function below to MongodbWriter.java, inside the public class MongodbWriter implements NutchIndexWriter:

public void delete(String key) throws IOException { return; }

With that, the crawled data was stored in MongoDB. The only issue is that it stores only the text of the page, not the full HTML content. How do I also store the full HTML content of the page? Hope to see the patches soon. Thanks

lewis john mcgibbney wrote: Certainly. I am currently reviewing the code and will hopefully have patches for Nutch trunk cooked up for tomorrow. I'll update this thread likewise. Thanks Lewis

On Wed, Jan 30, 2013 at 10:02 PM, peterbarretto <peterbarretto08@> wrote: Hi Lewis, I am new to Java and I don't know how to inherit all public methods from NutchIndexWriter. Can you help me with that? Then I can rebuild and check if it works.

lewis john mcgibbney wrote: As you will see, the code has not been amended in a year or so. The positive side is that you only seem to be getting one issue with javac.

On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <peterbarretto08@> wrote:

C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18: error: MongodbWriter is not abstract and does not override abstract method delete(String) in NutchIndexWriter
[javac] public class MongodbWriter implements NutchIndexWriter{

Sort this error out by inheriting all public methods from NutchIndexWriter for starters. I take it you are not developing from within Eclipse? That would have flagged this up immediately. This should at least enable you to compile the code.

I have already crawled some URLs now and I need to move those to MongoDB. Is there easy-to-use code to do that? Not apart from hacking the code as you are already doing. The code you are pulling is not part of the official Nutch codebase and, to be honest, a few of us didn't even know about it until you brought it to our attention :0) There is no silver bullet here; just take your time and we will get it working. Lewis

-- View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4039613.html Sent from the Nutch - User mailing list archive at Nabble.com.
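The javac error quoted above is the standard "is not abstract and does not override abstract method" failure: a class implementing an interface must supply a body for every abstract method, and adding even a no-op delete() makes it compile. A minimal sketch of that fix, using a hypothetical stand-in interface (the real org.apache.nutch.indexer.NutchIndexWriter in Nutch 1.x declares other methods not shown here, and a real writer would talk to a MongoDB collection):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Nutch's NutchIndexWriter interface,
// reduced to two methods purely to illustrate the compile error.
interface IndexWriterSketch {
    void write(String key, String doc) throws IOException;
    void delete(String key) throws IOException;
}

// Without delete(), javac reports: "MongodbWriterSketch is not abstract
// and does not override abstract method delete(String)". Giving every
// abstract method a body -- even a no-op, as was done in
// MongodbWriter.java -- makes the class concrete and compilable.
class MongodbWriterSketch implements IndexWriterSketch {
    // Records operations so this sketch's behaviour is observable;
    // a real implementation would insert/remove MongoDB documents.
    final List<String> ops = new ArrayList<>();

    @Override
    public void write(String key, String doc) throws IOException {
        ops.add("write:" + key);
    }

    @Override
    public void delete(String key) throws IOException {
        ops.add("delete:" + key);
    }

    public static void main(String[] args) throws IOException {
        MongodbWriterSketch w = new MongodbWriterSketch();
        w.write("http://example.com/", "page text");
        w.delete("http://example.com/");
        System.out.println(w.ops);
    }
}
```

The same shape applies to the actual class: keep the existing write logic and add an override for each remaining abstract method of the interface, even if the body does nothing.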
Re: How to get page content of crawled pages
Hi Lewis, is this patch done?

lewis john mcgibbney wrote: Hi, Once I get access to my office I am going to build the patches from trunk. Is it trunk that you are using? Thanks Lewis

[earlier quoted messages snipped]

-- View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4040596.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: How to get page content of crawled pages
Thanks for the patch, Lewis. Where do I make the pom.xml changes? I can't find the file. Also, in 1.6, the command below returns the HTML content:

./bin/nutch readseg -dump crawl/segments/20090903121951 toto -nofetch -nogenerate -noparse -noparsedata -noparsetext

I haven't built the patch changes as I can't find the pom.xml file.

lewis john mcgibbney wrote: https://issues.apache.org/jira/browse/NUTCH-1528 This is the MongoDB indexer patch ported to trunk. Can I mention that there is usually no timeline on these things, e.g. feature requests. I'm sure you can appreciate that we are all extremely busy at work with an array of other things, so if it takes a bit of time, that's OK. The world goes on and keeps spinning, even if we are getting bombarded by meteorites in Russia!!! Please check out the patch and comment accordingly. Regarding your issue with the full page content, I am not sure this is currently available in Nutch trunk without you writing some code. Full HTML markup is certainly stored in 2.x... but I don't know whether you are prepared to move to 2.x for your operations? hth Lewis

On Fri, Feb 15, 2013 at 1:58 AM, peterbarretto <peterbarretto08@> wrote: Hi Lewis, is this patch done?

[earlier quoted messages snipped]

-- View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4040944.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: How to get page content of crawled pages
Hi Lewis, I have never used a patch before, but after searching a bit I managed to apply the patch in Cygwin. (I had to reinstall Cygwin with the patch tool, as the patch command was not present in the previous install.) I applied the patch, skipping the pom.xml file, and it worked: I can copy all the crawled URLs to MongoDB. I can get the HTML content of crawled URLs from the readseg -dump command in Nutch 1.6, so I guess it will be possible to get the full HTML along with just the text part?

lewis john mcgibbney wrote: Hi Peter

On Saturday, February 16, 2013, peterbarretto <peterbarretto08@gmail.> wrote: Where do I make the pom.xml changes? I can't find the file.

What are you talking about? I made a patch which pulls everything for you. There should be no changes required.

I haven't built the patch changes as I can't find the pom.xml file.

The Maven project file is in the root of the project. We do not build Nutch with Maven; currently for development we use Ant tasks and Ivy for dependencies.

[earlier quoted messages snipped]

-- View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4041066.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: How to get page content of crawled pages
Hi Lewis, I tried applying the patch on 2.1 but it gives the error below:

patching file pom.xml
patching file ivy/ivy.xml
Hunk #1 succeeded at 34 with fuzz 2 (offset 4 lines).
patching file src/bin/nutch
Hunk #1 FAILED at 61.
Hunk #2 succeeded at 220 with fuzz 2 (offset 2 lines).
1 out of 2 hunks FAILED -- saving rejects to file src/bin/nutch.rej
patching file src/java/org/apache/nutch/indexer/mongodb/MongoDbWriter.java
patching file src/java/org/apache/nutch/indexer/mongodb/MongoDbConstants.java
patching file src/java/org/apache/nutch/indexer/mongodb/MongoDbIndexer.java

-- View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4053146.html Sent from the Nutch - User mailing list archive at Nabble.com.