Re: Usage of Nutch

2013-01-25 Thread peterbarretto
Hi Julien, Any update on the MongoDB plugin for Nutch? Using https://github.com/ctjmorgan/nutch-mongodb-indexer is a problem for me as I don't know how to create a new package and I can't find the ivy folders. It's way too complex for a non-Java developer. Currently I have installed Nutch 1.6 on my

Re: Installation of Nutch on Windows 7

2013-01-25 Thread peterbarretto
Hi, Changing the Hadoop jar file to a lower version solved the issue. I removed hadoop-core-1.0.3.jar from the lib folder and replaced it with the hadoop-core-0.20.2.jar file. Sebastian Nagel wrote: Hi, that's a known problem with Hadoop on Windows / Cygwin:
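The swap described above can be sketched as a couple of shell commands. This is a hedged illustration, not an official procedure: `NUTCH_HOME` and the jar filenames are taken from the message, and the `touch` lines stand in for the real jar files so the sketch is self-contained.

```shell
# Demo of replacing the Hadoop core jar in a Nutch 1.x lib folder.
# NUTCH_HOME falls back to a throwaway temp dir so this runs anywhere.
NUTCH_HOME="${NUTCH_HOME:-$(mktemp -d)}"
mkdir -p "$NUTCH_HOME/lib"
touch "$NUTCH_HOME/lib/hadoop-core-1.0.3.jar"   # stand-in for the shipped jar

# Remove the newer Hadoop core jar...
rm -f "$NUTCH_HOME/lib/hadoop-core-1.0.3.jar"
# ...and drop in the older one (in reality, copied from a Hadoop 0.20.2 download).
touch "$NUTCH_HOME/lib/hadoop-core-0.20.2.jar"  # stand-in for the copied jar

ls "$NUTCH_HOME/lib"
```

In a real install the `touch` lines would instead be a `cp` from an extracted hadoop-0.20.2 distribution.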

Re: bin/nutch

2013-01-26 Thread peterbarretto
I get a similar error for Nutch 2.1; how do I fix it? Buildfile: C:\apache-nutch-2.1\build.xml [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found. ivy-probe-antlib: ivy-download: [taskdef] Could not load definitions from resource
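The `[taskdef]` lines above are Ant reporting that it cannot find optional task definitions (first the Sonar antlib, then Ivy). As a hedged sketch of one way to keep such a missing antlib from being treated as fatal, assuming the taskdef sits in `build.xml` roughly as it does in stock Nutch 2.1 (the `onerror` attribute is standard Ant; the exact line in your build file may differ):

```xml
<!-- Hypothetical build.xml fragment: load the Sonar Ant tasks if present,
     but only report (rather than fail) when the antlib jar is absent. -->
<taskdef resource="org/sonar/ant/antlib.xml" onerror="report"/>
```

The Ivy download failure is usually the one that actually blocks the build; that typically points to a network/proxy problem rather than the build file itself.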

Re: increase the number of fetches at a given time on nutch 1.6 or 2.1

2013-01-27 Thread peterbarretto
, January 27, 2013, peterbarretto <peterbarretto08@...> wrote: I want to increase the number of URLs fetched at a time in Nutch. I have around 10 websites to crawl, so how can I crawl all the sites at a time? Right now I am fetching 1 site with a fetch delay of 2 seconds but it is too slow

Re: JAVA_HOME is not set

2013-01-29 Thread peterbarretto
On 25.01.2013 19:51, Gora Mohanty wrote: On 25 January 2013 16:05, peterbarretto <peterbarretto08@...> wrote: I still get the below error after setting the JAVA_HOME variable <http://lucene.472066.n3.nabble.com/file/n4036204/nutch_java_home_error.png> Not sure of how much

Re: increase the number of fetches at a given time on nutch 1.6 or 2.1

2013-01-29 Thread peterbarretto
Hi Tejas, I changed generate.count.mode to domain and generate.max.count to 100, but it still shows the queue mode as byhost and not by domain. peterbarretto wrote: Hi Tejas, The fetcher.threads.per.host property has been deprecated and replaced with fetcher.threads.per.queue. I am not sure
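For reference, the properties named in this exchange would go in `nutch-site.xml`. This is a sketch based only on the property names mentioned above; the values are illustrative, and which settings actually control the fetcher's queue mode varies by Nutch version (check your version's `nutch-default.xml`):

```xml
<!-- Sketch of nutch-site.xml overrides discussed in this thread.
     Values are examples, not recommendations. -->
<property>
  <name>generate.count.mode</name>
  <value>domain</value>   <!-- count generated URLs per domain, not per host -->
</property>
<property>
  <name>generate.max.count</name>
  <value>100</value>      <!-- cap of URLs per domain in a generated segment -->
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>5</value>        <!-- replaces the deprecated fetcher.threads.per.host -->
</property>
```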

Re: How to get page content of crawled pages

2013-01-30 Thread peterbarretto
to add the code and all. Jorge Luis Betancourt Gonzalez wrote: I suppose you can write a custom indexer to store the data in MongoDB instead of Solr; I think there is an open repo on GitHub about this. - Original Message - From: peterbarretto <peterbarretto08@...> To: user

Re: increase the number of fetches at a given time on nutch 1.6 or 2.1

2013-01-30 Thread peterbarretto
/property Not sure why you see queue mode as byhost and not by domain. Did it print that in the logs? I should have asked you this before: are you using Nutch 1.x or 2.x? Thanks, Tejas Patil On Tue, Jan 29, 2013 at 12:08 AM, peterbarretto <peterbarretto08@...> wrote: Hi

Re: increase the number of fetches at a given time on nutch 1.6 or 2.1

2013-01-30 Thread peterbarretto
mcgibbney wrote: You are not getting very many URLs! On Tue, Jan 29, 2013 at 8:29 PM, peterbarretto <peterbarretto08@...> wrote: 2013-01-29 08:44:35,014 INFO crawl.CrawlDbReader - TOTAL urls: 96404 2013-01-29 08:44:35,018 INFO crawl.CrawlDbReader - status 1 (db_unfetched): 85672

Re: How to get page content of crawled pages

2013-02-01 Thread peterbarretto
seem to be getting one issue with javac On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <peterbarretto08@...> wrote: C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18: error: MongodbWriter is not abstract and does not override abstract method delete(String

Re: increase the number of fetches at a given time on nutch 1.6 or 2.1

2013-02-01 Thread peterbarretto
, 2013 at 8:06 PM, Lewis John Mcgibbney <lewis.mcgibbney@...> wrote: You are not getting very many URLs! On Tue, Jan 29, 2013 at 8:29 PM, peterbarretto <peterbarretto08@...> wrote: 2013-01-29 08:44:35,014 INFO crawl.CrawlDbReader - TOTAL urls: 96404 2013-01-29 08:44:35,018 INFO

Re: How to get page content of crawled pages

2013-02-08 Thread peterbarretto
and will hopefully have patches for Nutch trunk cooked up for tomorrow. I'll update this thread likewise. Thanks, Lewis On Wed, Jan 30, 2013 at 10:02 PM, peterbarretto <peterbarretto08@...> wrote: Hi Lewis, I am new to Java and I don't know how to inherit all public methods from NutchIndexWriter

Re: How to get page content of crawled pages

2013-02-10 Thread peterbarretto
, peterbarretto <peterbarretto08@...> wrote: Hi Lewis, I managed to get the code working by adding the below function to MongodbWriter.java in the public class MongodbWriter implements NutchIndexWriter: public void delete(String key) throws IOException { return
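The fix being described is a no-op `delete(String)` that satisfies the abstract method javac complained about earlier in the thread. A minimal, self-contained sketch of that pattern follows; note the `NutchIndexWriter` interface here is a one-method stand-in for illustration, not Nutch's real interface, and the real `MongodbWriter` would also carry the MongoDB write logic:

```java
import java.io.IOException;

// Stand-in for Nutch's NutchIndexWriter; only the method relevant
// to the compile error in this thread is shown.
interface NutchIndexWriter {
    void delete(String key) throws IOException;
}

// Sketch of the fix: the class must override delete(String), even as a
// no-op, or javac reports "MongodbWriter is not abstract and does not
// override abstract method delete(String)".
class MongodbWriter implements NutchIndexWriter {
    @Override
    public void delete(String key) throws IOException {
        // No-op: this writer does not remove documents from MongoDB.
    }

    public static void main(String[] args) throws IOException {
        new MongodbWriter().delete("http://example.com/");
        System.out.println("delete() is a no-op; compile error resolved");
    }
}
```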

Re: How to get page content of crawled pages

2013-02-15 Thread peterbarretto
Hi Lewis, Is this patch done? lewis john mcgibbney wrote: Hi, Once I get access to my office I am going to build the patches from trunk. Is it trunk that you are using? Thanks, Lewis On Fri, Feb 8, 2013 at 9:00 PM, peterbarretto <peterbarretto08@...> wrote: Hi Lewis, I managed

Re: How to get page content of crawled pages

2013-02-17 Thread peterbarretto
, peterbarretto <peterbarretto08@...> wrote: Hi Lewis, Is this patch done? lewis john mcgibbney wrote: Hi, Once I get access to my office I am going to build the patches from trunk. Is it trunk that you are using? Thanks, Lewis On Fri, Feb 8, 2013 at 9:00 PM, peterbarretto

Re: How to get page content of crawled pages

2013-02-18 Thread peterbarretto
the crawled URLs to MongoDB. I can get the HTML content of crawled URLs from the readseg -dump command in Nutch 1.6, so I guess it will be possible to get the full HTML along with just the text part? lewis john mcgibbney wrote: Hi Peter On Saturday, February 16, 2013, peterbarretto <

Re: How to get page content of crawled pages

2013-04-02 Thread peterbarretto
Hi Lewis, I tried applying the patch on 2.1 but it gives the below error: patching file pom.xml patching file ivy/ivy.xml Hunk #1 succeeded at 34 with fuzz 2 (offset 4 lines). patching file src/bin/nutch Hunk #1 FAILED at 61. Hunk #2 succeeded at 220 with fuzz 2 (offset 2 lines). 1 out of 2 hunks
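The "fuzz" and "offset" messages mean the patch was made against a different source tree (here, likely trunk rather than 2.1), and the FAILED hunk means one change could not be placed at all. A hedged sketch of how `patch --dry-run` previews this without touching the tree; the file and hunk below are made up for the demo, and a real run would use the Nutch 2.1 source root and whatever `-p` strip level the patch was created with:

```shell
# Self-contained demo: create a file and a matching unified diff,
# preview the patch with --dry-run, then actually apply it.
workdir=$(mktemp -d); cd "$workdir"
printf 'line1\nline2\n' > demo.txt
cat > demo.patch <<'EOF'
--- demo.txt
+++ demo.txt
@@ -1,2 +1,2 @@
 line1
-line2
+line2 patched
EOF

patch -p0 --dry-run < demo.patch   # preview only; demo.txt is untouched
patch -p0 < demo.patch             # actually apply
cat demo.txt                       # now contains "line2 patched"
```

When hunks fail against a release, the usual options are to regenerate the patch against that release's sources or to apply the failed hunk by hand using the `.rej` file patch leaves behind.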