Re: readseg bug?

2007-05-17 Thread Florent Gluck
Thank you for the explanation. It was a bit confusing at first, but it actually makes sense. Florent Doğacan Güney wrote: Hi, On 5/17/07, Florent Gluck [EMAIL PROTECTED] wrote: Hi all, I've noticed that when doing a segment dump using readseg, several instances of the same CrawlDatum can

Re: Buggy fetchlist urls

2006-03-14 Thread Florent Gluck
Hi Andrzej, Well, I think for now I'll just disable the parse-js plugin since I don't really need it anyway. I'll let you know if I ever work on it (I may need it in the future). Thanks, --Flo Andrzej Bialecki wrote: Florent Gluck wrote: Some urls are totally bogus. I didn't investigate

Buggy fetchlist urls

2006-03-13 Thread Florent Gluck
Hi, I'm using nutch revision 385671 from the trunk. I'm running it on a single machine using the local filesystem. I just started with a seed of a single url: http://www.osnews.com Then I ran a crawl cycle of depth 2 (generate/fetch/updatedb) and dumped the crawl db. Here is where I got quite
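The depth-2 cycle described above can be sketched as a shell loop, assuming the 0.8-era `bin/nutch` subcommands (`inject`, `generate`, `fetch`, `updatedb`, `readdb`); the `crawldb`, `segments`, and `urls` paths are illustrative names, not taken from the post.

```shell
# Sketch of a depth-2 crawl cycle (generate/fetch/updatedb).
# NUTCH defaults to echo so this runs as a dry run; set NUTCH=bin/nutch
# on a real Nutch 0.8 install.
NUTCH=${NUTCH:-echo}

crawl_cycle() {
  $NUTCH inject crawldb urls                       # seed crawldb from the urls/ seed list
  for depth in 1 2; do
    $NUTCH generate crawldb segments               # write a fetchlist into a new segment
    seg=$(ls -d segments/* 2>/dev/null | sort | tail -1)  # newest segment, if any
    $NUTCH fetch "$seg"                            # fetch the generated fetchlist
    $NUTCH updatedb crawldb "$seg"                 # fold fetch results back into crawldb
  done
  $NUTCH readdb crawldb -stats                     # dump crawl db statistics
}
crawl_cycle
```

In dry-run mode each command line is printed rather than executed, which makes the cycle's shape easy to inspect before running it for real.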

Re: Error while indexing (mapred)

2006-02-14 Thread Florent Gluck
master machine. We've increased this twice now, and each time it solved similar problems. We now have it at 16K. See my other post today (re: Corrupt NDFS?) for more details. Good Luck, - Chris At 11:07 AM -0500 2/10/06, Florent Gluck wrote: Hi, I have 4 boxes (1 master, 3 slaves), about

Error while indexing (mapred)

2006-02-10 Thread Florent Gluck
Hi, I have 4 boxes (1 master, 3 slaves), about 33GB worth of segment data and 4.6M fetched urls in my crawldb. I'm using the mapred code from trunk (revision 374061, Wed, 01 Feb 2006). I was able to generate the indexes from the crawldb and linkdb, but I started to see this error recently while
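The indexing step referred to above (building indexes from the crawldb and linkdb) can be sketched as follows, assuming the 0.8-era `bin/nutch invertlinks` and `bin/nutch index <index> <crawldb> <linkdb> <segment> ...` forms; all paths are illustrative, not taken from the post.

```shell
# Sketch of the post-crawl indexing step.
# NUTCH defaults to echo (dry run); set NUTCH=bin/nutch on a real install.
NUTCH=${NUTCH:-echo}

build_index() {
  $NUTCH invertlinks linkdb segments/*            # build the linkdb from fetched segments
  $NUTCH index indexes crawldb linkdb segments/*  # index crawldb + linkdb + segment data
}
build_index
```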

Re: So many Unfetched Pages using MapReduce

2006-01-23 Thread Florent Gluck
of Generator.java, but it didn't change the situation. I'll try to do some more testing. Thanks, Mike On 1/19/06, Doug Cutting [EMAIL PROTECTED] wrote: Florent Gluck wrote: I then decided to switch to using the old http protocol plugin: protocol-http (in nutch-default.xml) instead

Re: So many Unfetched Pages using MapReduce

2006-01-23 Thread Florent Gluck
affect the local crawl as well since they have nothing to do w/ ndfs. It therefore seems that /protocol-httpclient/ is reliable enough to be used (well, at least in my case). --Flo Florent Gluck wrote: Andrzej Bialecki wrote: Could you please check (on a smaller sample ;-) ) which of these two

Re: So many Unfetched Pages using MapReduce

2006-01-19 Thread Florent Gluck
Hi Mike, Your different tests are really interesting, thanks for sharing! I didn't do as many tests. I changed the number of fetch threads and the number of map and reduce tasks and noticed that it gave me quite different results in terms of pages fetched. Then, I wanted to see if this issue

Re: Error at end of MapReduce run with indexing

2006-01-17 Thread Florent Gluck
Ken Krugler wrote: Hello fellow Nutchers, I followed the steps described here by Doug: http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200509.mbox/[EMAIL PROTECTED] ...to start a test run of the new (0.8, as of 1/12/2006) version of Nutch. It ran for quite a while on my

Re: So many Unfetched Pages using MapReduce

2006-01-17 Thread Florent Gluck
I'm having the exact same problem. I noticed that changing the number of map/reduce tasks gives me different DB_fetched results. Looking at the logs, a lot of urls are actually missing. I can't find their trace *anywhere* in the logs (whether on the slaves or the master). I'm puzzled. Currently

mapred fetching weirdness

2006-01-10 Thread Florent Gluck
Hi, I'm running nutch trunk as of today. I have 3 slaves and a master. I'm using *mapred.map.tasks=20* and *mapred.reduce.tasks=4* There is something I'm really confused about. When I inject 25000 urls and fetch them (depth = 1) and do a readdb -stats, I get: 060110 171347 Statistics for
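The two task-count settings mentioned are job-configuration properties; a minimal override fragment, assuming the standard Nutch 0.8 `conf/nutch-site.xml` mechanism and using the values from the post, would look like:

```xml
<!-- Fragment for conf/nutch-site.xml; values taken from the post. -->
<property>
  <name>mapred.map.tasks</name>
  <value>20</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>
```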

Re: is nutch recrawl possible?

2005-12-19 Thread Florent Gluck
Pushpesh, We extended nutch with a whitelist filter and you might find it useful. Check the comments from Matt Kangas here: http://issues.apache.org/jira/browse/NUTCH-87?page=all --Flo Pushpesh Kr. Rajwanshi wrote: hmmm... actually my requirement is

java.io.IOException in dedup (map reduce)

2005-12-15 Thread Florent Gluck
Hi, I'm using the map reduce branch, 1 master and 3 slaves, and they are configured the standard way (master as a jobtracker + namenode) After having created an index, I run dedup on it, but I get an IOException. Here is an extract of the log: 051215 160733 Dedup: starting 051215 160733 Dedup:
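The dedup step being run here can be sketched as below, assuming the 0.8-era `bin/nutch dedup <indexes>` form; the `indexes` path is an illustrative name, not taken from the post.

```shell
# Sketch of the dedup step run after index creation.
# NUTCH defaults to echo (dry run); set NUTCH=bin/nutch on a real install.
NUTCH=${NUTCH:-echo}

run_dedup() {
  $NUTCH dedup indexes   # delete duplicate documents across the index parts
}
run_dedup
```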

nutch mapred + tomcat and a couple other questions

2005-12-12 Thread Florent Gluck
Sorry for such a basic question, but how do we run a search on a generated index? I read about how to setup tomcat w/ nutch 0.8 and you have to run it in the directory where the index resides (apparently it looks in the segments dir from where it's run). However this won't work w/ nutch 0.8

Re: nutch mapred + tomcat and a couple other questions

2005-12-12 Thread Florent Gluck
My err, I meant nutch server not nutch search --Flo Florent Gluck wrote: Sorry for such a basic question, but how do we run a search on a generated index ? I read about how to setup tomcat w/ nutch 0.8 and you have to run it in the directory where the index resides (apparently it looks

Re: nutch mapred + tomcat and a couple other questions

2005-12-12 Thread Florent Gluck
Never mind, I got tomcat working. After looking at the code, it seems nutch parse does nothing yet. The last remaining thing is how to use NutchBean to output the segments' content. Thanks, --Flo Florent Gluck wrote: Sorry for such a basic question, but how do we run a search on a generated

Incremental crawl w/ map reduce

2005-12-09 Thread Florent Gluck
Hi, As a test, I recently did a quick incremental crawl. First, I did a crawl with 10 seed urls using 4 nodes (1 jobTracker/nameNode + 3 taskTrackers/dataNodes). So far, so good, the fetches were distributed among the 3 nodes (3/3/4) and a segment was generated. Running a quick -stats on the