Re: readseg bug?

2007-05-17 Thread Florent Gluck
Thank you for the explanation.  It was a bit confusing at first, but it 
actually makes sense.


Florent

Doğacan Güney wrote:

Hi,

On 5/17/07, Florent Gluck [EMAIL PROTECTED] wrote:

Hi all,

I've noticed that when doing a segment dump using readseg, several
instances of the same CrawlDatum can be present in a given record.
For example I have a segment with one single url (http://www.moma.org)
and here is the dump below.  I ran the following command:  nutch readseg
-dump segments/20070517113941 segdump -nocontent -noparsedata 
-noparsetext


With this command, readseg reads from crawl_{fetch,generate,parse}.



Here is the first record:

Recno:: 0
URL:: http://www.moma.org/

CrawlDatum::
Version: 5
Status: 1 (db_unfetched)
Fetch time: Thu May 17 11:39:34 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null
Metadata: _ngt_:1179416381663


This one is from crawl_generate; you can see that it contains an _ngt_
field. This datum is read by the fetcher.



CrawlDatum::
Version: 5
Status: 65 (signature)
Fetch time: Thu May 17 11:39:51 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 0.0 days
Score: 1.0
Signature: fe47b3db7c988541287fc6412ce0b923
Metadata: null


This one is from crawl_parse. It contains the signature of the parse text,
which is used to dedup after indexing.



CrawlDatum::
Version: 5
Status: 33 (fetch_success)
Fetch time: Thu May 17 11:39:49 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: fe47b3db7c988541287fc6412ce0b923
Metadata: _ngt_:1179416381663 _pst_:success(1), lastModified=0



This is from crawl_fetch.
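
(If you want to poke at those three subdirectories yourself, a minimal sketch of
reading one of them with the Hadoop SequenceFile API that Nutch 0.9 sits on is
shown below. The class name and the hard-coded part path are placeholders, not
taken from the thread; crawl_fetch is a MapFile directory, so you point the
reader at its data file, while a crawl_generate or crawl_parse part file can be
read directly.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

// Rough sketch: dump the url/CrawlDatum pairs of one segment part,
// which is essentially what "readseg -dump" prints for that directory.
public class SegmentPartDumper {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // defaults to the local filesystem
    FileSystem fs = FileSystem.get(conf);
    // e.g. segments/20070517113941/crawl_fetch/part-00000/data
    Path part = new Path(args[0]);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    while (reader.next(url, datum)) {
      System.out.println(url);
      System.out.println(datum);   // CrawlDatum.toString(), same layout as the dump above
    }
    reader.close();
  }
}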


Why are there 3 CrawlDatum fields?
I assumed there would be only one CrawlDatum with status 33 
(fetch_success).

What is the purpose of the other two?

Now, here is the 5th record:

Recno:: 5
URL:: http://www.moma.org/application/x-shockwave-flash

CrawlDatum::
Version: 5
Status: 67 (linked)
Fetch time: Thu May 17 11:39:51 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.03846154
Signature: null
Metadata: null

CrawlDatum::
Version: 5
Status: 67 (linked)
Fetch time: Thu May 17 11:39:51 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.03846154
Signature: null
Metadata: null

CrawlDatum::
Version: 5
Status: 67 (linked)
Fetch time: Thu May 17 11:39:51 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.03846154
Signature: null
Metadata: null

CrawlDatum::
Version: 5
Status: 67 (linked)
Fetch time: Thu May 17 11:39:51 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.03846154
Signature: null
Metadata: null

CrawlDatum::
Version: 5
Status: 67 (linked)
Fetch time: Thu May 17 11:39:51 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.03846154
Signature: null
Metadata: null

CrawlDatum::
Version: 5
Status: 67 (linked)
Fetch time: Thu May 17 11:39:51 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.03846154
Signature: null
Metadata: null


In this case, a linked status indicates an outlink. Most likely your
url (http://www.moma.org) contains six distinct outlinks to
http://www.moma.org/application/x-shockwave-flash. Each of them is written
as a separate entry to crawl_parse. This is used in updatedb to
(among other things) calculate the score.
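
(To make the updatedb part a bit more concrete: conceptually, updatedb groups
every CrawlDatum emitted for the same url and folds the linked ones into the
single entry it keeps in the crawldb. The snippet below is only an illustration
of that idea, not the actual CrawlDbReducer code, and the straight score summing
is an assumption about the default OPIC-style scoring of that era.)

import java.util.List;
import org.apache.nutch.crawl.CrawlDatum;

// Illustration only: roughly what happens for one url during updatedb.
// The reducer sees the old db entry (if any), the fetch/parse result, and
// one LINKED datum per inlink found while parsing, and keeps a single entry.
public class UpdateDbSketch {
  static CrawlDatum reduceForUrl(CrawlDatum oldEntry, List<CrawlDatum> collected) {
    CrawlDatum result = (oldEntry != null) ? oldEntry : new CrawlDatum();
    float inlinkScore = 0.0f;
    for (CrawlDatum d : collected) {
      if (d.getStatus() == CrawlDatum.STATUS_LINKED) {
        inlinkScore += d.getScore();   // six identical outlinks => six contributions
      } else {
        result = d;                    // a fetch/parse status supersedes the old entry
      }
    }
    result.setScore(result.getScore() + inlinkScore);
    return result;
  }
}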




There are 6 CrawlDatum fields and all of them are exactly identical.
Is this a bug or am I missing something here?

Any light on this matter would be greatly appreciated.
Thank you.

Florent








Re: Buggy fetchlist urls

2006-03-14 Thread Florent Gluck
Hi Andrzej,

Well, I think for now I'll just disable the parse-js plugin since I
don't really need it anyway.
I'll let you know if I ever work on it (I may need it in the future).

Thanks,
--Flo

Andrzej Bialecki wrote:

 Florent Gluck wrote:

 Some urls are totally bogus.  I haven't investigated what could be causing
 this yet, but it looks like it could be a parsing issue.  Some urls
 contain javascript code and others contain html tags.
   


 This is a side-effect of our primitive parse-js, which doesn't really
 parse anything; it just uses some heuristics to extract possible URLs.
 Unfortunately, as often as not the strings it extracts don't have
 anything to do with URLs.

 If you have suggestions on how to improve it I'm all ears.
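
(For anyone wondering what that heuristic looks like in practice: parse-js
basically scans the script text for quoted strings that look like they could be
links. The toy snippet below is not the plugin's code, just an illustration of
why mime types like application/x-shockwave-flash and stray code fragments end
up on the fetchlist. The class name and the example base url are made up.)

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration of heuristic "URL" extraction from JavaScript:
// any quoted string containing a dot or a slash is taken as a candidate
// link, which is how MIME types and code fragments leak into the fetchlist.
public class JsLinkGuesser {
  private static final Pattern CANDIDATE =
      Pattern.compile("[\"']([^\"'\\s]*[./][^\"'\\s]*)[\"']");

  public static List<String> guessLinks(String script, String base) {
    List<String> links = new ArrayList<String>();
    Matcher m = CANDIDATE.matcher(script);
    while (m.find()) {
      String s = m.group(1);
      // absolute URLs pass through, everything else is resolved against base
      links.add(s.startsWith("http://") ? s : base + "/" + s);
    }
    return links;
  }

  public static void main(String[] args) {
    String js = "var t = 'application/x-shockwave-flash'; document.write('ad.js');";
    System.out.println(guessLinks(js, "http://www.example.com"));
  }
}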




Buggy fetchlist urls

2006-03-13 Thread Florent Gluck
Hi,

I'm using nutch revision 385671 from the trunk.  I'm running it on a
single machine using the local filesystem.
I just started with a seed of a single url: http://www.osnews.com
Then I ran a crawl cycle of depth 2 (generate/fetch/updatedb) and
dumped the crawl db.  Here is where I got quite a surprise:

[EMAIL PROTECTED]:~/tmp$ nutch readdb crawldb -dump dump
[EMAIL PROTECTED]:~/tmp$ grep ^http dump/part-0
http://a.ads.t-online.de/   Version: 4
http://a.as-eu.falkag.net/  Version: 4
http://a.as-rh4.falkag.net/ Version: 4
http://a.as-rh4.falkag.net/server/asldata.jsVersion: 4
http://a.as-test.falkag.net/Version: 4
http://a.as-us.falkag.net/  Version: 4
http://a.as-us.falkag.net/dat/bfx/  Version: 4
http://a.as-us.falkag.net/dat/bgf/  Version: 4
http://a.as-us.falkag.net/dat/bgf/trpix.gif;Version: 4
http://a.as-us.falkag.net/dat/bjf/  Version: 4
http://a.as-us.falkag.net/dat/brf/  Version: 4
http://a.as-us.falkag.net/dat/cjf/  Version: 4
http://a.as-us.falkag.net/dat/cjf/00/13/60/94.jsVersion: 4
http://a.as-us.falkag.net/dat/cjf/00/13/60/96.jsVersion: 4
http://a.as-us.falkag.net/dat/dlv/);QQt.document.write( Version: 4
http://a.as-us.falkag.net/dat/dlv/);document.write( Version: 4
http://a.as-us.falkag.net/dat/dlv/+((QQPc-QQwA)/1000)+  Version: 4
http://a.as-us.falkag.net/dat/dlv/.ads.t-online.de  Version: 4
http://a.as-us.falkag.net/dat/dlv/.as-eu.falkag.net Version: 4
http://a.as-us.falkag.net/dat/dlv/.as-rh4.falkag.netVersion: 4
http://a.as-us.falkag.net/dat/dlv/.as-us.falkag.net Version: 4
http://a.as-us.falkag.net/dat/dlv/://   Version: 4
http://a.as-us.falkag.net/dat/dlv//bbr  Version: 4
http://a.as-us.falkag.net/dat/dlv//big/bbrVersion: 4
http://a.as-us.falkag.net/dat/dlv//center/td/tr/table/body/html
Version: 4
http://a.as-us.falkag.net/dat/dlv//divVersion: 4
http://a.as-us.falkag.net/dat/dlv/Banner-Typ/PopUp  Version: 4
http://a.as-us.falkag.net/dat/dlv/ShockwaveFlash.ShockwaveFlash.   
Version: 4
http://a.as-us.falkag.net/dat/dlv/afxplay.jsVersion: 4
http://a.as-us.falkag.net/dat/dlv/application/x-shockwave-flash Version: 4
http://a.as-us.falkag.net/dat/dlv/aslmain.jsVersion: 4
http://a.as-us.falkag.net/dat/dlv/text/javascript   Version: 4
http://a.as-us.falkag.net/dat/dlv/window.blur();Version: 4
http://a.as-us.falkag.net/dat/njf/  Version: 4
http://bilbo.counted.com/0/42699/   Version: 4
http://bilbo.counted.com/7/42699/   Version: 4
http://bw.ads.t-online.de/  Version: 4
http://bw.as-eu.falkag.net/ Version: 4
http://bw.as-us.falkag.net/ Version: 4
http://data.as-us.falkag.net/server/asldata.js  Version: 4
http://denux.org/   Version: 4
...

Some urls are totally bogus.  I haven't investigated what could be causing
this yet, but it looks like it could be a parsing issue.  Some urls
contain javascript code and others contain html tags.

Is there anyone aware of this?
I can open a bug if needed.

Thanks,
--Flo


Re: Error while indexing (mapred)

2006-02-14 Thread Florent Gluck
Chris,

I bumped the maximum number of open file descriptors to 32k, but still
no luck:

...
060214 062901  reduce 9%
060214 062905  reduce 10%
060214 062908  reduce 11%
060214 062911  reduce 12%
060214 062914  reduce 11%
060214 062917  reduce 10%
060214 062918  reduce 9%
060214 062919  reduce 10%
060214 062923  reduce 9%
060214 062924  reduce 10%
Exception in thread main java.io.IOException: Job failed!
at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:310)
at
org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:329)
at
org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:349)

Exactly the same error messages as before.
I guess I'll take my chances with the latest revision in trunk and try
again :-/

--Florent

Chris Schneider wrote:

Florent,

You might want to try increasing the number of open files allowed on your 
master machine. We've increased this twice now, and each time it solved 
similar problems. We now have it at 16K. See my other post today (re: Corrupt 
NDFS?) for more details.

Good Luck,

- Chris

At 11:07 AM -0500 2/10/06, Florent Gluck wrote:
  

Hi,

I have 4 boxes (1 master, 3 slaves), about 33GB worth of segment data
and 4.6M fetched urls in my crawldb.  I'm using the mapred code from
trunk  (revision 374061, Wed, 01 Feb 2006).
I was able to generate the indexes from the crawldb and linkdb, but I
started to see this error recently while  running a dedup on my indexes:


060210 061707  reduce 9%
060210 061710  reduce 10%
060210 061713  reduce 11%
060210 061717  reduce 12%
060210 061719  reduce 11%
060210 061723  reduce 10%
060210 061725  reduce 11%
060210 061726  reduce 10%
060210 061729  reduce 11%
060210 061730  reduce 9%
060210 061732  reduce 10%
060210 061736  reduce 11%
060210 061739  reduce 12%
060210 061742  reduce 10%
060210 061743  reduce 9%
060210 061745  reduce 10%
060210 061746  reduce 100%
Exception in thread main java.io.IOException: Job failed!
 at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:310)
 at
org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:329)
 at
org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:349)

I can see a lot of these messages in the jobtracker log on the master:
...
060210 061743 Task 'task_r_4t50k4' has been lost.
060210 061743 Task 'task_r_79vn7i' has been lost.
...

On every single slave, I get this file not found exception in the
tasktracker log:
060210 061749 Server handler 0 on 50040 caught:
java.io.FileNotFoundException:
/var/epile/nutch/mapred/local/task_m_273opj/part-4.out
java.io.FileNotFoundException:
/var/epile/nutch/mapred/local/task_m_273opj/part-4.out
   at
org.apache.nutch.fs.LocalFileSystem.openRaw(LocalFileSystem.java:121)  
at
org.apache.nutch.fs.NFSDataInputStream$Checker.<init>(NFSDataInputStream.java:45)
   at
org.apache.nutch.fs.NFSDataInputStream.<init>(NFSDataInputStream.java:226)
   at
org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:160)
   at
org.apache.nutch.mapred.MapOutputFile.write(MapOutputFile.java:93)
   at
org.apache.nutch.io.ObjectWritable.writeObject(ObjectWritable.java:121)
   at org.apache.nutch.io.ObjectWritable.write(ObjectWritable.java:68)
   at org.apache.nutch.ipc.Server$Handler.run(Server.java:215)

I used to be able to complete the index deduping successfully when my
segments/crawldb was smaller, but I don't see why this would be related
to the FileNotFoundException.  I'm far from running out of disk space
and my hard disks work properly.

Has anyone encountered a similar issue or has a clue about what's happening?

Thanks,
Florent



  




Error while indexing (mapred)

2006-02-10 Thread Florent Gluck
Hi,

I have 4 boxes (1 master, 3 slaves), about 33GB worth of segment data
and 4.6M fetched urls in my crawldb.  I'm using the mapred code from
trunk  (revision 374061, Wed, 01 Feb 2006).
I was able to generate the indexes from the crawldb and linkdb, but I
started to see this error recently while  running a dedup on my indexes:


060210 061707  reduce 9%
060210 061710  reduce 10%
060210 061713  reduce 11%
060210 061717  reduce 12%
060210 061719  reduce 11%
060210 061723  reduce 10%
060210 061725  reduce 11%
060210 061726  reduce 10%
060210 061729  reduce 11%
060210 061730  reduce 9%
060210 061732  reduce 10%
060210 061736  reduce 11%
060210 061739  reduce 12%
060210 061742  reduce 10%
060210 061743  reduce 9%
060210 061745  reduce 10%
060210 061746  reduce 100%
Exception in thread main java.io.IOException: Job failed!
  at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:310)
  at
org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:329)
  at
org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:349)

I can see a lot of these messages in the jobtracker log on the master:
...
060210 061743 Task 'task_r_4t50k4' has been lost.
060210 061743 Task 'task_r_79vn7i' has been lost.
...

On every single slave, I get this file not found exception in the
tasktracker log:
060210 061749 Server handler 0 on 50040 caught:
java.io.FileNotFoundException:
/var/epile/nutch/mapred/local/task_m_273opj/part-4.out
java.io.FileNotFoundException:
/var/epile/nutch/mapred/local/task_m_273opj/part-4.out
at
org.apache.nutch.fs.LocalFileSystem.openRaw(LocalFileSystem.java:121)   
at
org.apache.nutch.fs.NFSDataInputStream$Checker.<init>(NFSDataInputStream.java:45)
at
org.apache.nutch.fs.NFSDataInputStream.<init>(NFSDataInputStream.java:226)
at
org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:160)
at
org.apache.nutch.mapred.MapOutputFile.write(MapOutputFile.java:93)
at
org.apache.nutch.io.ObjectWritable.writeObject(ObjectWritable.java:121)
at org.apache.nutch.io.ObjectWritable.write(ObjectWritable.java:68)
at org.apache.nutch.ipc.Server$Handler.run(Server.java:215)

I used to be able to complete the index deduping successfully when my
segments/crawldb was smaller, but I don't see why this would be related
to the FileNotFoundException.  I'm far from running out of disk space
and my hard disks work properly.

Has anyone encountered a similar issue or has a clue about what's happening?

Thanks,
Florent


Re: So many Unfetched Pages using MapReduce

2006-01-23 Thread Florent Gluck
Hi Mike,

I finally got everything working properly!
What I did was to switch to /protocol-http/ and move the following from
/nutch-site.xml/ to /mapred-default.xml/:

<property>
  <name>mapred.map.tasks</name>
  <value>100</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is local.
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>40</value>
  <description>The default number of reduce tasks per job.  Typically set
  to a prime close to the number of available hosts.  Ignored when
  mapred.job.tracker is local.
  </description>
</property>

I then injected 100'000 urls and grepped the logs on my 4 slaves to see
if the sum of all the fetched urls adds up to 100'000.  It did :)
In the end, there was no need to comment out line 211 of Generator.java.

Hope it helps,
--Flo

Mike Smith wrote:

Hi Florent

Thanks for the inquiry and reply. I did some more tests based on your
suggestion.
Using the old protocol-http the problem is solved for a single machine. But
when I have datanodes running on two other machines the problem still exists,
though the number of unfetched pages is less than before. These are my tests:

Injected URL: 8
only one machine is a datanode: 7 fetched pages
map tasks: 3
reduce tasks: 3
threads: 250

Injected URL: 8
3 machines are datanodes. All machines participated in the fetching,
judging by the task tracker logs on the three machines: 2 fetched pages
map tasks: 12
reduce tasks: 6
threads: 250

Injected URL: 5000
3 machines are datanodes. All machines participated in the fetching,
judging by the task tracker logs on the three machines: 1200 fetched pages
map tasks: 12
reduce tasks: 6
threads: 250

Injected URL: 1000
3 machines are datanodes. All machines participated in the fetching,
judging by the task tracker logs on the three machines: 240 fetched pages

Injected URL: 1000
only one machine is a datanode: 800 fetched pages
map tasks: 3
reduce tasks: 3
threads: 250

I also commented out line 211 of Generator.java, but it didn't change the
situation.

I'll try to do some more testing.

Thanks, Mike

On 1/19/06, Doug Cutting [EMAIL PROTECTED] wrote:
  

Florent Gluck wrote:


I then decided to switch to using the old http protocol plugin:
protocol-http (in nutch-default.xml) instead of protocol-httpclient
With the old protocol I got 5 as expected.
  

There have been a number of complaints about unreliable fetching with
protocol-httpclient, so I've switched the default back to protocol-http.

Doug




  




Re: So many Unfetched Pages using MapReduce

2006-01-23 Thread Florent Gluck
Andrzej,

I ran 2 crawls of 1 pass each, injecting 100'000 urls.
Here is the output of /readdb -stats/ when crawling with /protocol-http/:

060123 162250 TOTAL urls:   119221
060123 162250 avg score:1.023
060123 162250 max score:240.666
060123 162250 min score:1.0
060123 162250 retry 0:  56648
060123 162250 retry 1:  62573
060123 162250 status 1 (DB_unfetched):  89068
060123 162250 status 2 (DB_fetched):27513
060123 162250 status 3 (DB_gone):   2640

And here is the output when crawling with /protocol-httpclient/:

060123 180243 TOTAL urls:   117451
060123 180243 avg score:1.021
060123 180243 max score:194.0
060123 180243 min score:1.0
060123 180243 retry 0:  52273
060123 180243 retry 1:  65178
060123 180243 status 1 (DB_unfetched):  89670
060123 180243 status 2 (DB_fetched):26066
060123 180243 status 3 (DB_gone):   1715

Both return more or less the same results (w/ a difference of ~1.5% in
the #fetches which is not surprising on a 100k set).
I checked the logs and in the 2 cases, I see exactly 100'000 fetch attempts.
You were right, it actually makes sense that the settings in
/mapred-default.xml/ would affect the local crawl as well since they
have nothing to do w/ ndfs.
It therefore seems that /protocol-httpclient/ is reliable enough to be
used (well, at least in my case).

--Flo

Florent Gluck wrote:

Andrzej Bialecki wrote:

  

Could you please check (on a smaller sample ;-) ) which of these two
changes was necessary? First, second, or both? I suspect only the
second change was really needed, i.e. the change in config files, and
not the change from protocol-httpclient to protocol-http ... It would be
very helpful if you could confirm/deny this.



Well, I'm pretty much sure protocol-httpclient is part of the problem. 
Earlier last week, I was trying to figure out what the problem was and I
ran some crawls on a single machine, using the local filesystem.  Here
were my previous observations (from an older message):

I injected 5 urls and got 2315 urls fetched.  I couldn't find any
trace of most of the urls in the logs.
I noticed that if I put a counter at the beginning of the
while(true) loop in the run method in Fetcher.java, I don't
end up with 5!
After some poking around, I noticed that if I comment out the line doing
the page fetch (ProtocolOutput output = protocol.getProtocolOutput(key,
datum);), then I get 5.
There seems to be something really wrong with that.  It seems to mean
that some threads are dying without notification in the http protocol
code (if that makes any sense).
I then decided to switch to using the old http protocol plugin:
protocol-http (in nutch-default.xml) instead of protocol-httpclient.
With the old protocol I got 5 as expected.


So to me it seems protocol-httpclient is buggy.  I'll still run a test
with my current config and protocol-httpclient and let you know.
-Flo

  




Re: So many Unfetched Pages using MapReduce

2006-01-19 Thread Florent Gluck
Hi Mike,

Your different tests are really interesting, thanks for sharing!
I didn't do as many tests. I changed the number of fetch threads and the
number of map and reduce tasks and noticed that it gave me quite
different results in terms of pages fetched.
Then, I wanted to see if this issue would still happen when running the
crawl (single pass) on one single machine running everything locally,
without ndfs.
So I injected 5 urls and got 2315 urls fetched.  I couldn't find any
trace of most of the urls in the logs.
I noticed that if I put a counter at the beginning of the
while(true) loop in the run method in Fetcher.java, I don't
end up with 5!
After some poking around, I noticed that if I comment out the line doing
the page fetch (ProtocolOutput output = protocol.getProtocolOutput(key,
datum);), then I get 5.
There seems to be something really wrong with that.  It seems to mean
that some threads are dying without notification in the http protocol
code (if that makes any sense).
I then decided to switch to using the old http protocol plugin:
protocol-http (in nutch-default.xml) instead of protocol-httpclient.
With the old protocol I got 5 as expected.
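
(A side note on the counter trick above, since it is a handy pattern for this
kind of bug: a shared counter plus a default uncaught-exception handler makes
silently dying worker threads visible. The sketch below is a generic
illustration, not the actual Fetcher.java code; fetchOne() is a made-up
placeholder for the per-url work.)

import java.util.concurrent.atomic.AtomicInteger;

// Generic illustration: if a worker thread dies from an unchecked exception,
// the processed count stops short of the expected total and the handler logs why.
public class WorkerLoopProbe {
  static final AtomicInteger processed = new AtomicInteger(0);

  public static void main(String[] args) throws InterruptedException {
    // Log threads that die from unchecked exceptions instead of losing them silently.
    Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
      public void uncaughtException(Thread t, Throwable e) {
        System.err.println("thread " + t.getName() + " died: " + e);
      }
    });

    int expected = 50;
    Thread[] workers = new Thread[5];
    for (int i = 0; i < workers.length; i++) {
      workers[i] = new Thread(new Runnable() {
        public void run() {
          for (int j = 0; j < 10; j++) {
            processed.incrementAndGet(); // the "counter at the top of the loop"
            // fetchOne();               // an uncaught failure here kills the thread
          }
        }
      });
      workers[i].start();
    }
    for (int i = 0; i < workers.length; i++) workers[i].join();
    System.out.println("processed " + processed.get() + " of " + expected);
  }
}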

The following bug seems to be very similar to what we are encountering:
http://issues.apache.org/jira/browse/NUTCH-136
Check out the latest comment.  I'm gonna remove line 211 and run some
tests to see how it behaves (with protocol-http and protocol-httpclient).

I'll let you know what I find out,
--Florent

Mike Smith wrote:

Hi Florent

I did some more testing. Here are the results:

I have 3 machines, P4 with 1G ram. All three are data nodes and one is the
namenode. I started from 8 seed urls and tried to see the effect of a
depth 1 crawl for different configurations.

The number of unfetched pages changes with different configurations:

--Configuration 1
Number of map tasks: 3
Number of reduce tasks: 3
Number of fetch threads: 40
Number of thread per host: 2
http.timeout: 10 sec
---
6700 pages fetched

--Configuration 2
Number of map tasks: 12
Number of reduce tasks: 6
Number of fetch threads: 500
Number of thread per host: 20
http.timeout: 10 sec
---
18000 pages fetched

--Configuration 3
Number of map tasks: 40
Number of reduce tasks: 20
Number of fetch threads: 500
Number of thread per host: 20
http.timeout: 10 sec
---
37000 pages fetched

--Configuration 4
Number of map tasks: 100
Number of reduce tasks: 20
Number of fetch threads: 100
Number of thread per host: 20
http.timeout: 10 sec
---
34000 pages fetched


--Configuration 5
Number of map tasks: 50
Number of reduce tasks: 50
Number of fetch threads: 40
Number of thread per host: 100
http.timeout: 20 sec
---
52000 pages fetched

--Configuration 6
Number of map tasks: 50
Number of reduce tasks: 100
Number of fetch threads: 40
Number of thread per host: 100
http.timeout: 20 sec
---
57000 pages fetched

--Configuration 7
Number of map tasks: 50
Number of reduce tasks: 120
Number of fetch threads: 250
Number of thread per host: 20
http.timeout: 20 sec
---
6 pages fetched



Do you have any idea why pages are missing from the fetcher without any
log or exceptions? It seems it really depends on the number of reduce
tasks!
Thanks, Mike
  




Re: Error at end of MapReduce run with indexing

2006-01-17 Thread Florent Gluck
Ken Krugler wrote:

 Hello fellow Nutchers,

 I followed the steps described here by Doug:
  
 http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200509.mbox/[EMAIL
  PROTECTED]


 ...to start a test run of the new (0.8, as of 1/12/2006) version of
 Nutch.

 It ran for quite a while on my three machines - started at 111226, and
 died at 150937, so almost four hours.

 The error occurred during the Indexer phase:

 060114 150937 Indexer: starting
 060114 150937 Indexer: linkdb: crawl-20060114111226/linkdb
 060114 150937 Indexer: adding segment:
 /user/crawler/crawl-20060114111226/segments/20060114111918
 060114 150937 parsing file:/home/crawler/nutch/conf/nutch-default.xml
 060114 150937 parsing file:/home/crawler/nutch/conf/crawl-tool.xml
 060114 150937 parsing file:/home/crawler/nutch/conf/mapred-default.xml
 060114 150937 parsing file:/home/crawler/nutch/conf/mapred-default.xml
 060114 150937 parsing file:/home/crawler/nutch/conf/nutch-site.xml
 060114 150937 Indexer: adding segment:
 /user/crawler/crawl-20060114111226/segments/20060114122751
 060114 150937 Indexer: adding segment:
 /user/crawler/crawl-20060114111226/segments/20060114133620
 Exception in thread main java.io.IOException: timed out waiting for
 response
 at org.apache.nutch.ipc.Client.call(Client.java:296)
 at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
 at $Proxy1.submitJob(Unknown Source)
 at
 org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
 at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
 at org.apache.nutch.indexer.Indexer.index(Indexer.java:259)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)

 1. Any ideas what might have caused it to time out just now, when it
 had successfully run many jobs up to that point?

 2. What cruft might I need to get rid of because it died? For example,
 I see a reference to /home/crawler/tmp/local/jobTracker/job_18cunz.xml
 now when I try to execute some Nutch commands.

I've had the same problem during the invertlinks step when dealing w/ a
large number of urls.  Increasing the ipc.client.timeout value from
6  to 10 (cf nutch-default.xml) did the trick.


 3. What's the best way to find out how many pages were actually
 crawled, how many links are in the DB, etc? The 0.7-era commands
 (readdb, segread, etc) don't seem to be working with the new NDFS setup.

The following gives you some stats about the crawl db (#urls fetched,
unfetched, and dead):
nutch readdb crawldb -stats


 4. Any idea whether 4 hours is a reasonable amount of time for this
 test? It seemed long to me, given that I was starting with a single
 URL as the seed.

How many crawl passes did you do?

--Flo


Re: So many Unfetched Pages using MapReduce

2006-01-17 Thread Florent Gluck
I'm having the exact same problem.
I noticed that changing the number of map/reduce tasks gives me
different DB_fetched results.
Looking at the logs, a lot of urls are actually missing.  I can't find
their trace *anywhere* in the logs (whether on the slaves or the
master).  I'm puzzled.  Currently I'm trying to debug the code to see
what's going on.
So far, I noticed the generator is fine, so the issue must lie further
down the pipeline (fetcher?).

Let me know if you find anything regarding this issue. Thanks.

--Flo

Mike Smith wrote:

Hi,

I have set up four boxes using MapReduce and everything goes smoothly. I have
fed in about 8 seed urls to begin with and I have crawled to depth 2.
Only 1900 pages (about 300MB) of data were fetched and the rest is marked as
db_unfetched.
Does anyone know what could be wrong?

This is the output of (bin/nutch readdb h2/crawldb -stats):

060115 171625 Statistics for CrawlDb: h2/crawldb
060115 171625 TOTAL urls:   99403
060115 171625 avg score:1.01
060115 171625 max score:7.382
060115 171625 min score:1.0
060115 171625 retry 0:  99403
060115 171625 status 1 (DB_unfetched):  97470
060115 171625 status 2 (DB_fetched):1933
060115 171625 CrawlDb statistics: done

Thanks,
Mike

  




mapred fetching weirdness

2006-01-10 Thread Florent Gluck
Hi,

I'm running nutch trunk as of today.  I have 3 slaves and a master.  I'm
using *mapred.map.tasks=20* and *mapred.reduce.tasks=4*.
There is something I'm really confused about.

When I inject 25000 urls and fetch them (depth = 1) and do a readdb
-stats, I get:
060110 171347 Statistics for CrawlDb: crawldb
060110 171347 TOTAL urls:   27939
060110 171347 avg score:1.011
060110 171347 max score:8.883
060110 171347 min score:1.0
060110 171347 retry 0:  26429
060110 171347 retry 1:  1510
060110 171347 status 1 (DB_unfetched):  24248
060110 171347 status 2 (DB_fetched):3390
060110 171347 status 3 (DB_gone):   301
060110 171347 CrawlDb statistics: done

There are several things that don't make sense to me and it would be
great if someone could clear this up:

1.
If I count the number of occurrences of fetching in all of my slaves'
tasktracker logs, I get 6225.
This number clearly doesn't match the *DB_fetched* of 3390 from the
readdb output.  Why is that ?
What happened to the 6225-3390=2835 urls missing ?

2.
Why is the *TOTAL urls: 27939* if I inject a file with 25000 entries?
Why is it not 25000?

3.
What is the meaning of *DB_gone* and *DB_unfetched*?
I was assuming that if you inject a total of 25k urls where 5000 are
fetchable ones, you would get something like:
(DB_unfetched):  2
(DB_fetched):5000
That's not the case, so I'd like to understand what exactly is going on here.

4.
If I redo (starting from an empty crawldb of course) the exact same
inject + crawl with the same 25000 urls, but I use the following mapred
settings instead: *mapred.map.tasks=200* and *mapred.reduce.tasks=8*, I
get the following readdb output:
060110 162140 TOTAL urls:   33173
060110 162140 avg score:1.026
060110 162140 max score:22.083
060110 162140 min score:1.0
060110 162140 retry 0:  28381
060110 162140 retry 1:  4792
060110 162140 status 1 (DB_unfetched):  23136
060110 162140 status 2 (DB_fetched):9234
060110 162140 status 3 (DB_gone):   803
060110 162140 CrawlDb statistics: done
How come the *DB_fetched* is about 3x more and the *TOTAL urls* goes
beyond 25000???
It doesn't make any sense.  I'd expect to see results similar to the ones
I got with the other mapred settings.

Thank you,
Florent


Re: is nutch recrawl possible?

2005-12-19 Thread Florent Gluck
Pushpesh,

We extended nutch with a whitelist filter which you might find useful.
Check the comments from Matt Kangas here:
http://issues.apache.org/jira/browse/NUTCH-87?page=all
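
(For a rough idea of what such an extension looks like: a whitelist filter is
just a URLFilter plugin whose filter() method returns the url to keep it and
null to drop it. The sketch below is illustrative only; the class name and the
hard-coded prefixes are made up, the real filter attached to NUTCH-87 is more
elaborate, and the interface shown is the simple 0.7-era one, so depending on
the Nutch version the plugin also needs the usual plugin.xml registration and
the Configurable methods.)

import org.apache.nutch.net.URLFilter;

// Illustrative whitelist filter: a url is kept only if it starts with one of
// the allowed prefixes; returning null tells Nutch to drop the url.
public class WhitelistURLFilter implements URLFilter {

  // Hypothetical hard-coded list; a real plugin would read these from a config file.
  private static final String[] ALLOWED = {
    "http://www.moma.org/",
    "http://www.osnews.com/"
  };

  public String filter(String urlString) {
    for (int i = 0; i < ALLOWED.length; i++) {
      if (urlString.startsWith(ALLOWED[i])) {
        return urlString;   // accepted unchanged
      }
    }
    return null;            // rejected
  }
}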

--Flo

Pushpesh Kr. Rajwanshi wrote:

hmmm... actually my requirement is a bit more complex than it seems, so url
filters alone probably wouldn't do. I am not filtering urls based only
on some domain name; within a domain I want to discard some urls, and since
they don't actually follow a pattern I can't use url filters, otherwise
url filters would have done a great job.

Thanks anyway
Pushpesh


On 12/19/05, Håvard W. Kongsgård [EMAIL PROTECTED] wrote:
  

About this blocking, you can try to use the url filters and change the
filter between each fetch/generate cycle:

+^http://www.abc.com

-^http://www.bbc.co.uk


Pushpesh Kr. Rajwanshi wrote:



Oh, this is pretty good and quite helpful material, just what I wanted. Thanks Havard
for this. Seems like this will help me write code for the stuff I need :-)

Thanks and Regards,
Pushpesh



On 12/19/05, Håvard W. Kongsgård [EMAIL PROTECTED] wrote:


  

Try using the whole-web fetching method instead of the crawl method.

http://lucene.apache.org/nutch/tutorial.html#Whole-web+Crawling

http://wiki.media-style.com/display/nutchDocu/quick+tutorial


Pushpesh Kr. Rajwanshi wrote:





Hi Stefan,

Thanks for the lightning fast reply. I was amazed to see such a quick
response, really appreciate it.

Actually, what I am really looking for is this: suppose I run a crawl for
some time, over say 5 sites and for some depth, say 2. Then what I want is
that the next time I run a crawl it should reuse the webdb contents which it
populated the first time. (Assuming a successful crawl. Yes, you are right,
a suddenly broken-down crawl won't work as it has lost its integrity of data.)

As you said, we can run the tools provided by nutch to do the step-by-step
commands needed to crawl, but isn't there some way I can reuse the existing
crawl data? Maybe it involves changing code but that's ok. Just one more
quick question: why does every crawl need a new directory, and isn't there
an option to at least reuse the webdb? Maybe I am asking something silly
but I am clueless :-(

Or, as you said, maybe what I can do is to explore the steps you mentioned
and get what I need.

Thanks again,
Pushpesh


On 12/19/05, Stefan Groschupf [EMAIL PROTECTED] wrote:




  

It is difficult to answer your question since the vocabulary used may be
wrong.
You can refetch pages, no problem. But you cannot continue a crashed
fetch process.
Nutch provides a tool that runs a set of steps: segment
generation, fetching, db updating, etc.
So first try to run these steps manually instead of using the
crawl command.
Then you may already get an idea of where you can jump in to grab
the data you need.

Stefan

On 19.12.2005 at 14:46, Pushpesh Kr. Rajwanshi wrote:







Hi,

I am crawling some sites using nutch. My requirement is that when I run
a nutch crawl, it should somehow be able to reuse the data in the webdb
populated in the previous crawl.

In other words, my question is: suppose my crawl is running and I cancel it
somewhere in the middle, is there some way I can resume the crawl?

I don't know if I can do this at all, but if there is some way then please
throw some light on this.

TIA

Regards,
Pushpesh




  









  








  




  




java.io.IOException in dedup (map reduce)

2005-12-15 Thread Florent Gluck
Hi,

I'm using the map reduce branch, 1 master and 3 slaves, and they are
configured the standard way (master as jobtracker + namenode).
After having created an index, I run dedup on it, but I get an
IOException.  Here is an extract of the log:

051215 160733 Dedup: starting
051215 160733 Dedup: adding indexes in: index
051215 160734 parsing file:/home/epile/nutch-0.8-dev/conf/nutch-default.xml
051215 160734 parsing file:/home/epile/nutch-0.8-dev/conf/mapred-default.xml
051215 160734 parsing file:/home/epile/nutch-0.8-dev/conf/nutch-site.xml
051215 160734 parsing file:/home/epile/nutch-0.8-dev/conf/nutch-default.xml
051215 160734 parsing file:/home/epile/nutch-0.8-dev/conf/nutch-site.xml
051215 160734 Client connection to 127.0.0.1:4: starting
051215 160734 Client connection to 127.0.0.1:5: starting
051215 160735 Running job: job_a4k3en
051215 160736  map 0%
051215 160753  map 17%
051215 160803  reduce 100%
Exception in thread main java.io.IOException: Job failed!
at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
at
org.apache.nutch.crawl.DeleteDuplicates.dedup(DeleteDuplicates.java:312)
at
org.apache.nutch.crawl.DeleteDuplicates.main(DeleteDuplicates.java:349)

I also ran the exact same crawl locally on one single machine w/o using
ndfs and it works fine.

Has anyone else encountered this exception?

Thanks,
--Flo


nutch mapred + tomcat and a couple other questions

2005-12-12 Thread Florent Gluck
Sorry for such a basic question, but how do we run a search on a
generated index?
I read about how to set up tomcat w/ nutch < 0.8: you have to run it
in the directory where the index resides (apparently it looks in the
segments dir from where it's run).  However this won't work w/ nutch 0.8
since the index is in ndfs.  I then tried to copy it locally using ndfs
-copyToLocal, but without any luck.
Also, how do you use nutch search?
Is NutchBean the closest thing to segread from nutch < 0.8?
Finally, what's the purpose of nutch parse and how do you use it?

Thanks,
--Flo


Re: nutch mapred + tomcat and a couple other questions

2005-12-12 Thread Florent Gluck
Err, my mistake: I meant nutch server, not nutch search.

--Flo

Florent Gluck wrote:

Sorry for such a basic question, but how do we run a search on a
generated index?
I read about how to set up tomcat w/ nutch < 0.8: you have to run it
in the directory where the index resides (apparently it looks in the
segments dir from where it's run).  However this won't work w/ nutch 0.8
since the index is in ndfs.  I then tried to copy it locally using ndfs
-copyToLocal, but without any luck.
Also, how do you use nutch search?
Is NutchBean the closest thing to segread from nutch < 0.8?
Finally, what's the purpose of nutch parse and how do you use it?

Thanks,
--Flo

  




Re: nutch mapred + tomcat and a couple other questions

2005-12-12 Thread Florent Gluck
Never mind, I got tomcat working.
After looking at the code, it seems nutch parse does nothing yet.
The last remaining thing is how to use NutchBean to output the segments'
content.
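
(In case it helps anyone searching the archive: a rough sketch of driving a
search from code with the 0.8-era NutchBean. The class and method names are
from memory of that API, and the mapred branch of this date may differ, e.g.
NutchConf instead of Configuration, so check it against your revision.)

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.util.NutchConfiguration;

// Rough sketch of a command-line search; searcher.dir in nutch-site.xml
// has to point at the crawl directory (index/, segments/, linkdb/).
public class SearchSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    NutchBean bean = new NutchBean(conf);
    Query query = Query.parse(args[0], conf);
    Hits hits = bean.search(query, 10);
    for (int i = 0; i < hits.getLength(); i++) {
      Hit hit = hits.getHit(i);
      HitDetails details = bean.getDetails(hit);
      // "url" and "title" are fields written by the basic indexing filter
      System.out.println(details.getValue("url") + "\t" + details.getValue("title"));
    }
  }
}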

Thanks,
--Flo

Florent Gluck wrote:

Sorry for such a basic question, but how do we run a search on a
generated index?
I read about how to set up tomcat w/ nutch < 0.8: you have to run it
in the directory where the index resides (apparently it looks in the
segments dir from where it's run).  However this won't work w/ nutch 0.8
since the index is in ndfs.  I then tried to copy it locally using ndfs
-copyToLocal, but without any luck.
Also, how do you use nutch search?
Is NutchBean the closest thing to segread from nutch < 0.8?
Finally, what's the purpose of nutch parse and how do you use it?

Thanks,
--Flo

  




Incremental crawl w/ map reduce

2005-12-09 Thread Florent Gluck
Hi,

As a test, I recently did a quick incremental crawl.  First, I did a
crawl with 10 seed urls using 4 nodes (1 jobTracker/nameNode + 3
taskTrackers/dataNodes).  So far, so good: the fetches were distributed
among the 3 nodes (3/3/4) and a segment was generated.  Running a quick
-stats on the crawldb showed me the 10 links were there.  I also did a
dump and everything was fine.
Then, I injected a new url and crawled again, generating a second segment.
While it was running, I looked at the logs expecting to only see the
fetch of the new url I added, but instead I saw it was fetching all the
previous urls again.
Why is that ?  These were already fetched and my understanding is that
they should only be fetched again after 30 days (or whatever value is
specified in nutch-site.xml).
What am I missing here?

Thanks,
Flo