Re: Question about crawldb and segments

2006-04-30 Thread Jason Camp

Hey Doug,
   I'm finally picking this up again, and I believe I have everything 
configured as you suggested - I'm running dfs at the indexing 
datacenter, along with a job tracker/namenode/task trackers. I'm running 
a job tracker/namenode/task trackers at the crawling datacenter, and it's 
all pointed to use dfs at the indexing datacenter.
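
For reference, the relevant part of my config looks roughly like the 
excerpt below - this is in the conf/hadoop-site.xml of the Hadoop build 
bundled with my Nutch 0.8 checkout, and the host names are sanitized 
placeholders, not my real machines:

  <property>
    <name>fs.default.name</name>
    <!-- namenode at the indexing datacenter (placeholder host) -->
    <value>namenode.indexing.example.com:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <!-- job tracker at the crawling datacenter (placeholder host) -->
    <value>jobtracker.crawl.example.com:9001</value>
  </property>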


If I have MapReduce set to run locally (mapred.job.tracker set to 
"local"), everything seems to be working fine. But if I set 
mapred.job.tracker to the job tracker that's running in the crawl 
datacenter, I get this when trying to run a fetch:


060430 155743 Waiting to find target node: crawl01/XX.XX.XX.XX:50010

For some reason, even though I'm pointing fs.default.name at the 
other datacenter, this server is trying to make itself a dfs node too. I 
was experimenting with some of the shell scripts, and it seems that in 
any configuration where this server is a task tracker, it also tries to 
be a dfs node. Am I missing something, or is there no way to make this 
server only a task tracker and not a dfs node as well?
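
A sketch of the distinction in question, assuming the daemon scripts 
that shipped with the Hadoop builds of that era (check your bin/ 
directory for the exact names):

  # start-all.sh brings up dfs AND MapReduce daemons on every host
  # listed in conf/slaves - datanodes included:
  bin/start-all.sh

  # whereas the per-daemon helper can start just a task tracker on one
  # machine, with no datanode:
  bin/hadoop-daemon.sh start tasktracker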


Let me know if this makes sense - it's even a little confusing for me, 
and I set it up :)


Thanks!

Jason


Doug Cutting wrote:


Jason Camp wrote:


Unfortunately in our scenario, bandwidth is cheap at our fetching 
datacenter, but adding additional disk capacity is expensive - so we are 
fetching the data and sending it back to another cluster (by exporting 
segments from ndfs, copying them over, and importing them).



But to perform the copies, you're using a lot of bandwidth to your 
indexing datacenter, no?  Copying segments probably takes almost as 
much bandwidth as fetching them...



I know this sounds a bit messy, but it was the only way we could
come up with to utilize the benefits of both datacenters. Ideally, I'd
love to be able to have all of the servers in one cluster, and define
which servers I want to perform which tasks - so, for instance, we could
use one group of servers to fetch the data, and the other group of
servers to store the data and perform the indexing/etc. If there's a
better way to do something like this than what we're doing, or if you
think we're just insane for doing it this way, please let me know :) 
Thanks!



You can use different sets of machines for dfs and MapReduce, by 
starting them in differently configured installations.  So you could 
run dfs only in your indexing datacenter, and MapReduce in both 
datacenters configured to talk to the same dfs, at the indexing 
datacenter.  Then your fetch tasks at the fetching datacenter would 
write their output to the indexing datacenter's dfs.  And 
parse/updatedb/generate/index/etc. could all run at the other 
datacenter.  Does that make sense?


Doug






Re: Question about crawldb and segments

2006-04-12 Thread Doug Cutting

Jason Camp wrote:

Unfortunately in our scenario, bandwidth is cheap at our fetching 
datacenter, but adding additional disk capacity is expensive - so we are 
fetching the data and sending it back to another cluster (by exporting 
segments from ndfs, copying them over, and importing them).


But to perform the copies, you're using a lot of bandwidth to your 
indexing datacenter, no?  Copying segments probably takes almost as 
much bandwidth as fetching them...



I know this sounds a bit messy, but it was the only way we could
come up with to utilize the benefits of both datacenters. Ideally, I'd
love to be able to have all of the servers in one cluster, and define
which servers I want to perform which tasks - so, for instance, we could
use one group of servers to fetch the data, and the other group of
servers to store the data and perform the indexing/etc. If there's a
better way to do something like this than what we're doing, or if you
think we're just insane for doing it this way, please let me know :) Thanks!


You can use different sets of machines for dfs and MapReduce, by 
starting them in differently configured installations.  So you could run 
dfs only in your indexing datacenter, and MapReduce in both 
datacenters configured to talk to the same dfs, at the indexing 
datacenter.  Then your fetch tasks at the fetching datacenter would write 
their output to the indexing datacenter's dfs.  And 
parse/updatedb/generate/index/etc. could all run at the other 
datacenter.  Does that make sense?


Doug
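
A minimal sketch of the daemon layout Doug describes, assuming the 
start scripts from the Hadoop builds of that era (script names may 
differ in your installation; hadoop-daemons.sh runs the given command 
on every host in conf/slaves):

  # indexing datacenter: dfs plus its own MapReduce cluster
  bin/hadoop-daemon.sh start namenode
  bin/hadoop-daemons.sh start datanode
  bin/hadoop-daemon.sh start jobtracker
  bin/hadoop-daemons.sh start tasktracker

  # crawling datacenter: MapReduce only, with fs.default.name in this
  # installation's conf pointing at the indexing datacenter's namenode
  bin/hadoop-daemon.sh start jobtracker
  bin/hadoop-daemons.sh start tasktracker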


Question about crawldb and segments

2006-04-09 Thread Jason Camp

Hi,
   I've been using Nutch 0.7 for a few months, and recently started 
working with 0.8.  I'm testing everything right now on a single server, 
using the local file system.  I generated 10 segments with 100k urls in 
each, and fetched the content. Then I ran updatedb, but it looks like 
the crawldb isn't working properly. For example, I ran the updatedb 
command on one segment, and -stats shows this:


060409 140035 status 1 (DB_unfetched):  1732457
060409 140035 status 2 (DB_fetched):    82608
060409 140035 status 3 (DB_gone):       3447

I then ran the updatedb against the next segment, and -stats now shows this:

060409 150737 status 1 (DB_unfetched):  1777642
060409 150737 status 2 (DB_fetched):    81629
060409 150737 status 3 (DB_gone):       3377


Any idea why the number of fetched urls would actually go down? What I 
*think* is happening is that the crawldb only contains the data from the 
most recent updatedb, rather than accumulating across segments. Does 
this make sense? I ran the test on each segment, running -stats after 
each, and they all show around 80k fetched and 1.7m unfetched - the 
numbers don't seem to be accumulating.
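
For reference, the invocations were along these lines (the paths and 
segment name below are placeholders for my actual layout):

  # merge one segment's fetch results into the crawldb:
  bin/nutch updatedb crawl/crawldb crawl/segments/20060409140035

  # then dump the counters:
  bin/nutch readdb crawl/crawldb -stats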


Since readseg is broken in 0.8 right now, I can't really get an idea of 
what is actually in the segments. Is there an alternative way to see how 
many urls are actually in a segment, and how many of them were fetched?
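
For later readers: in 0.8 builds where SegmentReader does work, 
something like the sketch below answers this (the segment name is a 
placeholder; -list reports per-segment generated/fetched/parsed counts):

  # per-segment counts, including how many urls were fetched:
  bin/nutch readseg -list crawl/segments/20060409140035

  # or dump the whole segment to plain text for inspection by hand:
  bin/nutch readseg -dump crawl/segments/20060409140035 segdump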


If you have any ideas, please let me know. Thanks a lot!

Jason