Re: Question about crawldb and segments
Hey Doug,

I'm finally picking this up again, and I believe I have everything configured as you suggested: I'm running dfs at the indexing datacenter, along with a job tracker/namenode/task trackers. I'm running a job tracker/namenode/task trackers at the crawling datacenter, and it's all pointed to use dfs at the indexing datacenter.

If I have mapreduce set to run locally, everything seems to be working fine. But if I set mapred.job.tracker to the job tracker that's running in the crawl datacenter, I get this when trying to run a fetch:

060430 155743 Waiting to find target node: crawl01/XX.XX.XX.XX:50010

For some reason, even though I'm specifying fs.default.name as my other datacenter, it's trying to make itself a dfs node as well. I was experimenting with some of the shell scripts, and it seems that in any configuration where this server is a task tracker, it also tries to be a dfs node. Am I missing something, or is there no way to make this server only a task tracker and not a dfs node too? Let me know if this makes sense; it's even a little confusing for me, and I set it up :)

Thanks!
Jason

Doug Cutting wrote:
> Jason Camp wrote:
>> Unfortunately in our scenario, bw is cheap at our fetching datacenter,
>> but adding additional disk capacity is expensive - so we are fetching
>> the data and sending it back to another cluster (by exporting segments
>> from ndfs, copy, importing).
>
> But to perform the copies, you're using a lot of bandwidth to your
> indexing datacenter, no? Copying segments probably takes almost as much
> bandwidth as fetching them...
>
>> I know this sounds a bit messy, but it was the only way we could come
>> up with to utilize the benefits of both datacenters. Ideally, I'd love
>> to be able to have all of the servers in one cluster, and define which
>> servers I want to perform which tasks, so for instance we could use one
>> group of servers to fetch the data, but the other group of servers to
>> store the data and perform the indexing/etc. If there's a better way to
>> do something like this than what we're doing, or if you think we're
>> just insane for doing it this way, please let me know :) Thanks!
>
> You can use different sets of machines for dfs and MapReduce, by
> starting them in differently configured installations. So you could run
> dfs only in your indexing datacenter, and MapReduce in both datacenters,
> configured to talk to the same dfs at the indexing datacenter. Then your
> fetch tasks at the fetch datacenter would write their output to the
> indexing datacenter's dfs. And parse/updatedb/generate/index/etc. could
> all run at the other datacenter. Does that make sense?
>
> Doug
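For what it's worth, the split Doug describes boils down to two config properties pointing at different places: fs.default.name at the indexing datacenter's namenode, and mapred.job.tracker at the local job tracker. A sketch of what the overrides in hadoop-site.xml (or the equivalent nutch-site.xml) on a crawl-datacenter machine might look like — the host names and ports here are made up:

```xml
<!-- hadoop-site.xml on a crawl-datacenter machine.
     Host names and ports are hypothetical examples. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- dfs lives only at the indexing datacenter -->
    <value>index-namenode.example.com:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <!-- MapReduce runs locally at the crawl datacenter -->
    <value>crawl-jobtracker.example.com:9001</value>
  </property>
</configuration>
```

With a config like this, the crawl-datacenter machines would presumably need to start only the task tracker daemon (rather than the all-in-one start script, which also brings up datanodes on the slaves), so that no dfs node comes up locally.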
Re: Question about crawldb and segments
Sorry, I am on holiday until the 8th of May. Please contact [EMAIL PROTECTED] for urgent matters.

Kind regards,
Herman
Re: Question about crawldb and segments
Jason Camp wrote:
> Unfortunately in our scenario, bw is cheap at our fetching datacenter,
> but adding additional disk capacity is expensive - so we are fetching
> the data and sending it back to another cluster (by exporting segments
> from ndfs, copy, importing).

But to perform the copies, you're using a lot of bandwidth to your indexing datacenter, no? Copying segments probably takes almost as much bandwidth as fetching them...

> I know this sounds a bit messy, but it was the only way we could come
> up with to utilize the benefits of both datacenters. Ideally, I'd love
> to be able to have all of the servers in one cluster, and define which
> servers I want to perform which tasks, so for instance we could use one
> group of servers to fetch the data, but the other group of servers to
> store the data and perform the indexing/etc. If there's a better way to
> do something like this than what we're doing, or if you think we're
> just insane for doing it this way, please let me know :) Thanks!

You can use different sets of machines for dfs and MapReduce, by starting them in differently configured installations. So you could run dfs only in your indexing datacenter, and MapReduce in both datacenters, configured to talk to the same dfs at the indexing datacenter. Then your fetch tasks at the fetch datacenter would write their output to the indexing datacenter's dfs. And parse/updatedb/generate/index/etc. could all run at the other datacenter.

Does that make sense?

Doug
Question about crawldb and segments
Hi,

I've been using Nutch 0.7 for a few months, and recently started working with 0.8. I'm testing everything right now on a single server, using the local file system. I generated 10 segments with 100k URLs in each, and fetched the content. Then I ran updatedb, but it looks like the crawldb isn't working properly. For example, I ran the updatedb command on one segment, and -stats shows this:

060409 140035 status 1 (DB_unfetched): 1732457
060409 140035 status 2 (DB_fetched): 82608
060409 140035 status 3 (DB_gone): 3447

I then ran updatedb against the next segment, and -stats now shows this:

060409 150737 status 1 (DB_unfetched): 1777642
060409 150737 status 2 (DB_fetched): 81629
060409 150737 status 3 (DB_gone): 3377

Any idea why the number of fetched URLs would actually go down? What I *think* is happening is that the crawldb only contains the data from the most recent update, not the accumulated results of all of them. Does this make sense? I ran the test doing each segment and running -stats, and they are all around 80k fetched and 1.7m unfetched, but the numbers don't seem to be accumulating.

Since readsegs is broken in 0.8, I can't really get an idea of what is actually in the segments. Is there an alternative way to see how many URLs are actually in a segment and fetched? If you have any ideas, please let me know.

Thanks a lot!
Jason
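As I understand it, updatedb is supposed to be a merge, not a replacement: each run folds one segment's fetch results into the existing crawldb, so DB_fetched should only grow as successive segments are applied. A toy sketch of that merge semantics — plain Python, not Nutch's actual code, with the status model heavily simplified:

```python
# Toy model of crawldb update semantics (illustrative only, not Nutch code).
# The crawldb maps url -> status; updatedb should MERGE a segment's results
# into it, keeping the best status seen so far, not rebuild the db from
# the single segment being applied.

UNFETCHED, FETCHED, GONE = 1, 2, 3

def updatedb(crawldb, segment):
    """Fold one segment's fetch results into the crawldb in place."""
    for url, status in segment.items():
        # A definitive status (fetched/gone) is recorded; a url already in
        # the db is never demoted back to unfetched by a later merge.
        if status != UNFETCHED or url not in crawldb:
            crawldb[url] = status
    return crawldb

def stats(crawldb):
    """Count urls per status, like readdb -stats."""
    counts = {UNFETCHED: 0, FETCHED: 0, GONE: 0}
    for status in crawldb.values():
        counts[status] += 1
    return counts

# Two segments fetched from disjoint url lists:
db = {}
updatedb(db, {"http://a/": FETCHED, "http://b/": FETCHED, "http://c/": GONE})
updatedb(db, {"http://d/": FETCHED, "http://e/": UNFETCHED})

# After applying both segments the counts accumulate (3 fetched, 1 unfetched,
# 1 gone); they should not reset to roughly one segment's worth per run.
```

Under that model, stats after the second updatedb run should show the fetched counts of both segments combined; the numbers above, which hover around a single segment's totals, look more like the db is being rebuilt from the last segment alone.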