Re: Nutch Hadoop question
Hi All, I don't want to bother you guys too much... I've tried searching for this topic and doing some testing myself, but so far I've been quite unsuccessful. Basically, I wish to use some computers only for map-reduce processing and not for HDFS. Does anyone know how this can be done? Thanks, Eran

On Wed, Nov 11, 2009 at 12:19 PM, Eran Zinman zze...@gmail.com wrote:
> [original question trimmed; see "Nutch Hadoop question" below]
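For what it's worth, one common way to get compute-only nodes (a sketch, assuming a Hadoop 0.19/0.20-era installation as used with Nutch at the time; all paths and hostnames are illustrative) is to start only a TaskTracker on the weaker machines, so they join MapReduce without ever serving HDFS blocks:

```shell
# On each of the two weaker machines: start only the TaskTracker, not the
# DataNode, so the node runs map/reduce tasks but stores no HDFS blocks.
# fs.default.name in its config must still point at the NameNode on the
# strong machines, since tasks read and write HDFS over the network.
bin/hadoop-daemon.sh start tasktracker

# On the two strong machines the usual pair runs:
bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start tasktracker

# Alternatively, keep two slaves files and point the start scripts at them
# via the HADOOP_SLAVES variable read by slaves.sh:
#   HADOOP_SLAVES=conf/slaves.dfs    bin/start-dfs.sh     # strong machines only
#   HADOOP_SLAVES=conf/slaves.mapred bin/start-mapred.sh  # all four machines
```

One caveat: even a TaskTracker-only node still writes intermediate map output to its local mapred.local.dir, so a slow or unreliable local disk on those machines is not entirely out of the picture.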
Re: Nutch Hadoop question
Hi Eran, MapReduce has to store its data on an HDFS filesystem. But if you want to separate the two groups of servers, you could build two separate HDFS filesystems. To separate the two setups, you will need to make sure there is no cross-communication between the two parts. Cheers, Alex

Eran Zinman wrote:
> [quoted text trimmed]
Re: Nutch Hadoop question
TuxRacer69 wrote:
> Hi Eran, MapReduce has to store its data on an HDFS filesystem.

More specifically, it needs read/write access to a shared filesystem. If you are brave enough you can use NFS too, or any other type of filesystem that can be mounted locally on each node (e.g. a NetApp).

> But if you want to separate the two groups of servers, you could build two separate HDFS filesystems. To separate the two setups, you will need to make sure there is no cross-communication between the two parts.

You can run two separate clusters even on the same set of machines; just configure them to use different ports AND different local paths.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
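To make "different ports AND different local paths" concrete, here is a minimal sketch of how the second cluster's config could differ (property names are from the Hadoop 0.19/0.20-era hadoop-site.xml; all hostnames, ports, and paths are illustrative, not prescribed by the thread):

```xml
<!-- hadoop-site.xml for cluster B, running alongside cluster A
     on the same machines. Every value here must differ from
     cluster A's: the RPC ports and the local storage paths. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9100</value>
    <!-- cluster A might use hdfs://master:9000 -->
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9101</value>
    <!-- cluster A might use master:9001 -->
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/clusterB/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/clusterB/dfs/data</value>
    <!-- must not overlap cluster A's dfs.data.dir -->
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/data/clusterB/mapred/local</value>
  </property>
</configuration>
```

If both clusters' daemons really do share a machine, the per-daemon listener ports (e.g. dfs.datanode.address and the various *.http.address properties) would also need distinct values in one of the two configs, or the second daemon will fail to bind.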
Re: Nutch Hadoop question
Thanks for the help, guys.

On Fri, Nov 13, 2009 at 5:20 PM, Andrzej Bialecki a...@getopt.org wrote:
> [quoted text trimmed]
Nutch Hadoop question
Hi All, I'm using Nutch with Hadoop with great pleasure - it works great and really increases crawling performance on multiple machines.

I have two strong machines and two older machines which I would like to use. So far I've been using only the two strong machines with Hadoop. Now I would like to add the two less powerful machines to do some processing as well.

My question is: right now HDFS is shared between the two powerful computers. I don't want the two other computers to store any content, as they have slow and unreliable hard disks. I just want the two other machines to do processing (i.e. mapreduce) and not store any content. Is that possible, or do I have to use HDFS on all machines that do processing? If it's possible to use a machine only for mapreduce, how is this done?

Thank you for your help, Eran