RE: I need advice on whether my starting data needs to be in HDFS

2014-05-19 Thread Christoph Schmitz
That's pre-processing. I.e., yes, before you run your job, when you push the file from local file system into HDFS then it is reading from a single machine. However, presumably the processing your

Re: I need advice on whether my starting data needs to be in HDFS

2014-05-19 Thread David Rosenstrauch
That's pre-processing. I.e., yes, before you run your job, when you push the file from the local file system into HDFS, it is read from a single machine. However, presumably the processing your job does is the (far) more lengthy activity, and so running the job can take much less time if t

Re: I need advice on whether my starting data needs to be in HDFS

2014-05-19 Thread Steve Lewis
I understand that, although I can write an InputFormat / splitter which starts with a file in NFS, not HDFS. Also, if I count the import to HDFS as part of the processing, haven't I gone to a single machine to read all the data? On Mon, May 19, 2014 at 2:30 PM, David Rosenstrauch wrote: > The reason why you

Re: I need advice on whether my starting data needs to be in HDFS

2014-05-19 Thread David Rosenstrauch
The reason why you want to copy to HDFS first is that HDFS splits the data and distributes it across the nodes in the cluster. So if your input data is large, you'll get much better efficiency/speed in processing it if you're processing it in a distributed manner. (I.e., multiple machines eac
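
To put rough numbers on the splitting described above, here is a back-of-the-envelope sketch in Python. It is not taken from the thread: the 1 TiB input size, the 128 MiB block size, and the 200 parallel task slots are illustrative assumptions, and "one map task per block" is a simplification of how input splits are actually computed.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division (avoids float rounding on large byte counts)."""
    return -(-a // b)


def count_blocks(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of blocks a file of the given size would occupy."""
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    return ceil_div(max(file_size_bytes, 0), block_size_bytes)


def task_waves(num_blocks: int, parallel_slots: int) -> int:
    """Rounds of tasks needed if each block gets one task (simplification)."""
    if parallel_slots <= 0:
        raise ValueError("need at least one task slot")
    return ceil_div(num_blocks, parallel_slots)


MIB = 1024 ** 2
TIB = 1024 ** 4

if __name__ == "__main__":
    # All three figures below are assumptions chosen for illustration.
    blocks = count_blocks(1 * TIB, 128 * MIB)
    waves = task_waves(blocks, 200)
    print(f"{blocks} blocks, {waves} waves of tasks")
```

Under these assumptions the input breaks into 8192 blocks, which 200 slots would work through in 41 waves. A single machine reading the same file would have to read all 8192 blocks one after another.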

Re: I need advice on whether my starting data needs to be in HDFS

2014-05-19 Thread David Rosenstrauch
On 05/15/2014 04:45 PM, Steve Lewis wrote: I have a medium-sized data set in the terabyte range that currently lives on the NFS file server of a medium institution. Every few months we want to run a chain of five Hadoop jobs on this data. The cluster is medium sized - 40 nodes about 200 simu