Beginners Hadoop question

2014-03-03 Thread goi cto
Hi,

I am sorry for the beginner's question, but...
I have Spark Java code which reads a file (c:\my-input.csv), processes it,
and writes an output file (my-output.csv).
Now I want to run it on Hadoop in a distributed environment.
1) Should my input file be one big file or several smaller files?
2) If we use smaller files, how does my code need to change to
process all of the input files?

Will Hadoop just copy the files to different servers or will it also split
their content among servers?

Any example will be great!
-- 
Eran | CTO


Re: Beginners Hadoop question

2014-03-03 Thread Alonso Isidoro Roman
Hi, I am a beginner too, but as far as I have learned, Hadoop works better
with big files, at least 64 MB, 128 MB or even more. I think you need to
aggregate all the files into one new big file. Then you must copy it to HDFS
using this command:

hadoop fs -put MYFILE /YOUR_ROUTE_ON_HDFS/MYFILE

This command copies MYFILE into the Hadoop Distributed File System (HDFS),
which stores the file as blocks spread across the nodes of the cluster.
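
The two steps above (merge the small files, then upload the result) could be
sketched in plain Java roughly as follows. This is only an illustration, not
part of the original code: the class name MergeCsv, the method names, and the
128 MB block size are assumptions, and the merge assumes the CSV files share
the same columns and have no header row.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: merges small CSV files before uploading them to HDFS.
class MergeCsv {

    // Concatenates every *.csv file found directly in inputDir into output.
    // Assumes no header rows; output should live outside inputDir.
    // Returns the number of lines written.
    static long merge(Path inputDir, Path output) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(inputDir, "*.csv")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        Collections.sort(parts); // fixed order so the result is reproducible

        long lines = 0;
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path part : parts) {
                try (BufferedReader in = Files.newBufferedReader(part, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        out.write(line);
                        out.newLine();
                        lines++;
                    }
                }
            }
        }
        return lines;
    }

    // Number of blocks a file of fileBytes occupies for a given block size
    // (ceiling division). The block size is a cluster setting, assumed here.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) throws IOException {
        Path inputDir = Paths.get(args[0]);
        Path output = Paths.get(args[1]);
        long lines = merge(inputDir, output);
        long blocks = blockCount(Files.size(output), 128L * 1024 * 1024);
        System.out.println(lines + " lines merged, about " + blocks + " block(s) at 128 MB");
    }
}
```

The merged file can then be uploaded with the hadoop fs -put command shown
above. With a 128 MB block size, for instance, a 300 MB file would occupy
three blocks, and those blocks are what gets distributed among the servers.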

Can I recommend what I did? Go to BigDataUniversity.com and take
the Hadoop Fundamentals I course. It is free and very well documented.

Regards

Alonso Isidoro Roman.

My favorite quotes (today):
If debugging is the process of removing software bugs, then programming
must be the process of putting ...
  - Edsger Dijkstra

If you pay peanuts you get monkeys


