Setting up Spark/flume/? to Ingest 10TB from FTP

2015-08-14 Thread Varadhan, Jawahar
What is the best way to bring such a huge file from an FTP server into Hadoop and persist it in HDFS? Since a single JVM process might run out of memory, I was wondering if I could use Spark or Flume to do this. Any help on this matter is appreciated. I prefer an application/process running inside …

Re: Setting up Spark/flume/? to Ingest 10TB from FTP

2015-08-14 Thread Marcelo Vanzin
Why do you need to use Spark or Flume for this? You can just use curl and hdfs:

  curl ftp://blah | hdfs dfs -put - /blah

On Fri, Aug 14, 2015 at 1:15 PM, Varadhan, Jawahar varad...@yahoo.com.invalid wrote:
> What is the best way to bring such a huge file from an FTP server into Hadoop to …
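
A slightly fuller sketch of that streaming approach, with the host, credentials, and paths below as placeholders (it assumes the hdfs client is on the PATH and the target directory already exists):

  # Stream the file from the FTP server straight into HDFS without
  # staging it on local disk; nothing needs to fit in memory.
  curl -sS "ftp://user:password@ftp.example.com/path/bigfile.dat" \
    | hdfs dfs -put - /data/incoming/bigfile.dat

Because both curl and the hdfs client stream the data, the file size itself is not the limit; the main constraints are network bandwidth and keeping a single connection alive for the whole transfer.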

Re: Setting up Spark/flume/? to Ingest 10TB from FTP

2015-08-14 Thread Jörn Franke
Well, what do you do in case of failure? I think one should use a professional ingestion tool that ideally does not need to reload everything after a failure and that verifies, via checksums, that the file has been transferred correctly. I am not sure whether Flume supports FTP, but SSH/SCP should be …
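
To make the failure concern concrete, here is one hedged sketch of a restartable pull (not Flume): it stages the file locally with curl's resume support, checks it against a checksum file that is assumed to be published next to the data on the server, and only then loads it into HDFS. All host names and paths are placeholders, and it needs roughly 10 TB of local scratch space.

  # Resume the download if it was interrupted (-C - continues from the
  # current size of the partial file instead of starting over).
  curl -sS -C - -o /scratch/bigfile.dat "ftp://ftp.example.com/path/bigfile.dat"
  # Fetch the checksum file assumed to be published alongside the data.
  curl -sS -o /scratch/bigfile.dat.md5 "ftp://ftp.example.com/path/bigfile.dat.md5"
  # Verify the local copy before touching HDFS.
  (cd /scratch && md5sum -c bigfile.dat.md5)
  # Load into HDFS only after the checksum matches.
  hdfs dfs -put -f /scratch/bigfile.dat /data/incoming/bigfile.dat

The trade-off against the single streaming pipe is exactly the one raised above: the pipe avoids local staging but has to reload everything if the connection drops, while the staged version can resume mid-file at the cost of scratch space.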