HDFS - many files, small size

Roger Maillist Thu, 02 Oct 2014 01:12:56 -0700

Hi there
I have millions of rather small PDF files which I want to load into HDFS for
later analysis. I also need to re-encode them as a base64 stream to get the
MR job that parses them to work.


Is there any better/faster method than just calling the 'put' command in a
huge (bash) loop? Maybe I could implement the encoding and loading as an MR
job itself?
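
To make the idea concrete, here is a rough sketch of one alternative to the
per-file loop: base64-encode the PDFs locally and pack them into a single
line-oriented file, so that only one large file has to be uploaded. This is an
untested assumption about what might work, not a known recipe. The function
name pack_pdfs and the "relative path, TAB, base64" record layout are made up
for illustration, and file names are assumed to contain no tabs or newlines.

```python
import base64
import os


def pack_pdfs(src_dir, out_path):
    """Write one 'relative_path<TAB>base64' line per PDF found under src_dir.

    Returns the number of PDFs packed. Hypothetical helper, for illustration.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for root, _dirs, files in os.walk(src_dir):
            for name in sorted(files):
                # Skip anything that is not a PDF.
                if not name.lower().endswith(".pdf"):
                    continue
                path = os.path.join(root, name)
                with open(path, "rb") as f:
                    # b64encode emits no newlines, so each record stays on one line.
                    encoded = base64.b64encode(f.read()).decode("ascii")
                rel = os.path.relpath(path, src_dir)
                out.write(f"{rel}\t{encoded}\n")
                count += 1
    return count
```

The packed file could then be loaded with a single 'hdfs dfs -put' instead of
one call per PDF, and an MR job could read it line by line. For millions of
files it would probably make sense to split the output into several files of
a few HDFS blocks each.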

The second thing is, according to a Cloudera blog post I read, it's a bad idea
to store small files on HDFS, especially if there are large numbers of them.
They recommend HBase instead. However, I want to take further action via
HCatalog...

Thanks for your suggestions
Roger
