Hi there,
I have millions of rather small PDF files that I want to load into HDFS for
later analysis. I also need to re-encode them as base64 streams so that the
MR job that parses them works.

Is there a better/faster method than just calling the 'put' command in a
huge (bash) loop? Maybe I could implement the encoding and loading as an MR
job itself?
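
To make the encoding step concrete, here is a minimal sketch of one possible
approach, not a tested solution: encode the PDFs locally and pack them into a
single text file with one record per line, so that one large file is uploaded
instead of millions of small ones. The function name pack_pdfs, the "*.pdf"
pattern, and the "filename<TAB>base64" line layout are all assumptions made
for illustration only.

```python
import base64
from pathlib import Path


def pack_pdfs(src_dir, out_path):
    """Write one 'filename<TAB>base64' line per PDF found in src_dir.

    Hypothetical helper: the record layout is an assumption, chosen so
    that a line-oriented MR job could read one PDF per input line.
    Returns the number of PDFs written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # sorted() only makes the output order deterministic
        for pdf in sorted(Path(src_dir).glob("*.pdf")):
            # b64encode emits no line breaks, so each PDF stays on one line
            encoded = base64.b64encode(pdf.read_bytes()).decode("ascii")
            out.write(f"{pdf.name}\t{encoded}\n")
            count += 1
    return count
```

The packed file could then be loaded with a single 'put' call rather than
one call per PDF. Whether the resulting line lengths suit the parsing job
would still need to be checked.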

The second thing: according to a Cloudera blog post I read, it's a bad idea
to store small files on HDFS, especially in large numbers. They recommend
HBase instead. However, I want to work with the data further via
HCatalog...

Thanks for your suggestions,
Roger
