Hi there, I have millions of rather small PDF files that I want to load into HDFS for later analysis. I also need to re-encode them as base64 streams so that the MR job that parses them works.
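To make the encoding step concrete, here is a minimal sketch of what I mean by re-encoding each PDF as base64. The record layout (file name, a tab, then the payload on a single line) and the function names `encode_record` and `encode_directory` are just assumptions for illustration, not something any Hadoop tool prescribes.

```python
import base64
import sys
from pathlib import Path


def encode_record(name: str, data: bytes) -> str:
    """Return one text record: file name, a tab, and the base64 payload."""
    # b64encode never inserts line breaks, so each record stays on one line.
    payload = base64.b64encode(data).decode("ascii")
    return f"{name}\t{payload}"


def encode_directory(directory: str):
    """Yield one record per *.pdf file found directly in the directory."""
    for path in sorted(Path(directory).glob("*.pdf")):
        yield encode_record(path.name, path.read_bytes())


if __name__ == "__main__" and len(sys.argv) > 1:
    # Print one line per PDF; the directory is given as the first argument.
    for record in encode_directory(sys.argv[1]):
        print(record)
```

With one record per line, many small PDFs could end up in a few large text files instead of millions of tiny ones, assuming the parsing job can split records on the tab.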
Is there a better/faster method than just calling 'put' in a huge (bash) loop? Maybe I could implement the encoding and loading as an MR job itself?

Second, according to a Cloudera blog post I read, it's a bad idea to store small files on HDFS, especially in large numbers. They recommend HBase instead. However, I want to process the data further via HCatalog...

Thanks for your suggestions,
Roger
