Hi Marko,

I think there are two major problems you should care about:
1. NameNode memory
2. Job overhead

To avoid 1, I suggest storing your data in an external store such as HBase, S3, or Cassandra rather than in HDFS. For details, please refer to the following:

https://www.usenix.org/legacy/publications/login/2010-04/openpdfs/shvachko.pdf

For 2, you may want to use a higher-level language like Pig, which will automatically combine a bunch of your small inputs into larger splits (via CombineFileInputFormat):

http://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html

I have appended two small sketches below the quoted message to illustrate both points.

Thanks,
Takenori

On Fri, Apr 24, 2015 at 5:53 PM, Marko Dinic <[email protected]> wrote:
> Hello,
>
> I'm not sure if this is the place to ask this question, but I'm still
> hoping for an answer/advice.
>
> A large number of small files are uploaded, each about 8 KB. I am aware
> that this is not something you're hoping for when working with Hadoop.
>
> I was thinking about using HAR files and combined input, or sequence
> files. The problem is that the files are timestamped, and I need a
> different subset at different times - for example, one job needs to run
> on files uploaded during the last 3 months, while the next job might
> consider the last 6 months. Naturally, as time passes, a different
> subset of files is needed.
>
> This means that I would need to make a sequence file (or a HAR) each
> time I run a job, to have a smaller number of mappers. On the other
> hand, I need the original files so that I can subset them. This means
> that the NameNode is under constant pressure, keeping all of this in
> its memory.
>
> How can I solve this problem?
>
> I was also considering using Cassandra, or something like that, and
> saving the file content inside it, instead of saving it to files on
> HDFS. The file content is actually a measurement, that is, a vector of
> numbers, with some metadata.
>
> Thanks
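
For point 1, here is a rough sketch of what storing the measurements in something like HBase could look like. It assumes the HBase 1.0+ Java client; the table name "measurements", the column family "d", and the timestamp|sensor row key are only placeholders to illustrate the idea, not a recommendation for your schema. Because the row key starts with the timestamp, selecting "the last 3 months" becomes a simple row-range scan instead of re-packing files:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MeasurementStore {

  // Column family name is a placeholder.
  private static final byte[] CF = Bytes.toBytes("d");

  // Write one measurement. The row key starts with the timestamp so that
  // time-range subsets become row-range scans.
  public static void write(Table table, String isoTimestamp, String sensorId,
                           byte[] vectorBytes) throws Exception {
    Put put = new Put(Bytes.toBytes(isoTimestamp + "|" + sensorId));
    put.addColumn(CF, Bytes.toBytes("vector"), vectorBytes);
    put.addColumn(CF, Bytes.toBytes("sensor"), Bytes.toBytes(sensorId));
    table.put(put);
  }

  // Read everything uploaded between two timestamps, e.g. the last 3 months.
  public static void scanRange(Table table, String from, String to) throws Exception {
    Scan scan = new Scan(Bytes.toBytes(from), Bytes.toBytes(to));
    try (ResultScanner scanner = table.getScanner(scan)) {
      for (Result row : scanner) {
        byte[] vector = row.getValue(CF, Bytes.toBytes("vector"));
        // feed 'vector' to your job or analysis code
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("measurements"))) {
      write(table, "2015-04-24T17:53:00", "sensor-42", new byte[]{1, 2, 3});
      scanRange(table, "2015-01-01", "2015-04-01");
    }
  }
}

Cassandra would work similarly with a clustering key on the timestamp; the point is just that a time-ordered key lets each job pick its own window without keeping millions of tiny files in the NameNode's memory.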

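For point 2, if you stay on HDFS for now, a driver along these lines should cut the number of mappers. This is only a sketch using the new-API CombineTextInputFormat (the mapreduce.lib.input counterpart of the class linked above); the 128 MB split cap and the date-partitioned input directories are assumptions you would tune to your data:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinedSmallFilesJob {
  // args: one or more input directories, then the output directory.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "combined-small-files");
    job.setJarByClass(CombinedSmallFilesJob.class);

    // Pack many small files into each split, so one mapper processes many
    // files instead of one mapper per 8 KB file.
    job.setInputFormatClass(CombineTextInputFormat.class);
    // Cap each combined split at ~128 MB (tune to your block size).
    CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

    // If the files live in date-partitioned directories, add only the
    // directories that fall in the wanted window (e.g. the last 3 months).
    for (int i = 0; i < args.length - 1; i++) {
      FileInputFormat.addInputPath(job, new Path(args[i]));
    }
    FileOutputFormat.setOutputPath(job, new Path(args[args.length - 1]));

    // Set your mapper/reducer and output key/value classes here as usual.

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Pig does much the same split combination for you under the hood, which is why I suggested it above.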