Hi Marko,

I think there are two major problems you should care about:
1. NameNode memory
2. Job overhead

To avoid 1, I suggest storing your data in an external store such as HBase, S3, or Cassandra rather than in HDFS. For details, please refer to the following:

https://www.usenix.org/legacy/publications/login/2010-04/openpdfs/shvachko.pdf

For 2, you may want to use a higher-level language like Pig, which will automatically combine a bunch of your small inputs into larger splits (via CombineFileInputFormat):

http://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html

I have appended two small sketches below the quoted message to illustrate both points.

Thanks,
Takenori

On Fri, Apr 24, 2015 at 5:53 PM, Marko Dinic <[email protected]> wrote:
> Hello,
>
> I'm not sure if this is the place to ask this question, but I'm still
> hoping for an answer/advice.
>
> A large number of small files are uploaded, each about 8 KB. I am aware
> that this is not something you're hoping for when working with Hadoop.
>
> I was thinking about using HAR files and combined input, or sequence
> files. The problem is that the files are timestamped, and I need a
> different subset at different times - for example, one job needs to run
> on files uploaded during the last 3 months, while the next job might
> consider the last 6 months. Naturally, as time passes, a different
> subset of files is needed.
>
> This means that I would need to make a sequence file (or a HAR) each
> time I run a job, to have a smaller number of mappers. On the other
> hand, I need the original files so that I can subset them. This means
> that the NameNode is under constant pressure, keeping all of this in
> its memory.
>
> How can I solve this problem?
>
> I was also considering using Cassandra, or something like that, and
> saving the file content inside it, instead of saving it to files on
> HDFS. The file content is actually a measurement, that is, a vector of
> numbers, with some metadata.
>
> Thanks
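
For point 1, here is a rough sketch of what storing the measurements in something like HBase could look like. It assumes the HBase 1.0+ Java client; the table name "measurements", the column family "d", and the timestamp|sensor row key are only placeholders to illustrate the idea, not a recommendation for your schema. Because the row key starts with the timestamp, selecting "the last 3 months" becomes a simple row-range scan instead of re-packing files:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MeasurementStore {

  // Column family name is a placeholder.
  private static final byte[] CF = Bytes.toBytes("d");

  // Write one measurement. The row key starts with the timestamp so that
  // time-range subsets become row-range scans.
  public static void write(Table table, String isoTimestamp, String sensorId,
                           byte[] vectorBytes) throws Exception {
    Put put = new Put(Bytes.toBytes(isoTimestamp + "|" + sensorId));
    put.addColumn(CF, Bytes.toBytes("vector"), vectorBytes);
    put.addColumn(CF, Bytes.toBytes("sensor"), Bytes.toBytes(sensorId));
    table.put(put);
  }

  // Read everything uploaded between two timestamps, e.g. the last 3 months.
  public static void scanRange(Table table, String from, String to) throws Exception {
    Scan scan = new Scan(Bytes.toBytes(from), Bytes.toBytes(to));
    try (ResultScanner scanner = table.getScanner(scan)) {
      for (Result row : scanner) {
        byte[] vector = row.getValue(CF, Bytes.toBytes("vector"));
        // feed 'vector' to your job or analysis code
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("measurements"))) {
      write(table, "2015-04-24T17:53:00", "sensor-42", new byte[]{1, 2, 3});
      scanRange(table, "2015-01-01", "2015-04-01");
    }
  }
}

Cassandra would work similarly with a clustering key on the timestamp; the point is just that a time-ordered key lets each job pick its own window without keeping millions of tiny files in the NameNode's memory.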

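For point 2, if you stay on HDFS for now, a driver along these lines should cut the number of mappers. This is only a sketch using the new-API CombineTextInputFormat (the mapreduce.lib.input counterpart of the class linked above); the 128 MB split cap and the date-partitioned input directories are assumptions you would tune to your data:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinedSmallFilesJob {
  // args: one or more input directories, then the output directory.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "combined-small-files");
    job.setJarByClass(CombinedSmallFilesJob.class);

    // Pack many small files into each split, so one mapper processes many
    // files instead of one mapper per 8 KB file.
    job.setInputFormatClass(CombineTextInputFormat.class);
    // Cap each combined split at ~128 MB (tune to your block size).
    CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

    // If the files live in date-partitioned directories, add only the
    // directories that fall in the wanted window (e.g. the last 3 months).
    for (int i = 0; i < args.length - 1; i++) {
      FileInputFormat.addInputPath(job, new Path(args[i]));
    }
    FileOutputFormat.setOutputPath(job, new Path(args[args.length - 1]));

    // Set your mapper/reducer and output key/value classes here as usual.

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Pig does much the same split combination for you under the hood, which is why I suggested it above.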