Hello,
I'm not sure if this is the right place to ask this question, but I'm
still hoping for an answer or some advice.
A large number of small files, each about 8 KB, are uploaded. I am aware
that this is not something you want when working with Hadoop.
I was thinking about using HAR files and combined input
(CombineFileInputFormat), or sequence files. The problem is that the
files are timestamped, and I need a different subset at different times.
For example, one job needs to run on files uploaded during the last 3
months, while the next job might consider the last 6 months. Naturally,
as time passes, a different subset of files is needed.
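
To make the sequence file idea concrete, here is a minimal sketch of the
packing step I have in mind, where each small file becomes one record
keyed by its (timestamped) name. The class name and paths are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path inputDir = new Path(args[0]); // directory of small ~8 KB files
            Path packed = new Path(args[1]);   // one sequence file for this job's window

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(packed),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                for (FileStatus status : fs.listStatus(inputDir)) {
                    byte[] content = new byte[(int) status.getLen()];
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        IOUtils.readFully(in, content, 0, content.length);
                    }
                    // key = original file name (carries the timestamp), value = raw bytes
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(content));
                }
            }
        }
    }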
This means that I would need to build a new sequence file (or HAR) each
time I run a job, just to get a smaller number of mappers. On the other
hand, I need to keep the original files so I can subset them. This means
the NameNode is under constant pressure, holding the metadata for all of
those small files in its memory.
How can I solve this problem?
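
To make the time windowing concrete, the per-run input selection would
look something like this, assuming a hypothetical date-partitioned
layout like /data/yyyy/MM (which I don't have yet):

    import java.io.IOException;
    import java.time.YearMonth;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class MonthlyInputs {
        // Adds the last `months` month directories as job input,
        // e.g. /data/2014/05, /data/2014/04, ... (layout is hypothetical).
        public static void addLastMonths(Job job, int months) throws IOException {
            YearMonth current = YearMonth.now();
            for (int i = 0; i < months; i++) {
                YearMonth ym = current.minusMonths(i);
                FileInputFormat.addInputPath(job, new Path(
                        String.format("/data/%04d/%02d", ym.getYear(), ym.getMonthValue())));
            }
        }
    }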
I was also considering using Cassandra, or something like that, and
saving the file content in it instead of saving it to files on HDFS. The
file content is actually a measurement, that is, a vector of numbers
with some metadata.
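If I went that way, I imagine a table roughly like this (just a sketch
with made-up keyspace/table/column names, using the DataStax Java
driver):

    import com.datastax.oss.driver.api.core.CqlSession;

    public class MeasurementSchema {
        public static void main(String[] args) {
            // Connects to a local node by default; all names below are made up.
            try (CqlSession session = CqlSession.builder().build()) {
                session.execute("CREATE KEYSPACE IF NOT EXISTS sensors WITH replication = "
                        + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
                session.execute("CREATE TABLE IF NOT EXISTS sensors.measurements ("
                        + "  source_id   text,"
                        + "  measured_at timestamp,"
                        + "  readings    list<double>,"  // the vector of numbers
                        + "  metadata    map<text, text>,"
                        + "  PRIMARY KEY (source_id, measured_at)"
                        + ") WITH CLUSTERING ORDER BY (measured_at DESC)");
            }
        }
    }

Clustering by measured_at would, I think, turn the "last N months"
selection into a simple range query instead of a file-packing step.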
Thanks