RE: Large number of small files

Chandra Mohan, Ananda Vel Murugan Fri, 24 Apr 2015 02:36:02 -0700

Marko,

The Parquet file would be created once, when you load the data. You don't have 
to keep your small files in HDFS just so that you can subset the data by time 
range. You can store data and metadata in the same Parquet file. As already 
pointed out, Parquet files work well with other tools in the Hadoop ecosystem. 
Apart from the performance of your MapReduce jobs, the other aspect is storage 
efficiency. Serialization formats like Avro and Parquet provide better 
compression, and hence the data occupies less space.
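
To illustrate the idea of keeping each measurement and its metadata in one 
record and subsetting by time at read time, here is a minimal sketch. It uses 
plain Python in place of Parquet, so it only shows the shape of the approach, 
not the Parquet API. The names Measurement, sensor_id and select_since, and the 
sample data, are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Measurement:
    """One former small file: metadata and the measurement vector together."""
    uploaded_at: datetime        # upload timestamp (metadata)
    sensor_id: str               # hypothetical metadata field
    values: Tuple[float, ...]    # the vector of numbers


def select_since(records: Iterable[Measurement], now: datetime, days: int) -> List[Measurement]:
    """Return the records uploaded within the last `days` days before `now`."""
    cutoff = now - timedelta(days=days)
    return [r for r in records if cutoff <= r.uploaded_at <= now]


# Stand-in for the single consolidated data set, written once at load time.
now = datetime(2015, 4, 24)
table = [
    Measurement(datetime(2015, 4, 1), "s1", (1.0, 2.0)),
    Measurement(datetime(2015, 2, 10), "s2", (3.0, 4.0)),
    Measurement(datetime(2014, 12, 5), "s1", (5.0, 6.0)),
    Measurement(datetime(2014, 6, 1), "s3", (7.0, 8.0)),
]

# Each job picks its own window from the same data; nothing is re-serialized.
last_3_months = select_since(table, now, days=90)
last_6_months = select_since(table, now, days=180)
```

Both jobs read the same consolidated data and differ only in the filter, so 
the original small files do not need to stay in HDFS for subsetting.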

Regards,
Anand

From: Alexander Alten-Lorenz [mailto:[email protected]]
Sent: Friday, April 24, 2015 2:49 PM
To: [email protected]
Subject: Re: Large number of small files

Marko,

Cassandra is a NoSQL DB, as HBase is for Hadoop. The pros and cons won't be 
discussed here.

Parquet is a columnar storage format. At a high level it is a bit like a 
NoSQL DB, but at the storage level. It allows users to "query" the data with 
MR, Pig or similar tools. Additionally, Parquet works perfectly with Hive and 
Cloudera Impala as well as Apache Drill.

https://parquet.incubator.apache.org/documentation/latest/
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v2-0-x/topics/impala_parquet.html
https://zoomdata.zendesk.com/hc/en-us/articles/200865073-Loading-My-CSV-Data-into-Impala-as-a-Parquet-Table

--
Alexander Alten-Lorenz
m: [email protected]<mailto:[email protected]>
b: mapredit.blogspot.com<http://mapredit.blogspot.com>

On Apr 24, 2015, at 11:10 AM, Marko Dinic 
<[email protected]<mailto:[email protected]>> wrote:

Anand,

Thank you for your answer, but wouldn't that mean that I would have to 
serialize the files each time I need to run the job? And I would still need to 
save the original files, so the NameNode still needs to take care of them?

Please correct me if I'm missing something, I'm not very experienced with 
Hadoop.

What do you think about using Cassandra?

Thanks

On Fri 24 Apr 2015 11:03:19 AM CEST, Chandra Mohan, Ananda Vel Murugan wrote:

Apart from databases like Cassandra, you could also look at serialization 
formats like Avro or Parquet.

Regards,
Anand

-----Original Message-----
From: Marko Dinic [mailto:[email protected]]
Sent: Friday, April 24, 2015 2:23 PM
To: [email protected]<mailto:[email protected]>
Subject: Large number of small files

Hello,

I'm not sure if this is the place to ask this question, but I'm still hoping 
for an answer/advice.

A large number of small files are uploaded, each about 8 KB. I am aware that 
this is not something you're hoping for when working with Hadoop.

I was thinking about using HAR files and combined input, or sequence files. The 
problem is that the files are timestamped, and I need a different subset at 
different times. For example, one job needs to run on files that were uploaded 
during the last 3 months, while the next job might consider the last 6 months. 
Naturally, as time passes, a different subset of files is needed.

This means that I would need to make a sequence file (or a HAR) each time I run 
a job, to have a smaller number of mappers. On the other hand, I need the 
original files so I can subset them. This means that the NameNode is under 
constant pressure, keeping the metadata for all of them in its memory.
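
For a rough sense of the memory pressure described above, here is a 
back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of 
about 150 bytes of NameNode heap per file or block object, a 128 MB block size, 
and a hypothetical count of 10 million files. None of these numbers come from 
this thread, so treat the result as an order-of-magnitude estimate only.

```python
# Assumed rule of thumb: ~150 bytes of NameNode heap per file or block object.
BYTES_PER_OBJECT = 150
BLOCK_SIZE = 128 * 1024 * 1024   # assumed HDFS block size (128 MB)
FILE_SIZE = 8 * 1024             # about 8 KB per small file, as in the question
NUM_FILES = 10_000_000           # hypothetical file count, for illustration


def namenode_heap_bytes(num_files: int, blocks_per_file: int) -> int:
    """Estimate heap use: one object per file plus one object per block."""
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT


# Every small file occupies its own block, however little data it holds.
small_files = namenode_heap_bytes(NUM_FILES, blocks_per_file=1)

# The same bytes packed into one large file, split into full-size blocks.
total_bytes = NUM_FILES * FILE_SIZE
blocks = -(-total_bytes // BLOCK_SIZE)   # ceiling division
packed = namenode_heap_bytes(1, blocks_per_file=blocks)

print(f"small files: {small_files / 1e9:.1f} GB of heap")
print(f"one packed file: {packed / 1e3:.1f} KB of heap")
```

Under these assumptions the small files cost gigabytes of heap while the 
packed layout costs kilobytes, which is why consolidating once matters more 
than the format chosen for it.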

How can I solve this problem?

I was also considering using Cassandra, or something like that, and saving the 
file content inside it instead of saving it to files on HDFS. The file content 
is actually a measurement, that is, a vector of numbers with some metadata.

Thanks
