I have had experience with MapR where small files are much worse. Agreed, MapR can
store (and only store) small files better, but storing is not the answer.
What happens when you want to run a job?
A container stores the files and the container gets replicated, which means one
container (of 256 MB or 128 MB or whatever size it is configured to be) is
replicated. The moment you start an M/R job (and don't use CombineFileInputFormat)
you are actually launching tasks on only those three nodes because of data
locality.
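
As a minimal sketch of what I mean, using the stock CombineTextInputFormat from
org.apache.hadoop.mapreduce.lib.input (the identity mapper and the 256 MB split
size are just placeholders for illustration, not a recommendation):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFiles {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "combine-small-files");
    job.setJarByClass(CombineSmallFiles.class);

    // Pack many small files into each split instead of one split per file,
    // so map tasks spread across the cluster rather than being pinned to
    // the three nodes holding the container's replicas.
    job.setInputFormatClass(CombineTextInputFormat.class);
    CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024); // example: 256 MB splits

    job.setMapperClass(Mapper.class); // identity mapper, placeholder only
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}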

Small files are bad with Hadoop and worse with MapR when you want to run a job,
even though MapR is good at storing them.


Sam

From: MBA <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Sunday, July 20, 2014 at 9:54 PM
To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Subject: Re: Merging small files

It isn’t? I don’t wanna hijack the thread or anything, but it seems to me that
MapR is an implementation of Hadoop, and this is a great place to discuss its
merits vis-à-vis the Hortonworks or Cloudera offerings.

A little bit more on topic: every single thing I read or watch about Hadoop
says that many small files are a bad idea and that you should merge them into
larger files. I’ll take this a step further: if your invoice data is so small,
perhaps Hadoop isn’t the proper solution to whatever it is you are trying to do,
and a more traditional RDBMS approach would be more appropriate. Someone
suggested HBase, and I was going to suggest one of the other NoSQL databases;
however, I remembered that Eddie Satterly of Splunk says that financial data is
the ONE use case where a traditional approach is more appropriate. You can watch
his talk here:

https://www.youtube.com/watch?v=-N9i-YXoQBE&index=77&list=WL

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba

From: Kilaru, Sambaiah <[email protected]>
Sent: Sunday, July 20, 2014 3:47 AM
To: [email protected]
Subject: Re: Merging small files

This is not the place to discuss the merits or demerits of MapR, but small files
screw things up very badly with MapR. The small files all go into one container
(until it fills up its 256 MB, or whatever the container size is), and with
locality most of the mappers go to the same three datanodes.

You should be looking into the SequenceFile format.
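
As a rough sketch of what that looks like with the plain
org.apache.hadoop.io.SequenceFile API (the file-name-as-key, raw-bytes-as-value
layout here is just one common convention, not the only option):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class MergeToSequenceFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inDir = new Path(args[0]);   // directory full of small files
    Path outFile = new Path(args[1]); // single large SequenceFile

    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(outFile),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class));
    try {
      for (FileStatus st : fs.listStatus(inDir)) {
        if (st.isDirectory()) continue;  // skip subdirectories
        byte[] buf = new byte[(int) st.getLen()];
        FSDataInputStream in = fs.open(st.getPath());
        try {
          in.readFully(0, buf);          // small file, read it whole
        } finally {
          in.close();
        }
        // key = original file name, value = raw file contents
        writer.append(new Text(st.getPath().getName()), new BytesWritable(buf));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}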

Thanks,
Sam

From: "M. C. Srivas" <[email protected]<mailto:[email protected]>>
Reply-To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Date: Sunday, July 20, 2014 at 8:01 AM
To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Subject: Re: Merging small files

You should look at MapR .... a few hundreds of billions of small files are
absolutely no problem. (disc: I work for MapR)


On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao
<[email protected]> wrote:
Hi ,

Has anybody worked on a retail use case? My production Hadoop cluster's block
size is 256 MB, but each retail invoice file we have to process is tiny, let's
say around 4 KB. Do we merge the invoice data to make one large file, say 1 GB?
What is the best practice in this scenario?


Regards
Shashi
