Spring Batch is used to process files that arrive in EDI, CSV, and XML formats and store them in Oracle after processing, but that is for a very small division. Imagine invoices generated by roughly 5 million customers every week across all stores, plus online purchases. The time to process such massive data would not be acceptable, even though Oracle would be a good choice, as Adaryl Bob has suggested. Each invoice is not even 10 KB, and we have no choice but to use Hadoop, but the input files need further processing just to make Hadoop happy.
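To make "further processing" concrete, below is a minimal sketch of what I have in mind: packing the small invoice files into one Hadoop SequenceFile, keyed by the original file name. The HDFS output path and the local "invoices-in" directory are placeholders for wherever our upstream jobs actually drop the files, so treat this as an illustration, not production code.

import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class InvoiceMerger {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder output path on HDFS.
        Path out = new Path("hdfs:///invoices/merged/invoices.seq");

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK));
        try {
            // "invoices-in" stands in for wherever the small EDI/CSV/XML
            // files land after upstream processing.
            File[] smallFiles = new File("invoices-in").listFiles();
            if (smallFiles != null) {
                for (File f : smallFiles) {
                    byte[] bytes = Files.readAllBytes(f.toPath());
                    // Key = original file name, value = raw invoice bytes.
                    writer.append(new Text(f.getName()), new BytesWritable(bytes));
                }
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}

With BLOCK compression the per-record overhead stays low, and the merged file still splits across mappers, so a week's worth of 4 KB invoices becomes a handful of large files instead of millions of tiny ones.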
On Sun, Jul 20, 2014 at 10:07 PM, Adaryl "Bob" Wakefield, MBA <[email protected]> wrote:

> “Even if we kept the discussion to the mailing list's technical Hadoop
> usage focus, any company/organization looking to use a distro is going to
> have to consider the costs, support, platform, partner ecosystem, market
> share, company strategy, etc.”
>
> Yeah, good point.
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
>
> *From:* Shahab Yunus <[email protected]>
> *Sent:* Sunday, July 20, 2014 11:32 AM
> *To:* [email protected]
> *Subject:* Re: Merging small files
>
> As for why it isn't appropriate to discuss overly vendor-specific topics
> on a vendor-neutral Apache mailing list, check out this thread:
>
> http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201309.mbox/%3ccaj1nbzcocw1rsncf3h-ikjkk4uqxqxt7avsj-6nahq_e4dx...@mail.gmail.com%3E
>
> You can always discuss vendor-specific issues in their respective mailing
> lists.
>
> As for merging files: yes, one can use HBase, but then you have to keep in
> mind that you are adding the overhead of developing and maintaining
> another store (i.e. HBase). If your use case can be satisfied with HDFS
> alone, then why not keep it simple? And given the knowledge of the
> requirements that the OP provided, I think the Sequence File format should
> work, as I suggested initially. Of course, if things get too complicated
> from a requirements perspective, then one might try out HBase.
>
> Regards,
> Shahab
>
>
> On Sun, Jul 20, 2014 at 12:24 PM, Adaryl "Bob" Wakefield, MBA <[email protected]> wrote:
>
>> It isn't? I don't want to hijack the thread or anything, but it seems to
>> me that MapR is an implementation of Hadoop, and this is a great place to
>> discuss its merits vis-à-vis the Hortonworks or Cloudera offerings.
>>
>> A little bit more on topic: every single thing I read or watch about
>> Hadoop says that many small files are a bad idea and that you should merge
>> them into larger files. I'll take this a step further: if your invoice
>> data is so small, perhaps Hadoop isn't the proper solution to whatever it
>> is you are trying to do, and a more traditional RDBMS approach would be
>> more appropriate. Someone suggested HBase, and I was going to suggest
>> maybe one of the other NoSQL databases; however, I remember that Eddie
>> Satterly of Splunk says that financial data is the ONE use case where a
>> traditional approach is more appropriate. You can watch his talk here:
>>
>> https://www.youtube.com/watch?v=-N9i-YXoQBE&index=77&list=WL
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>>
>> *From:* Kilaru, Sambaiah <[email protected]>
>> *Sent:* Sunday, July 20, 2014 3:47 AM
>> *To:* [email protected]
>> *Subject:* Re: Merging small files
>>
>> This is not the place to discuss the merits or demerits of MapR. Small
>> files screw up very badly with MapR: small files go into one container
>> (to fill up 256 MB, or whatever the container size is), and with locality
>> most of the mappers go to three datanodes.
>>
>> You should be looking into the Sequence File format.
>>
>> Thanks,
>> Sam
>>
>> From: "M. C. Srivas" <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Sunday, July 20, 2014 at 8:01 AM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Merging small files
>>
>> You should look at MapR .... a few hundred billion small files is
>> absolutely no problem.
>> (Disclosure: I work for MapR.)
>>
>>
>> On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Has anybody worked on a retail use case? If my production Hadoop
>>> cluster block size is 256 MB, but we have to process retail invoice
>>> data where each invoice is merely, let's say, 4 KB, do we merge the
>>> invoice data to make one large file, say 1 GB? What is the best
>>> practice in this scenario?
>>>
>>>
>>> Regards,
>>> Shashi
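P.S. For completeness, here is a companion sketch (same placeholder path as above, same caveats) showing that the original file names and payloads survive the merge and can be read back record by record:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class InvoiceReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same placeholder path the writer sketch used.
        Path in = new Path("hdfs:///invoices/merged/invoices.seq");

        SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(in));
        try {
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            // Each record is one original invoice file: name -> bytes.
            while (reader.next(key, value)) {
                System.out.println(key + " -> " + value.getLength() + " bytes");
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}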
