Bob, you don't have to wait for batch. Here is my project (under development) where I am using Storm for continuous file processing: https://github.com/markkerzner/3VEed

Mark
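A minimal sketch of what such a continuous-file topology might look like, assuming Storm 1.x APIs; DirSpout, PrintBolt, and the landing path are illustrative stand-ins, not code from the 3VEed repo:

import java.io.File;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class ContinuousFileTopology {

    // Spout that polls a landing directory and emits each new file path once,
    // so files are processed as they arrive rather than in a nightly batch.
    public static class DirSpout extends BaseRichSpout {
        private final String dir;
        private final Set<String> seen = new HashSet<String>();
        private SpoutOutputCollector collector;

        public DirSpout(String dir) { this.dir = dir; }

        @Override
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            File[] files = new File(dir).listFiles();
            if (files == null) return;
            for (File f : files) {
                if (f.isFile() && seen.add(f.getPath())) {
                    collector.emit(new Values(f.getPath()));
                }
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("path"));
        }
    }

    // Stand-in for real parsing logic: each tuple is one newly arrived file.
    public static class PrintBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println("processing " + tuple.getStringByField("path"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("files", new DirSpout("/data/landing"), 1);
        builder.setBolt("process", new PrintBolt(), 4).shuffleGrouping("files");

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("invoice-stream", new Config(), builder.createTopology());
        Thread.sleep(60000); // run briefly in local mode for the sketch
        cluster.shutdown();
    }
}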
On Sun, Jul 20, 2014 at 1:31 PM, Adaryl "Bob" Wakefield, MBA <[email protected]> wrote:

> Yeah, I'm sorry, I'm not talking about processing the files in Oracle. I
> mean collect/store invoices in Oracle, then flush them in a batch to Hadoop.
> This is not real time, right? So you take your EDI, CSV, and XML from their
> sources. Store them in Oracle. Once you have a decent size, flush them to
> Hadoop in one big file, process them, then store the results of the
> processing in Oracle.
>
> Source file –> Oracle –> Hadoop –> Oracle
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
>
> *From:* Shashidhar Rao <[email protected]>
> *Sent:* Sunday, July 20, 2014 12:47 PM
> *To:* [email protected]
> *Subject:* Re: Merging small files
>
> Spring Batch is used to process the files, which come in EDI, CSV, and XML
> formats, and store them in Oracle after processing, but this is for a very
> small division. Imagine invoices generated by roughly 5 million customers
> every week from all stores, plus online purchases. The time to process
> such massive data would not be acceptable, even though Oracle would be a
> good choice as Adaryl Bob has suggested. Each invoice is not even 10 KB,
> and we have no choice but to use Hadoop, but we need further processing of
> the input files just to make Hadoop happy.
>
>
> On Sun, Jul 20, 2014 at 10:07 PM, Adaryl "Bob" Wakefield, MBA <
> [email protected]> wrote:
>
>> "Even if we kept the discussion to the mailing list's technical Hadoop
>> usage focus, any company/organization looking to use a distro is going to
>> have to consider the costs, support, platform, partner ecosystem, market
>> share, company strategy, etc."
>>
>> Yeah, good point.
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>>
>> *From:* Shahab Yunus <[email protected]>
>> *Sent:* Sunday, July 20, 2014 11:32 AM
>> *To:* [email protected]
>> *Subject:* Re: Merging small files
>>
>> On why it isn't appropriate to discuss vendor-specific topics too much
>> on a vendor-neutral Apache mailing list, check out this thread:
>>
>> http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201309.mbox/%3ccaj1nbzcocw1rsncf3h-ikjkk4uqxqxt7avsj-6nahq_e4dx...@mail.gmail.com%3E
>>
>> You can always discuss vendor-specific issues in their respective mailing
>> lists.
>>
>> As for merging files: yes, one can use HBase, but then you have to keep in
>> mind that you are adding the overhead of development and maintenance of
>> another store (i.e., HBase). If your use case can be satisfied with HDFS
>> alone, then why not keep it simple? And given the knowledge of the
>> requirements that the OP provided, I think the SequenceFile format should
>> work, as I suggested initially. Of course, if things get too complicated
>> from a requirements perspective, then one might try out HBase.
>>
>> Regards,
>> Shahab
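A minimal sketch of the SequenceFile packing Shahab suggests, assuming Hadoop 2.x APIs; the input directory and output path are illustrative. Each small invoice becomes one (filename, bytes) record in a single large, block-compressed SequenceFile:

import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class InvoicePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up HDFS settings from the classpath
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/data/invoices.seq")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                // block compression amortizes well over many tiny records
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK));
        try {
            File[] invoices = new File("/incoming/invoices").listFiles();
            if (invoices != null) {
                for (File f : invoices) {
                    byte[] bytes = Files.readAllBytes(f.toPath());
                    // key = original file name, value = raw file contents
                    writer.append(new Text(f.getName()), new BytesWritable(bytes));
                }
            }
        } finally {
            writer.close();
        }
    }
}

A downstream job then reads the one big file with SequenceFileInputFormat instead of opening millions of 4 KB files.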
>> On Sun, Jul 20, 2014 at 12:24 PM, Adaryl "Bob" Wakefield, MBA <
>> [email protected]> wrote:
>>
>>> It isn't? I don't want to hijack the thread or anything, but it seems to
>>> me that MapR is an implementation of Hadoop, and this is a great place to
>>> discuss its merits vis-à-vis the Hortonworks or Cloudera offerings.
>>>
>>> A little bit more on topic: every single thing I read or watch about
>>> Hadoop says that many small files are a bad idea and that you should merge
>>> them into larger files. I'll take this a step further. If your invoice data
>>> is so small, perhaps Hadoop isn't the proper solution to whatever it is you
>>> are trying to do, and a more traditional RDBMS approach would be more
>>> appropriate. Someone suggested HBase, and I was going to suggest maybe one
>>> of the other NoSQL databases; however, I remember that Eddie Satterly of
>>> Splunk says that financial data is the ONE use case where a traditional
>>> approach is more appropriate. You can watch his talk here:
>>>
>>> https://www.youtube.com/watch?v=-N9i-YXoQBE&index=77&list=WL
>>>
>>> Adaryl "Bob" Wakefield, MBA
>>> Principal
>>> Mass Street Analytics
>>> 913.938.6685
>>> www.linkedin.com/in/bobwakefieldmba
>>>
>>> *From:* Kilaru, Sambaiah <[email protected]>
>>> *Sent:* Sunday, July 20, 2014 3:47 AM
>>> *To:* [email protected]
>>> *Subject:* Re: Merging small files
>>>
>>> This is not the place to discuss the merits or demerits of MapR. Small
>>> files behave very badly with MapR too: small files go into one container
>>> (filling up 256 MB, or whatever the container size is), and with locality,
>>> most of the mappers go to three datanodes.
>>>
>>> You should be looking into the SequenceFile format.
>>>
>>> Thanks,
>>> Sam
>>>
>>> From: "M. C. Srivas" <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Sunday, July 20, 2014 at 8:01 AM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: Merging small files
>>>
>>> You should look at MapR .... a few hundred billion small files is
>>> absolutely no problem. (disc: I work for MapR)
>>>
>>>
>>> On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <
>>> [email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Has anybody worked on a retail use case? My production Hadoop cluster's
>>>> block size is 256 MB, but if we have to process retail invoice data,
>>>> each invoice is merely, let's say, 4 KB. Do we merge the invoice data to
>>>> make one large file, say 1 GB? What is the best practice in this
>>>> scenario?
>>>>
>>>>
>>>> Regards,
>>>> Shashi
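For the read side of Shashi's question, a minimal sketch of the other common answer, assuming Hadoop 2.x: leave the small text files as they are and let CombineTextInputFormat pack them into roughly block-sized splits at job time. The paths are illustrative, and the 256 MB figure just matches the block size mentioned above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvoiceJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "invoice-processing");
        job.setJarByClass(InvoiceJob.class);

        // Pack many small files into ~256 MB splits so each mapper gets a
        // block's worth of data instead of one mapper per 4 KB invoice.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

        job.setMapperClass(Mapper.class); // identity mapper as a placeholder
        job.setNumReduceTasks(0);         // map-only for the sketch

        FileInputFormat.addInputPath(job, new Path("/data/invoices"));
        FileOutputFormat.setOutputPath(job, new Path("/data/invoices-out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

This removes the per-file mapper overhead, but it does not fix the NameNode memory pressure of keeping millions of tiny files in HDFS, which is why packing them into SequenceFiles (or a store like HBase) still matters at the scale Shashidhar describes.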
