Bob, you don't have to wait for batch. Here is my project (under development) where I am using Storm for continuous file processing: https://github.com/markkerzner/3VEed

Mark
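A minimal sketch of what such a continuous-file topology might look like, assuming Storm 1.x APIs; DirSpout, PrintBolt, and the landing path are illustrative stand-ins, not code from the 3VEed repo:

import java.io.File;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class ContinuousFileTopology {

    // Spout that polls a landing directory and emits each new file path once,
    // so files are processed as they arrive rather than in a nightly batch.
    public static class DirSpout extends BaseRichSpout {
        private final String dir;
        private final Set<String> seen = new HashSet<String>();
        private SpoutOutputCollector collector;

        public DirSpout(String dir) { this.dir = dir; }

        @Override
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            File[] files = new File(dir).listFiles();
            if (files == null) return;
            for (File f : files) {
                if (f.isFile() && seen.add(f.getPath())) {
                    collector.emit(new Values(f.getPath()));
                }
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("path"));
        }
    }

    // Stand-in for real parsing logic: each tuple is one newly arrived file.
    public static class PrintBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println("processing " + tuple.getStringByField("path"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("files", new DirSpout("/data/landing"), 1);
        builder.setBolt("process", new PrintBolt(), 4).shuffleGrouping("files");

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("invoice-stream", new Config(), builder.createTopology());
        Thread.sleep(60000); // run briefly in local mode for the sketch
        cluster.shutdown();
    }
}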
On Sun, Jul 20, 2014 at 1:31 PM, Adaryl "Bob" Wakefield, MBA <[email protected]> wrote:

> Yeah, I'm sorry, I'm not talking about processing the files in Oracle. I
> mean collect/store invoices in Oracle, then flush them in a batch to Hadoop.
> This is not real time, right? So you take your EDI, CSV, and XML from their
> sources. Store them in Oracle. Once you have a decent size, flush them to
> Hadoop in one big file, process them, then store the results of the
> processing in Oracle.
>
> Source file –> Oracle –> Hadoop –> Oracle
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
>
> *From:* Shashidhar Rao <[email protected]>
> *Sent:* Sunday, July 20, 2014 12:47 PM
> *To:* [email protected]
> *Subject:* Re: Merging small files
>
> Spring Batch is used to process the files, which come in EDI, CSV, and XML
> formats, and store them in Oracle after processing, but this is for a very
> small division. Imagine invoices generated by roughly 5 million customers
> every week from all stores, plus online purchases. The time to process
> such massive data would not be acceptable, even though Oracle would be a
> good choice as Adaryl Bob has suggested. Each invoice is not even 10 KB,
> and we have no choice but to use Hadoop, but we need further processing of
> the input files just to make Hadoop happy.
>
>
> On Sun, Jul 20, 2014 at 10:07 PM, Adaryl "Bob" Wakefield, MBA <
> [email protected]> wrote:
>
>> "Even if we kept the discussion to the mailing list's technical Hadoop
>> usage focus, any company/organization looking to use a distro is going to
>> have to consider the costs, support, platform, partner ecosystem, market
>> share, company strategy, etc."
>>
>> Yeah, good point.
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>>
>> *From:* Shahab Yunus <[email protected]>
>> *Sent:* Sunday, July 20, 2014 11:32 AM
>> *To:* [email protected]
>> *Subject:* Re: Merging small files
>>
>> On why it isn't appropriate to discuss vendor-specific topics too much
>> on a vendor-neutral Apache mailing list, check out this thread:
>>
>> http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201309.mbox/%3ccaj1nbzcocw1rsncf3h-ikjkk4uqxqxt7avsj-6nahq_e4dx...@mail.gmail.com%3E
>>
>> You can always discuss vendor-specific issues in their respective mailing
>> lists.
>>
>> As for merging files: yes, one can use HBase, but then you have to keep in
>> mind that you are adding the overhead of development and maintenance of
>> another store (i.e., HBase). If your use case can be satisfied with HDFS
>> alone, then why not keep it simple? And given the knowledge of the
>> requirements that the OP provided, I think the SequenceFile format should
>> work, as I suggested initially. Of course, if things get too complicated
>> from a requirements perspective, then one might try out HBase.
>>
>> Regards,
>> Shahab
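A minimal sketch of the SequenceFile packing Shahab suggests, assuming Hadoop 2.x APIs; the input directory and output path are illustrative. Each small invoice becomes one (filename, bytes) record in a single large, block-compressed SequenceFile:

import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class InvoicePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up HDFS settings from the classpath
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/data/invoices.seq")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                // block compression amortizes well over many tiny records
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK));
        try {
            File[] invoices = new File("/incoming/invoices").listFiles();
            if (invoices != null) {
                for (File f : invoices) {
                    byte[] bytes = Files.readAllBytes(f.toPath());
                    // key = original file name, value = raw file contents
                    writer.append(new Text(f.getName()), new BytesWritable(bytes));
                }
            }
        } finally {
            writer.close();
        }
    }
}

A downstream job then reads the one big file with SequenceFileInputFormat instead of opening millions of 4 KB files.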
>> On Sun, Jul 20, 2014 at 12:24 PM, Adaryl "Bob" Wakefield, MBA <
>> [email protected]> wrote:
>>
>>> It isn't? I don't want to hijack the thread or anything, but it seems to
>>> me that MapR is an implementation of Hadoop, and this is a great place to
>>> discuss its merits vis-à-vis the Hortonworks or Cloudera offerings.
>>>
>>> A little bit more on topic: every single thing I read or watch about
>>> Hadoop says that many small files are a bad idea and that you should merge
>>> them into larger files. I'll take this a step further. If your invoice data
>>> is so small, perhaps Hadoop isn't the proper solution to whatever it is you
>>> are trying to do, and a more traditional RDBMS approach would be more
>>> appropriate. Someone suggested HBase, and I was going to suggest maybe one
>>> of the other NoSQL databases; however, I remember that Eddie Satterly of
>>> Splunk says that financial data is the ONE use case where a traditional
>>> approach is more appropriate. You can watch his talk here:
>>>
>>> https://www.youtube.com/watch?v=-N9i-YXoQBE&index=77&list=WL
>>>
>>> Adaryl "Bob" Wakefield, MBA
>>> Principal
>>> Mass Street Analytics
>>> 913.938.6685
>>> www.linkedin.com/in/bobwakefieldmba
>>>
>>> *From:* Kilaru, Sambaiah <[email protected]>
>>> *Sent:* Sunday, July 20, 2014 3:47 AM
>>> *To:* [email protected]
>>> *Subject:* Re: Merging small files
>>>
>>> This is not the place to discuss the merits or demerits of MapR. Small
>>> files behave very badly with MapR too: small files go into one container
>>> (filling up 256 MB, or whatever the container size is), and with locality,
>>> most of the mappers go to three datanodes.
>>>
>>> You should be looking into the SequenceFile format.
>>>
>>> Thanks,
>>> Sam
>>>
>>> From: "M. C. Srivas" <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Sunday, July 20, 2014 at 8:01 AM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: Merging small files
>>>
>>> You should look at MapR .... a few hundred billion small files is
>>> absolutely no problem. (disc: I work for MapR)
>>>
>>>
>>> On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <
>>> [email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Has anybody worked on a retail use case? My production Hadoop cluster's
>>>> block size is 256 MB, but if we have to process retail invoice data,
>>>> each invoice is merely, let's say, 4 KB. Do we merge the invoice data to
>>>> make one large file, say 1 GB? What is the best practice in this
>>>> scenario?
>>>>
>>>>
>>>> Regards,
>>>> Shashi
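For the read side of Shashi's question, a minimal sketch of the other common answer, assuming Hadoop 2.x: leave the small text files as they are and let CombineTextInputFormat pack them into roughly block-sized splits at job time. The paths are illustrative, and the 256 MB figure just matches the block size mentioned above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvoiceJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "invoice-processing");
        job.setJarByClass(InvoiceJob.class);

        // Pack many small files into ~256 MB splits so each mapper gets a
        // block's worth of data instead of one mapper per 4 KB invoice.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

        job.setMapperClass(Mapper.class); // identity mapper as a placeholder
        job.setNumReduceTasks(0);         // map-only for the sketch

        FileInputFormat.addInputPath(job, new Path("/data/invoices"));
        FileOutputFormat.setOutputPath(job, new Path("/data/invoices-out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

This removes the per-file mapper overhead, but it does not fix the NameNode memory pressure of keeping millions of tiny files in HDFS, which is why packing them into SequenceFiles (or a store like HBase) still matters at the scale Shashidhar describes.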
