Spring Batch is used to process files that arrive in EDI, CSV, and XML formats and store them in Oracle after processing, but that is for a very small division. Imagine invoices generated by roughly 5 million customers every week across all stores, plus online purchases. The time to process such massive data would not be acceptable, even though Oracle would be a good choice, as Adaryl Bob has suggested. Each invoice is not even 10 KB, and we have no choice but to use Hadoop, but the input files need further processing just to make Hadoop happy.
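To make "further processing" concrete, below is a minimal sketch of what I have in mind: packing the small invoice files into one Hadoop SequenceFile, keyed by the original file name. The HDFS output path and the local "invoices-in" directory are placeholders for wherever our upstream jobs actually drop the files, so treat this as an illustration, not production code.

import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class InvoiceMerger {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder output path on HDFS.
        Path out = new Path("hdfs:///invoices/merged/invoices.seq");

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK));
        try {
            // "invoices-in" stands in for wherever the small EDI/CSV/XML
            // files land after upstream processing.
            File[] smallFiles = new File("invoices-in").listFiles();
            if (smallFiles != null) {
                for (File f : smallFiles) {
                    byte[] bytes = Files.readAllBytes(f.toPath());
                    // Key = original file name, value = raw invoice bytes.
                    writer.append(new Text(f.getName()), new BytesWritable(bytes));
                }
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}

With BLOCK compression the per-record overhead stays low, and the merged file still splits across mappers, so a week's worth of 4 KB invoices becomes a handful of large files instead of millions of tiny ones.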
On Sun, Jul 20, 2014 at 10:07 PM, Adaryl "Bob" Wakefield, MBA <[email protected]> wrote:

> “Even if we kept the discussion to the mailing list's technical Hadoop
> usage focus, any company/organization looking to use a distro is going to
> have to consider the costs, support, platform, partner ecosystem, market
> share, company strategy, etc.”
>
> Yeah, good point.
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
>
> *From:* Shahab Yunus <[email protected]>
> *Sent:* Sunday, July 20, 2014 11:32 AM
> *To:* [email protected]
> *Subject:* Re: Merging small files
>
> As for why it isn't appropriate to discuss overly vendor-specific topics
> on a vendor-neutral Apache mailing list, check out this thread:
>
> http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201309.mbox/%3ccaj1nbzcocw1rsncf3h-ikjkk4uqxqxt7avsj-6nahq_e4dx...@mail.gmail.com%3E
>
> You can always discuss vendor-specific issues in their respective mailing
> lists.
>
> As for merging files: yes, one can use HBase, but then you have to keep in
> mind that you are adding the overhead of developing and maintaining
> another store (i.e. HBase). If your use case can be satisfied with HDFS
> alone, then why not keep it simple? And given the knowledge of the
> requirements that the OP provided, I think the Sequence File format should
> work, as I suggested initially. Of course, if things get too complicated
> from a requirements perspective, then one might try out HBase.
>
> Regards,
> Shahab
>
>
> On Sun, Jul 20, 2014 at 12:24 PM, Adaryl "Bob" Wakefield, MBA <[email protected]> wrote:
>
>> It isn't? I don't want to hijack the thread or anything, but it seems to
>> me that MapR is an implementation of Hadoop, and this is a great place to
>> discuss its merits vis-à-vis the Hortonworks or Cloudera offerings.
>>
>> A little bit more on topic: every single thing I read or watch about
>> Hadoop says that many small files are a bad idea and that you should merge
>> them into larger files. I'll take this a step further: if your invoice
>> data is so small, perhaps Hadoop isn't the proper solution to whatever it
>> is you are trying to do, and a more traditional RDBMS approach would be
>> more appropriate. Someone suggested HBase, and I was going to suggest
>> maybe one of the other NoSQL databases; however, I remember that Eddie
>> Satterly of Splunk says that financial data is the ONE use case where a
>> traditional approach is more appropriate. You can watch his talk here:
>>
>> https://www.youtube.com/watch?v=-N9i-YXoQBE&index=77&list=WL
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>>
>> *From:* Kilaru, Sambaiah <[email protected]>
>> *Sent:* Sunday, July 20, 2014 3:47 AM
>> *To:* [email protected]
>> *Subject:* Re: Merging small files
>>
>> This is not the place to discuss the merits or demerits of MapR. Small
>> files screw up very badly with MapR: small files go into one container
>> (to fill up 256 MB, or whatever the container size is), and with locality
>> most of the mappers go to three datanodes.
>>
>> You should be looking into the Sequence File format.
>>
>> Thanks,
>> Sam
>>
>> From: "M. C. Srivas" <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Sunday, July 20, 2014 at 8:01 AM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Merging small files
>>
>> You should look at MapR .... a few hundred billion small files is
>> absolutely no problem.
>> (Disclosure: I work for MapR.)
>>
>>
>> On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Has anybody worked on a retail use case? If my production Hadoop
>>> cluster block size is 256 MB, but we have to process retail invoice
>>> data where each invoice is merely, let's say, 4 KB, do we merge the
>>> invoice data to make one large file, say 1 GB? What is the best
>>> practice in this scenario?
>>>
>>>
>>> Regards,
>>> Shashi
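P.S. For completeness, here is a companion sketch (same placeholder path as above, same caveats) showing that the original file names and payloads survive the merge and can be read back record by record:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class InvoiceReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same placeholder path the writer sketch used.
        Path in = new Path("hdfs:///invoices/merged/invoices.seq");

        SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(in));
        try {
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            // Each record is one original invoice file: name -> bytes.
            while (reader.next(key, value)) {
                System.out.println(key + " -> " + value.getLength() + " bytes");
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}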
