That’s an interesting use case for Storm. People usually talk about Storm in 
terms of processing streams like Twitter or events like web logs. I’ve never 
seen it used for processing files, especially EDI files, which usually arrive 
as groups of transactions rather than atomic events like a single line item on 
an invoice.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba

From: Mark Kerzner 
Sent: Sunday, July 20, 2014 2:08 PM
To: Hadoop User 
Subject: Re: Merging small files

Bob, 

You don't have to wait for batch. Here is my project (under development) in 
which I am using Storm for continuous file processing: 
https://github.com/markkerzner/3VEed
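
As a rough illustration of the pattern (a simplified sketch, not the actual 
3VEed code; the inbox directory, field names, and error handling here are 
assumptions):

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Map;

// A minimal continuous-file-processing spout: poll a drop directory
// and emit one tuple per file (file name + raw bytes).
public class FileSpout extends BaseRichSpout {
    private final String inboxPath;            // directory to watch (assumed)
    private transient SpoutOutputCollector collector;
    private transient File inboxDir;

    public FileSpout(String inboxPath) {
        this.inboxPath = inboxPath;
    }

    @Override
    public void open(Map conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;
        this.inboxDir = new File(inboxPath);
    }

    @Override
    public void nextTuple() {
        File[] files = inboxDir.listFiles();
        if (files == null || files.length == 0) {
            Utils.sleep(100);                  // nothing new yet; back off briefly
            return;
        }
        File f = files[0];
        try {
            byte[] body = Files.readAllBytes(f.toPath());
            collector.emit(new Values(f.getName(), body));
            f.delete();                        // naive; real code would anchor/ack tuples
        } catch (IOException e) {
            // leave the file in place and retry on the next nextTuple() call
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("filename", "body"));
    }
}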

Mark



On Sun, Jul 20, 2014 at 1:31 PM, Adaryl "Bob" Wakefield, MBA 
<[email protected]> wrote:

  Yeah, sorry, I’m not talking about processing the files in Oracle. I mean 
collect/store invoices in Oracle, then flush them in a batch to Hadoop. This is 
not real time, right? So you take your EDI, CSV, and XML from their sources and 
store them in Oracle. Once you have a decent amount, flush it to Hadoop in one 
big file, process it, then store the results of the processing back in Oracle.

  Source file -> Oracle -> Hadoop -> Oracle

  Adaryl "Bob" Wakefield, MBA
  Principal
  Mass Street Analytics
  913.938.6685
  www.linkedin.com/in/bobwakefieldmba

  From: Shashidhar Rao 
  Sent: Sunday, July 20, 2014 12:47 PM
  To: [email protected] 
  Subject: Re: Merging small files

  Spring Batch is used to process the files, which come in EDI, CSV, and XML 
formats, and store them into Oracle after processing, but this is for a very 
small division. Imagine invoices generated by roughly 5 million customers every 
week from all stores plus online purchases. The time to process such massive 
data would not be acceptable, even though Oracle would be a good choice as 
Adaryl Bob has suggested. Each invoice is not even 10 KB, and we have no choice 
but to use Hadoop, but we need further processing of the input files just to 
make Hadoop happy.




  On Sun, Jul 20, 2014 at 10:07 PM, Adaryl "Bob" Wakefield, MBA 
<[email protected]> wrote:

    “Even if we kept the discussion to the mailing list's technical Hadoop 
usage focus, any company/organization looking to use a distro is going to have 
to consider the costs, support, platform, partner ecosystem, market share, 
company strategy, etc.”

    Yeah, good point.

    Adaryl "Bob" Wakefield, MBA
    Principal
    Mass Street Analytics
    913.938.6685
    www.linkedin.com/in/bobwakefieldmba

    From: Shahab Yunus 
    Sent: Sunday, July 20, 2014 11:32 AM
    To: [email protected] 
    Subject: Re: Merging small files

    As for why it isn't appropriate to discuss vendor-specific topics too much 
on a vendor-neutral Apache mailing list, check out this thread: 
    
http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201309.mbox/%3ccaj1nbzcocw1rsncf3h-ikjkk4uqxqxt7avsj-6nahq_e4dx...@mail.gmail.com%3E


    You can always discuss vendor-specific issues on their respective mailing 
lists.

    As for merging files: yes, one can use HBase, but then you have to keep in 
mind that you are adding the overhead of developing and maintaining another 
store (i.e., HBase). If your use case can be satisfied with HDFS alone, then 
why not keep it simple? Given the requirements the OP provided, I think the 
SequenceFile format should work, as I suggested initially. Of course, if things 
get too complicated from a requirements perspective, then one might try out 
HBase.
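
    A minimal sketch of that approach (the paths and the key/value layout below 
are assumptions, not something the OP specified) could look like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    import java.io.File;
    import java.nio.file.Files;

    // Pack many small invoice files into one SequenceFile on HDFS:
    // key = original file name, value = raw file contents.
    public class SmallFilePacker {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path out = new Path("/data/invoices/invoices.seq");  // assumed target

            SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(out),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class),
                    SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK));
            try {
                // assumed local staging directory holding the small files
                File[] invoices = new File("/data/invoices/incoming").listFiles();
                for (File f : invoices) {
                    byte[] body = Files.readAllBytes(f.toPath());
                    writer.append(new Text(f.getName()), new BytesWritable(body));
                }
            } finally {
                IOUtils.closeStream(writer);
            }
        }
    }

    Downstream jobs then read one splittable, block-compressed file instead of 
millions of tiny ones, and the original file names survive as keys.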

    Regards,
    Shahab



    On Sun, Jul 20, 2014 at 12:24 PM, Adaryl "Bob" Wakefield, MBA 
<[email protected]> wrote:

      It isn’t? I don’t wanna hijack the thread or anything, but it seems to me 
that MapR is an implementation of Hadoop, and this is a great place to discuss 
its merits vis-à-vis the Hortonworks or Cloudera offerings.

      A little bit more on topic: every single thing I read or watch about 
Hadoop says that having many small files is a bad idea and that you should 
merge them into larger files. I’ll take this a step further: if your invoice 
data is so small, perhaps Hadoop isn’t the proper solution to whatever it is 
you are trying to do, and a more traditional RDBMS approach would be more 
appropriate. Someone suggested HBase, and I was going to suggest maybe one of 
the other NoSQL databases; however, I remember that Eddie Satterly of Splunk 
says that financial data is the ONE use case where a traditional approach is 
more appropriate. You can watch his talk here:

      https://www.youtube.com/watch?v=-N9i-YXoQBE&index=77&list=WL

      Adaryl "Bob" Wakefield, MBA
      Principal
      Mass Street Analytics
      913.938.6685
      www.linkedin.com/in/bobwakefieldmba

      From: Kilaru, Sambaiah 
      Sent: Sunday, July 20, 2014 3:47 AM
      To: [email protected] 
      Subject: Re: Merging small files

      This is not the place to discuss the merits or demerits of MapR. Small 
files screw up very badly with MapR: they go into one container (filling up 
256 MB, or whatever the container size is), and with locality, most of the 
mappers go to three datanodes.

      You should be looking into the SequenceFile format.

      Thanks,
      Sam

      From: "M. C. Srivas" <[email protected]>
      Reply-To: "[email protected]" <[email protected]>
      Date: Sunday, July 20, 2014 at 8:01 AM
      To: "[email protected]" <[email protected]>
      Subject: Re: Merging small files


      You should look at MapR ... a few hundred billion small files are 
absolutely no problem. (Disclosure: I work for MapR.)



      On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao 
<[email protected]> wrote:

        Hi,

        Has anybody worked on a retail use case? My production Hadoop cluster’s 
block size is 256 MB, but if we have to process retail invoice data, each 
invoice is merely, let’s say, 4 KB. Do we merge the invoice data to make one 
large file, say 1 GB? What is the best practice in this scenario?



        Regards

        Shashi



