It is not advisable to have many small files in HDFS: to highlight one major issue, the NameNode keeps metadata for every file in memory, so millions of small files put significant memory load on it.
Off the top of my head, some basic ideas: you can either combine invoices into a bigger text file containing a collection of records, where each record is one invoice, or use a SequenceFile, where the key is the invoice id and the value is the invoice details.

Regards,
Shahab

On Jul 19, 2014 1:30 PM, "Shashidhar Rao" <[email protected]> wrote:
> Hi,
>
> Has anybody worked on a retail use case? My production Hadoop cluster
> block size is 256 MB, but each retail invoice record is merely, say,
> 4 KB. Do we merge the invoice data to make one large file, say 1 GB?
> What is the best practice in this scenario?
>
> Regards
> Shashi
