Hi,

The average size of my records is 60 bytes - a 20-byte key and a 40-byte value. The table has one column family.

I have set up a cluster for testing - 1 master and 3 region servers. Each has a heap size of 3 GB and a single CPU. I have pre-split the table into 30 regions. I do not have to keep data forever; I can purge older records periodically.

Thanks,
Gautam

On Fri, Aug 23, 2013 at 3:20 AM, Ted Yu <[email protected]> wrote:

> Can you tell us the average size of your records and how much heap is
> given to the region servers?
>
> Thanks
>
> On Aug 23, 2013, at 12:11 AM, Gautam Borah <[email protected]> wrote:
>
> > Hello all,
> >
> > I have a use case where I need to write 1 million to 10 million records
> > periodically (at intervals of 1 minute to 10 minutes) into an HBase
> > table.
> >
> > Once the insert is completed, these records are queried immediately from
> > another program - multiple reads.
> >
> > So, this is one massive write followed by many reads.
> >
> > I have two approaches for inserting these records into the HBase table -
> >
> > Use HTable or HTableMultiplexer to stream the data to the HBase table,
> >
> > or
> >
> > Write the data to HDFS as a sequence file (Avro in my case), run a
> > MapReduce job using HFileOutputFormat, and then load the output files
> > into the HBase cluster. Something like:
> >
> > LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
> > loader.doBulkLoad(new Path(outputDir), hTable);
> >
> > In my use case, which approach would be better?
> >
> > If I use the HTable interface, would the inserted data be in the HBase
> > cache, before flushing to the files, for immediate read queries?
> >
> > If I use a MapReduce job to insert, would the data be loaded into the
> > HBase cache immediately? Or would only the output files be copied into
> > the respective HBase table-specific directories?
> >
> > So, which approach is better for a write followed by immediate multiple
> > read operations?
> >
> > Thanks,
> > Gautam
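For the streaming (HTable) path, whether a freshly written batch is still readable from the memstores depends on whether it fits before they flush. A rough back-of-the-envelope sketch, using the numbers from this thread (10M records x 60 bytes, 3 region servers with 3 GB heap each, 30 regions) and assuming the HBase defaults hbase.regionserver.global.memstore.size = 0.4 and hbase.hregion.memstore.flush.size = 128 MB; the ~100-byte per-cell KeyValue overhead figure is an assumption for illustration, not a measured value:

```java
public class MemstoreEstimate {
    public static void main(String[] args) {
        long records = 10_000_000L;
        // Raw payload: 20-byte key + 40-byte value (from the thread).
        long rawBytesPerRecord = 60L;
        // Assumed per-cell overhead for KeyValue metadata (row, family,
        // qualifier, timestamp bookkeeping) - a rough guess, not measured.
        long overheadPerRecord = 100L;

        long batchBytes = records * (rawBytesPerRecord + overheadPerRecord);

        int regionServers = 3;
        long heapPerServer = 3L * 1024 * 1024 * 1024;   // 3 GB
        double globalMemstoreFraction = 0.4;            // default global limit
        long flushSize = 128L * 1024 * 1024;            // default per-region flush size
        int regions = 30;

        // Upper bound on total memstore memory across the cluster:
        long memstoreCapacity =
            (long) (regionServers * heapPerServer * globalMemstoreFraction);
        // Per-region flush threshold caps what stays in memory, assuming the
        // pre-split keeps writes evenly spread over the 30 regions:
        long flushCapacity = (long) regions * flushSize;

        System.out.printf("batch size        = %d MB%n", batchBytes >> 20);
        System.out.printf("memstore capacity = %d MB (global limit)%n",
                memstoreCapacity >> 20);
        System.out.printf("flush capacity    = %d MB (30 regions x 128 MB)%n",
                flushCapacity >> 20);
        System.out.println("fits before flush: " + (batchBytes < flushCapacity));
    }
}
```

Under these assumptions a 10M-record batch (~1.5 GB with overhead) stays under both limits, so recent puts would still be served from the memstores; the bulk-load path, by contrast, writes HFiles directly, so reads would go to the block cache only after the first read of each block.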
