>What would be the behavior for inserting data using map reduce job? would the recently added records be in the memstore? or I need to load them for read queries after the insert is done?
With MapReduce you have two options for insertion.

1. The job writes HFiles directly as its output (using HFileOutputFormat). The memstore does not come into the picture here.
2. The mappers call HTable#put(). Here the memstore does come into the picture. (These are map-only jobs.)

When you use the ImportTSV tool and set "importtsv.bulk.output", it takes the first route. JFYI, have a look at the ImportTSV tool documentation.

-Anoop-

On Sat, Aug 24, 2013 at 4:10 AM, Gautam Borah <[email protected]> wrote:

> Thanks Ted for your response, and clarifying the behavior for using HTable
> interface.
>
> What would be the behavior for inserting data using map reduce job? would
> the recently added records be in the memstore? or I need to load them for
> read queries after the insert is done?
>
> Thanks,
> Gautam
>
>
> On Fri, Aug 23, 2013 at 2:43 PM, Ted Yu <[email protected]> wrote:
>
> > Assuming you are using 0.94, the default value
> > for hbase.regionserver.global.memstore.lowerLimit is 0.35
> >
> > Meaning, memstore on each region server would be able to hold 3000M * 0.35
> > / 60 = 17.5 mil records (roughly).
> >
> > bq. If I use HTable interface, would the inserted data be in the HBase
> > cache, before flushing to the files, for immediate read queries?
> >
> > Yes.
> >
> > Cheers
> >
> >
> > On Fri, Aug 23, 2013 at 12:01 PM, Gautam Borah <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Average size of my records is 60 bytes - 20 bytes Key and 40 bytes value,
> > > table has one column family.
> > >
> > > I have setup a cluster for testing - 1 master and 3 region servers. Each
> > > have a heap size of 3 GB, single cpu.
> > >
> > > I have pre-split the table into 30 regions. I do not have to keep data
> > > forever, I could purge older records periodically.
> > >
> > > Thanks,
> > >
> > > Gautam
> > >
> > >
> > >
> > > On Fri, Aug 23, 2013 at 3:20 AM, Ted Yu <[email protected]> wrote:
> > >
> > > > Can you tell us the average size of your records and how much heap is
> > > > given to the region servers ?
> > > >
> > > > Thanks
> > > >
> > > > On Aug 23, 2013, at 12:11 AM, Gautam Borah <[email protected]> wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > I have an use case where I need to write 1 million to 10 million records
> > > > > periodically (with intervals of 1 minutes to 10 minutes), into an HBase
> > > > > table.
> > > > >
> > > > > Once the insert is completed, these records are queried immediately from
> > > > > another program - multiple reads.
> > > > >
> > > > > So, this is one massive write followed by many reads.
> > > > >
> > > > > I have two approaches to insert these records into the HBase table -
> > > > >
> > > > > Use HTable or HTableMultiplexer to stream the data to HBase table.
> > > > >
> > > > > or
> > > > >
> > > > > Write the data to HDFS store as a sequence file (avro in my case) - run map
> > > > > reduce job using HFileOutputFormat and then load the output files into
> > > > > HBase cluster.
> > > > > Something like,
> > > > >
> > > > > LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
> > > > > loader.doBulkLoad(new Path(outputDir), hTable);
> > > > >
> > > > >
> > > > > In my use case which approach would be better?
> > > > >
> > > > > If I use HTable interface, would the inserted data be in the HBase cache,
> > > > > before flushing to the files, for immediate read queries?
> > > > >
> > > > > If I use map reduce job to insert, would the data be loaded into the HBase
> > > > > cache immediately? or only the output files would be copied to respective
> > > > > hbase table specific directories?
> > > > >
> > > > > So, which approach is better for write and then immediate multiple read
> > > > > operations?
> > > > >
> > > > > Thanks,
> > > > > Gautam
> > > > >
> > > >
> > >
> >
>
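
Ted's rough memstore estimate from earlier in the thread (3000M * 0.35 / 60) can be reproduced with a few lines of Java. This is only a sketch under stated assumptions: the class and method names are hypothetical, "3000M" is read as 3000 * 10^6 bytes, and only the raw 60 bytes per record are counted. Any per-cell overhead in the memstore would push the real figure lower.

```java
// Hypothetical helper reproducing the back-of-envelope estimate from the thread.
final class MemstoreEstimate {

    // heapBytes:   region server heap size in bytes
    // lowerLimit:  value of hbase.regionserver.global.memstore.lowerLimit (0.35 per the thread)
    // recordBytes: average raw record size (key + value), ignoring any in-memory overhead
    static long estimateRecords(long heapBytes, double lowerLimit, int recordBytes) {
        // Round to the nearest record so floating-point noise cannot shift the result by one.
        return Math.round(heapBytes * lowerLimit / recordBytes);
    }

    public static void main(String[] args) {
        long heapBytes = 3000L * 1000 * 1000; // "3000M", assumed to mean 3000 * 10^6 bytes
        double lowerLimit = 0.35;             // default quoted in the thread for 0.94
        int recordBytes = 60;                 // 20-byte key + 40-byte value

        // Prints 17500000, i.e. the "17.5 mil records (roughly)" figure per region server.
        System.out.println(estimateRecords(heapBytes, lowerLimit, recordBytes));
    }
}
```

With these inputs a single batch of 1 to 10 million records stays under the per-server figure, which is consistent with the "Yes" given for the HTable case.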
