bq. it calls the persistence method asynchronously

Assuming the persistence method is still executing when the next threshold value is reached, do you have other threads to do persistence? If so, how many threads can potentially run at the same time?
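
One way to make that number explicit is to submit each persistence call to a fixed-size pool. The following is only a sketch using the plain JDK: the class name, the method name and the sleep that stands in for the real table.put() work are all hypothetical, and nothing here is taken from your application.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: runs persistence batches on a bounded pool and
// reports the highest number of batches that were in flight at once.
class BoundedPersistence {

    static int peakConcurrency(int maxThreads, int batches) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();

        for (int i = 0; i < batches; i++) {
            pool.submit(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max); // record the high-water mark
                try {
                    Thread.sleep(5); // stand-in for the real persistence work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    running.decrementAndGet();
                }
            });
        }

        // Stop accepting new batches and wait for the queued ones to finish.
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }
}
```

With a pool like this, BoundedPersistence.peakConcurrency(4, 20) can never report more than 4, whatever the decoding speed; batches that arrive while all threads are busy simply wait in the pool's queue.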
How many regions does the table have? What is the distribution of parameter Ids in the input file? One possibility is that the parameter Ids are sequential w.r.t. region boundaries, so that the writes end up going region by region.

On Wed, Nov 23, 2016 at 8:01 AM, schausson <[email protected]> wrote:

> Hi,
>
> I am new to HBase and I'm facing performance issues ...
>
> Short story: I want to persist 10000000 values in HBase, and it takes the same
> time on a basic sandbox (HDP Hadoop sandbox with a single region server node)
> as it takes on our "production" cluster (which comprises 12 region servers
> with higher capabilities than my developer's laptop ...)
>
> Detailed case:
>
> Basically, the use case is: my Java application receives a binary file that
> contains timeseries, decodes them and stores the decoded data into a single
> HBase table.
> HBase table design: we store one parameter per row, and we create one
> column per timestamp to store the associated value.
> My test case is based on an input file that spawns ~2000 rows/parameters
> containing ~5000 values per row (=> around 10000000 values to store in my
> HBase table in the end)
>
> For this purpose, my application uses the HBase client API.
> Basically, my code proceeds as follows: it decodes the parameter timeseries
> from the input file and stores these values in a map<paramId, List<value>>.
>
> When it reaches 10000 values (a threshold that may be changed), it calls the
> persistence method asynchronously and continues decoding until the end
> of the input file.
> The persistence method proceeds like this (simplified code):
>
>     for (paramId : map.keys) {
>         Put put = new Put(paramId);
>         for (value : map.get(paramId)) {
>             put.addColumn(family, columnName, value);
>         }
>         table.put(put);
>     }
>
> Choosing a threshold value of 10000 leads to ~1000 calls to the persistence
> method. Each call generates 2000 calls to the table.put() method, each put
> containing ~5 columns.
>
> When I run this on the HDP sandbox on my laptop (single region server), it
> completes in less than 2 minutes.
> When I run this on our production cluster (12 region servers), it completes
> in 2 minutes and sometimes more.
>
> My question is: is the write load distributed across all the region
> servers? Obviously not... What should I do if I want my application to
> scale properly when we add additional region servers?
>
> I don't know if I gave enough information, so please do not hesitate to ask
> me for more detail if needed, but any help would be greatly appreciated ...
>
> Regards
>
> Sebastien
>
> --
> View this message in context: http://apache-hbase.679495.n3.nabble.com/Writting-bottleneck-in-HBase-tp4084656.html
> Sent from the HBase User mailing list archive at Nabble.com.
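
To illustrate the sequential-Id case mentioned at the top of this reply, here is a small plain-JDK sketch (no HBase dependency). It assumes hypothetical region start keys and hypothetical row keys of the form paramNNNN; your real split points and key format may well differ. It counts how many regions a window of consecutive puts would touch, on the basis that a row belongs to the region with the greatest start key not exceeding the row key.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical model of a pre-split table: maps each region start key
// to a region index and looks up the owning region of a row key.
class RegionSpread {

    // The first region conventionally has an empty start key.
    static TreeMap<String, Integer> regions(String... startKeys) {
        TreeMap<String, Integer> byStartKey = new TreeMap<>();
        for (int i = 0; i < startKeys.length; i++) {
            byStartKey.put(startKeys[i], i);
        }
        return byStartKey;
    }

    // Number of distinct regions hit by the given row keys.
    // String ordering matches byte ordering here only because the keys are ASCII.
    static int regionsTouched(TreeMap<String, Integer> regions, List<String> rowKeys) {
        Set<Integer> hit = new HashSet<>();
        for (String key : rowKeys) {
            hit.add(regions.floorEntry(key).getValue());
        }
        return hit.size();
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> table = regions("", "param0500", "param1000", "param1500");

        List<String> sequential = new ArrayList<>();
        List<String> spread = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            sequential.add(String.format("param%04d", i));  // param0000 .. param0099
            spread.add(String.format("param%04d", i * 20)); // param0000 .. param1980
        }

        System.out.println(regionsTouched(table, sequential)); // prints 1
        System.out.println(regionsTouched(table, spread));     // prints 4
    }
}
```

In this model, 100 consecutive Ids all land in the first region, while 100 Ids spaced across the key range reach all four. If your puts are issued in sorted Id order, each stretch of puts may be served by a single region server at a time, which would be consistent with the cluster performing no better than the sandbox.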
