[jira] [Created] (HBASE-20056) Performance optimization on MultiTableInputFormatBase#getSplits()

2018-02-23 Thread ShivaKumar SS (JIRA)
ShivaKumar SS created HBASE-20056:
-

 Summary: Performance optimization on MultiTableInputFormatBase#getSplits()
 Key: HBASE-20056
 URL: https://issues.apache.org/jira/browse/HBASE-20056
 Project: HBase
  Issue Type: Improvement
  Components: hbase, mapreduce
Affects Versions: 1.0.1
Reporter: ShivaKumar SS


Currently, this method iterates over the list of scan objects to compute 
splits, and on each iteration it opens and then closes a Connection object, 
which is expensive.

It can be optimized so that a single HBase connection is used to compute the 
splits for all the scan objects.
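
The difference between the two approaches can be sketched without any HBase 
dependency. The example below is only an illustration under stated 
assumptions: FakeConnection, splitsFor, and both method names are hypothetical 
stand-ins and are not part of the HBase API or of the actual patch. It counts 
how many connections are opened when one is created per scan versus when one 
is shared across all scans.

```java
import java.util.ArrayList;
import java.util.List;

public class SharedConnectionSplits {

    // Hypothetical stand-in for an expensive cluster connection (NOT the HBase API).
    static final class FakeConnection implements AutoCloseable {
        static int opened = 0;          // counts how many connections were created

        FakeConnection() { opened++; }

        // Pretend every scan yields two splits.
        List<String> splitsFor(String scan) {
            List<String> out = new ArrayList<>();
            out.add(scan + "#split-0");
            out.add(scan + "#split-1");
            return out;
        }

        @Override public void close() { /* nothing to release in this sketch */ }
    }

    // Pattern described in the report: open and close a connection for every scan.
    static List<String> splitsPerScanConnection(List<String> scans) {
        List<String> splits = new ArrayList<>();
        for (String scan : scans) {
            try (FakeConnection conn = new FakeConnection()) {
                splits.addAll(conn.splitsFor(scan));
            }
        }
        return splits;
    }

    // Proposed pattern: open one connection and reuse it for all scans.
    static List<String> splitsSharedConnection(List<String> scans) {
        List<String> splits = new ArrayList<>();
        try (FakeConnection conn = new FakeConnection()) {
            for (String scan : scans) {
                splits.addAll(conn.splitsFor(scan));
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> scans = new ArrayList<>();
        for (int i = 0; i < 120; i++) {
            scans.add("scan-" + i);
        }

        FakeConnection.opened = 0;
        splitsPerScanConnection(scans);
        System.out.println("per-scan connections opened: " + FakeConnection.opened); // 120

        FakeConnection.opened = 0;
        splitsSharedConnection(scans);
        System.out.println("shared connections opened: " + FakeConnection.opened);   // 1
    }
}
```

Both methods return the same splits; only the number of connection setups 
differs, which is where the reported launch-time cost would come from.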

This optimization reduces the launch time of MR jobs.

We are using a 15-node cluster with around 120 scan objects. Launching a job 
takes about 5 minutes; with the optimized code, it takes less than about 30 
seconds.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20844) Duplicate rows returned while hbase snapshot reads

2018-07-04 Thread ShivaKumar SS (JIRA)
ShivaKumar SS created HBASE-20844:
-

 Summary: Duplicate rows returned while hbase snapshot reads
 Key: HBASE-20844
 URL: https://issues.apache.org/jira/browse/HBASE-20844
 Project: HBase
  Issue Type: Bug
  Components: mapreduce, spark
Affects Versions: 1.3.1
 Environment: Cluster Details 

Java 1.7
HBase 1.3.1
Spark 1.6.1
Reporter: ShivaKumar SS


We take a snapshot from code and read its data using both MR and Spark; both 
approaches return duplicate records.

On the API side, {{org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat}} 
is used.

The snapshot was taken while the table was in the middle of a region split.

We suspect this is because data is being returned for both the parent and the 
daughter regions.
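
The suspected mechanism can be modelled in a few lines. This is a toy sketch 
of the hypothesis only, not the HBase implementation and not a confirmed 
diagnosis: Region, readRows, and withoutSplitParents are hypothetical names, 
and row keys are plain integers. If a reader scans a split parent as well as 
its two daughters, every row in the parent's key range is returned twice; 
skipping the parent removes the duplicates.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitParentFilter {

    // Hypothetical region descriptor: key range [start, end) plus a split-parent flag.
    static final class Region {
        final String name;
        final int start;
        final int end;
        final boolean splitParent;

        Region(String name, int start, int end, boolean splitParent) {
            this.name = name;
            this.start = start;
            this.end = end;
            this.splitParent = splitParent;
        }
    }

    // One parent covering [0, 4) and the two daughters it was split into.
    static List<Region> sampleRegions() {
        List<Region> regions = new ArrayList<>();
        regions.add(new Region("parent", 0, 4, true));
        regions.add(new Region("daughter-a", 0, 2, false));
        regions.add(new Region("daughter-b", 2, 4, false));
        return regions;
    }

    // Emits every row key covered by each region, one region at a time.
    static List<Integer> readRows(List<Region> regions) {
        List<Integer> rows = new ArrayList<>();
        for (Region r : regions) {
            for (int key = r.start; key < r.end; key++) {
                rows.add(key);
            }
        }
        return rows;
    }

    // Drops regions flagged as split parents so only the daughters are read.
    static List<Region> withoutSplitParents(List<Region> regions) {
        List<Region> kept = new ArrayList<>();
        for (Region r : regions) {
            if (!r.splitParent) {
                kept.add(r);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Region> regions = sampleRegions();
        System.out.println(readRows(regions));                      // [0, 1, 2, 3, 0, 1, 2, 3]
        System.out.println(readRows(withoutSplitParents(regions))); // [0, 1, 2, 3]
    }
}
```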



