[jira] [Created] (HBASE-20056) Performance optimization on MultiTableInputFormatBase#getSplits()
ShivaKumar SS created HBASE-20056:
-------------------------------------

             Summary: Performance optimization on MultiTableInputFormatBase#getSplits()
                 Key: HBASE-20056
                 URL: https://issues.apache.org/jira/browse/HBASE-20056
             Project: HBase
          Issue Type: Improvement
          Components: hbase, mapreduce
    Affects Versions: 1.0.1
            Reporter: ShivaKumar SS

Currently, this method iterates over the list of Scan objects to compute splits, and on each iteration it opens a Connection object and then closes it, which is expensive.

It can be optimized so that a single HBase connection is used to compute the splits for all Scan objects. This optimization will help reduce the launch time of MR jobs.

We are using a 15-node cluster with around 120 Scan objects. Launching a job takes about 5 minutes; with the optimized code, it takes less than about 30 seconds.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
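
For HBASE-20056, the difference between opening a connection per Scan and sharing one connection across all Scans can be sketched as follows. This is only an illustration of the pattern, not the actual patch: FakeConnection, splitsFor, and both method names are hypothetical stand-ins written for this sketch and are not part of the HBase API. The only thing measured is how many times a connection is opened.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedConnectionSketch {

    /** Hypothetical stand-in for an expensive cluster connection; NOT the HBase API. */
    static final class FakeConnection implements AutoCloseable {
        /** Counts how many connections have been opened so far. */
        static final AtomicInteger OPENED = new AtomicInteger();

        FakeConnection() {
            OPENED.incrementAndGet();
        }

        /** Pretends to compute the splits for one scan (one split per scan here). */
        List<String> splitsFor(String scan) {
            List<String> splits = new ArrayList<>();
            splits.add(scan + "-split");
            return splits;
        }

        @Override
        public void close() {
            // Nothing to release in this sketch.
        }
    }

    /** Pattern described in the report: one connection opened and closed per scan. */
    static List<String> splitsPerScanConnection(List<String> scans) {
        List<String> splits = new ArrayList<>();
        for (String scan : scans) {
            try (FakeConnection conn = new FakeConnection()) {
                splits.addAll(conn.splitsFor(scan));
            }
        }
        return splits;
    }

    /** Proposed pattern: a single connection reused for every scan. */
    static List<String> splitsSharedConnection(List<String> scans) {
        List<String> splits = new ArrayList<>();
        try (FakeConnection conn = new FakeConnection()) {
            for (String scan : scans) {
                splits.addAll(conn.splitsFor(scan));
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> scans = new ArrayList<>();
        for (int i = 0; i < 120; i++) {
            scans.add("scan-" + i);
        }

        FakeConnection.OPENED.set(0);
        splitsPerScanConnection(scans);
        System.out.println("per-scan opens: " + FakeConnection.OPENED.get()); // 120

        FakeConnection.OPENED.set(0);
        splitsSharedConnection(scans);
        System.out.println("shared opens: " + FakeConnection.OPENED.get()); // 1
    }
}
```

Both methods return the same splits; only the number of connection setups changes, which is where the reported launch-time saving would come from if connection setup dominates.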
[jira] [Created] (HBASE-20844) Duplicate rows returned while hbase snapshot reads
ShivaKumar SS created HBASE-20844:
-------------------------------------

             Summary: Duplicate rows returned while hbase snapshot reads
                 Key: HBASE-20844
                 URL: https://issues.apache.org/jira/browse/HBASE-20844
             Project: HBase
          Issue Type: Bug
          Components: mapreduce, spark
    Affects Versions: 1.3.1
         Environment: Cluster details: Java 1.7, HBase 1.3.1, Spark 1.6.1
            Reporter: ShivaKumar SS

We take a snapshot from code and read the data using both MR and Spark; both approaches return duplicate records.

On the API side, {{org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat}} is used.

The snapshot was taken while the table was in the middle of a region split. We suspect that data is being returned for both the parent and the daughter regions.
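
For HBASE-20844, the suspected cause can be illustrated with a small sketch: if a snapshot lists both a split parent and its two daughters, reading every listed region covers the parent's key range twice. The code below is not HBase code and does not claim to be the eventual fix; Region, its fields, and regionsToRead are hypothetical names invented for this illustration. It only shows the idea of skipping regions flagged as split parents before generating input splits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitParentFilterSketch {

    /** Hypothetical region descriptor for this sketch; NOT an HBase class. */
    static final class Region {
        final String name;
        final String startKey;
        final String endKey;
        final boolean splitParent; // true if this region has been split into daughters

        Region(String name, String startKey, String endKey, boolean splitParent) {
            this.name = name;
            this.startKey = startKey;
            this.endKey = endKey;
            this.splitParent = splitParent;
        }
    }

    /** Keeps only regions that are not split parents, so each key range is read once. */
    static List<Region> regionsToRead(List<Region> snapshotRegions) {
        List<Region> result = new ArrayList<>();
        for (Region region : snapshotRegions) {
            if (!region.splitParent) {
                result.add(region);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A parent covering [a, z) plus its two daughters, as a mid-split snapshot might list them.
        List<Region> snapshotRegions = Arrays.asList(
                new Region("parent", "a", "z", true),
                new Region("daughterA", "a", "m", false),
                new Region("daughterB", "m", "z", false));

        for (Region region : regionsToRead(snapshotRegions)) {
            System.out.println(region.name + " [" + region.startKey + ", " + region.endKey + ")");
        }
    }
}
```

In this toy input, reading all three regions would return every row in [a, z) twice, while reading only the two daughters returns each row once.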