See below for more that I found on item 3. Cheers.
----
Saad

On Sat, Mar 10, 2018 at 7:17 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

> Hi,
>
> I am running a Spark job (Spark 2.2.1) on an EMR cluster in AWS. There is
> no HBase installed on the cluster, only HBase libs linked to my Spark app.
> We are reading the snapshot info from an HBase folder in S3, using the
> TableSnapshotInputFormat class from HBase 1.4.0 so that the Spark job reads
> snapshot info directly from the S3-based filesystem instead of going
> through any region server.
>
> I have observed a few behaviors while debugging performance that are
> concerning. Some we could mitigate, and on others I am looking for clarity:
>
> 1) The TableSnapshotInputFormatImpl code tries to get locality
> information for the region splits. For a snapshot with a large number of
> files (over 350000 in our case), this causes a single-threaded scan of all
> the file listings in the driver. And it was useless, because there is
> really no useful locality information to glean since all the files are in
> S3 and not HDFS. So I was forced to make a copy of
> TableSnapshotInputFormatImpl.java in our code and control this with a
> config setting I made up. That got rid of the hours-long scan, so I am good
> with this part for now.
>
> 2) I have set a single column family in the Scan that I set on the HBase
> configuration via
>
> scan.addFamily(str.getBytes())
>
> hBaseConf.set(TableInputFormat.SCAN, convertScanToString(scan))
>
> But when this code is executing under Spark and I observe the threads and
> logs on the Spark executors, it is reading from S3 files for a column
> family that was not included in the scan. This column family was
> intentionally excluded because it is much larger than the others, and we
> wanted to avoid the cost.
>
> Any advice on what I am doing wrong would be appreciated.
>
> 3) We also explicitly set caching of blocks to false on the scan, although
> I see that in TableSnapshotInputFormatImpl.java it is again set to false
> internally. But when running the Spark job, some executors were taking
> much longer than others, and when I observe their threads, I see periodic
> messages about a few hundred megs of RAM used by the block cache, and the
> thread is sitting there reading data from S3, and is occasionally blocked
> by a couple of other threads that have the "hfile-prefetcher" name in them.
> Going back to 2) above, they seem to be reading the wrong column family,
> but in this item I am more concerned about why they appear to be
> prefetching blocks and caching them, when the Scan object has a setting to
> not cache blocks at all?

I think I figured out item 3: the column family descriptor for the table in
question has prefetch on open set in its schema. For the Spark job, I don't
think this serves any useful purpose, does it? But I can't see any way to
override it. If there is, I'd appreciate some advice.

Thanks.

> Thanks in advance for any insights anyone can provide.
>
> ----
> Saad
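
For anyone wondering what the workaround in item 1 looks like in principle, here is a minimal sketch of a config-gated locality lookup. It is a toy model in plain Java, not the HBase code and not the actual patch: the key name `example.snapshot.locality.enabled`, the `Split` class, and `buildSplits` are all hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy model only: none of these names come from HBase.
final class LocalityToggleSketch {
    // Hypothetical config key, standing in for the made-up setting in item 1.
    static final String LOCALITY_ENABLED_KEY = "example.snapshot.locality.enabled";

    // A split is reduced here to a region name plus its preferred hosts.
    static final class Split {
        final String region;
        final List<String> hosts;

        Split(String region, List<String> hosts) {
            this.region = region;
            this.hosts = hosts;
        }
    }

    // Builds one split per region. The locality lookup (the expensive part,
    // which in the real job meant listing files from the driver) runs only
    // when the flag is on; otherwise each split gets an empty host list.
    static List<Split> buildSplits(List<String> regions,
                                   Map<String, String> conf,
                                   Function<String, List<String>> localityLookup) {
        boolean localityEnabled =
            Boolean.parseBoolean(conf.getOrDefault(LOCALITY_ENABLED_KEY, "true"));
        List<Split> splits = new ArrayList<>();
        for (String region : regions) {
            List<String> hosts = localityEnabled
                ? localityLookup.apply(region)
                : Collections.<String>emptyList();
            splits.add(new Split(region, hosts));
        }
        return splits;
    }
}
```

The flag defaults to true in this sketch so that existing behavior is unchanged unless a job opts out, which seems the safer default when the data may still live on HDFS.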