I have created a company-specific branch and added 4 new flags to control
this behavior; these gave us a huge performance boost when running Spark
jobs on snapshots of very large tables in S3. I tried to do everything
cleanly, but:

a) not being familiar with the overall test strategy, I haven't had time to
add any useful tests. Of course I left the default behavior the same, and
much of the behavior I control with these flags affects only performance,
not the final result, so I would need some pointers on how to add useful
tests
b) I added a new flag that acts as an overall override for prefetch
behavior, taking precedence over any other setting, even the one in the
column family descriptor (sketched just below). I am not sure what I did
was entirely in the spirit of how HBase does things.
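
To make b) concrete, from the job side it looks roughly like this (the
property name is one I made up for our branch, not a stock HBase setting):

    // hypothetical flag from our branch: when set, it wins over both
    // hbase.rs.prefetchblocksonopen and the column family descriptor
    hBaseConf.setBoolean("custom.prefetch.blocks.on.open.override", false)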

Again, these flags, if used properly, would only impact jobs that use
TableSnapshotInputFormat in their Spark or MapReduce jobs. Would someone
from the core team be willing to look at my patch? I have never done this
before, so I would appreciate a quick pointer on how to send a patch and
get some quick feedback.

Cheers.

----
Saad



On Sat, Mar 10, 2018 at 9:56 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

> The question remains, though, of why it is even accessing a column family's
> files that should be excluded based on the Scan. And that column family
> does NOT specify prefetch on open in its schema. Only the one we want to
> read specifies prefetch on open, which we want to override if possible for
> the Spark job.
>
> ----
> Saad
>
>
> On Sat, Mar 10, 2018 at 9:51 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>
>> See below for more I found on item 3.
>>
>> Cheers.
>>
>> ----
>> Saad
>>
>> On Sat, Mar 10, 2018 at 7:17 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am running a Spark job (Spark 2.2.1) on an EMR cluster in AWS. There
>>> is no HBase installed on the cluster, only HBase libs linked into my Spark
>>> app. We read the snapshot info from an HBase folder in S3 using the
>>> TableSnapshotInputFormat class from HBase 1.4.0, so that the Spark job
>>> reads the snapshot directly from the S3-based filesystem instead of going
>>> through any region server.
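>>>
>>> For reference, the job setup looks roughly like this (simplified; the
>>> bucket, snapshot name, and restore dir are placeholders, and sc is the
>>> SparkContext):
>>>
>>> import org.apache.hadoop.fs.Path
>>> import org.apache.hadoop.hbase.HBaseConfiguration
>>> import org.apache.hadoop.hbase.client.Result
>>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable
>>> import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat
>>> import org.apache.hadoop.mapreduce.Job
>>>
>>> val hBaseConf = HBaseConfiguration.create()
>>> hBaseConf.set("hbase.rootdir", "s3://my-bucket/hbase")
>>> val job = Job.getInstance(hBaseConf)
>>> // restoreDir holds the temporary restored snapshot references and must
>>> // live outside hbase.rootdir
>>> TableSnapshotInputFormat.setInput(job, "my_snapshot",
>>>   new Path("s3://my-bucket/snapshot-restore"))
>>> val rdd = sc.newAPIHadoopRDD(job.getConfiguration,
>>>   classOf[TableSnapshotInputFormat],
>>>   classOf[ImmutableBytesWritable],
>>>   classOf[Result])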
>>>
>>> While debugging performance I have observed a few concerning behaviors;
>>> some we could mitigate, and on others I am looking for clarity:
>>>
>>> 1) The TableSnapshotInputFormatImpl code tries to get locality
>>> information for the region splits; for a snapshot with a large number of
>>> files (over 350,000 in our case) this causes a scan of all the file
>>> listings in a single thread in the driver. And it was useless, because
>>> there is really no locality information to glean when all the files are
>>> in S3 rather than HDFS. So I was forced to make a copy of
>>> TableSnapshotInputFormatImpl.java in our code and control this with a
>>> config setting I made up (sketched below). That got rid of the hours-long
>>> scan, so I am good with this part for now.
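>>>
>>> From the job side the control is just a boolean (the property name is our
>>> own invention; stock HBase 1.4.0 does not understand it):
>>>
>>> // made-up property consulted by our forked TableSnapshotInputFormatImpl;
>>> // when false, we skip the file listings used to compute split locality
>>> hBaseConf.setBoolean("custom.tablesnapshotinputformat.locality.enabled",
>>>   false)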
>>>
>>> 2) I have set a single column family in the Scan, which I set on the
>>> HBase configuration via
>>>
>>> scan.addFamily(str.getBytes())  // str names the single family we want
>>>
>>> hBaseConf.set(TableInputFormat.SCAN, convertScanToString(scan))  // TableMapReduceUtil helper
>>>
>>>
>>> But when this code executes under Spark and I observe the threads and
>>> logs on the Spark executors, I see it reading S3 files for a column
>>> family that was not included in the Scan. This column family was
>>> intentionally excluded because it is much larger than the others, and we
>>> wanted to avoid that cost.
>>>
>>> Any advice on what I am doing wrong would be appreciated.
>>>
>>> 3) We also explicitly set block caching to false on the Scan (shown
>>> below), although I see that TableSnapshotInputFormatImpl.java also sets
>>> it to false internally. But when running the Spark job, some executors
>>> were taking much longer than others, and when I observed their threads, I
>>> saw periodic messages about a few hundred megs of RAM used by the block
>>> cache; the thread sits there reading data from S3 and is occasionally
>>> blocked by a couple of other threads that have "hfile-prefetcher" in
>>> their names. Going back to 2) above, they seem to be reading the wrong
>>> column family, but in this item I am more concerned about why they appear
>>> to be prefetching and caching blocks when the Scan object is explicitly
>>> set to not cache blocks at all.
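>>>
>>> For clarity, the setting I mean is the standard Scan API:
>>>
>>> // ask HBase not to populate the block cache with blocks this scan reads
>>> scan.setCacheBlocks(false)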
>>>
>>
>> I think I figured out item 3: the column family descriptor for the table
>> in question has prefetch on open set in its schema. Now for the Spark job,
>> I don't think this serves any useful purpose, does it? But I can't see any
>> way to override it. If there is one, I'd appreciate some advice.
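>>
>> For context, the schema flag I am referring to is the one set via the
>> standard descriptor API:
>>
>> // in the table's schema; this is what spawns the "hfile-prefetcher" threads
>> columnDescriptor.setPrefetchBlocksOnOpen(true)
>>
>> and from my reading of CacheConfig, the global
>> hbase.rs.prefetchblocksonopen property is OR'ed with the descriptor flag,
>> so it can force prefetching on everywhere but, as far as I can tell,
>> cannot turn it off for a family that enables it in its schema.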
>>
>
>> Thanks.
>>
>>
>>>
>>> Thanks in advance for any insights anyone can provide.
>>>
>>> ----
>>> Saad
>>>
>>>
>>
>>
>
