Hao,

The order of my primary key is (ID, datetime). My query had a 'WHERE' clause on both of these keys. How exactly does the order affect scan performance?
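To make the question concrete, here is a minimal sketch of the kind of scan I mean, using the Kudu Java client from Scala. The table name, column names, master address, and predicate values below are placeholders, not my real ones. My understanding is that with a (ID, datetime) key, the equality predicate on the leading column ID plus the datetime range can be turned into tight primary key bounds, whereas with (datetime, ID) only the datetime range bounds the scan and the ID predicate is applied as a filter inside it. Is that right?

    import org.apache.kudu.client.{KuduClient, KuduPredicate}
    import org.apache.kudu.client.KuduPredicate.ComparisonOp

    object ScanSketch extends App {
      // Placeholder master address and table name.
      val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
      val table  = client.openTable("packets")
      val schema = table.getSchema

      // Equality on ID plus a range on datetime (both key columns).
      val scanner = client.newScannerBuilder(table)
        .addPredicate(KuduPredicate.newComparisonPredicate(
          schema.getColumn("id"), ComparisonOp.EQUAL, 7L))
        .addPredicate(KuduPredicate.newComparisonPredicate(
          schema.getColumn("datetime"), ComparisonOp.GREATER_EQUAL,
          1548979200000000L))  // 2019-02-01 00:00:00 UTC in micros (example)
        .addPredicate(KuduPredicate.newComparisonPredicate(
          schema.getColumn("datetime"), ComparisonOp.LESS,
          1551398400000000L))  // 2019-03-01 00:00:00 UTC in micros (example)
        .build()

      var rows = 0L
      while (scanner.hasMoreRows) {
        val batch = scanner.nextRows()
        while (batch.hasNext) { batch.next(); rows += 1 }
      }
      println(s"matched $rows rows")
      client.shutdown()
    }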
I think restarting the tablet server removed all previous records from the scans dashboard. I can't find any query that took too long to complete.

On Thu, Feb 14, 2019 at 4:31 AM Hao Hao <[email protected]> wrote:

> Hi Faraz,
>
> What is the order of your primary key? Is it (datetime, ID) or (ID, datetime)?
>
> On the contrary, I suspect your scan performance got better for the same query because a compaction happened in between, and thus there were fewer blocks to scan. Also, would you mind sharing a screenshot of the tablet server web UI page from when your scans took place (to compare the 'good' and 'bad' scans)?
>
> Best,
> Hao
>
> On Wed, Feb 13, 2019 at 9:37 AM Faraz Mateen <[email protected]> wrote:
>
>> By "not noticing any compaction" I meant I did not see any visible change in disk space. However, the logs show that there were some compaction-related operations happening during this whole time period. These statements appeared multiple times in the tserver logs:
>>
>> W0211 13:44:10.991221 15822 tablet.cc:1679] T 00b8818d0713485b83982ac56d9e342a P 7b44fc5229fe43e190d4d6c1e8022988: Can't schedule compaction. Clean time has not been advanced past its initial value.
>> ...
>> ...
>> I0211 14:36:33.883819 15822 maintenance_manager.cc:302] P 7b44fc5229fe43e190d4d6c1e8022988: Scheduling MajorDeltaCompactionOp(30c9aaadcb13460fab832bdea1104349): perf score=0.106957
>> I0211 14:36:33.884233 13179 diskrowset.cc:560] T 30c9aaadcb13460fab832bdea1104349 P 7b44fc5229fe43e190d4d6c1e8022988: RowSet(3080): Major compacting REDO delta stores (cols: 2 3 4 5 6 7 9 10 11 13 14 15 16 20 22 29 31 33 36 38 39 41 42 47 49 51 52 56 57 58 64 67 68 71 75 77 78 79 80 81 109 128 137)
>>
>> Does compaction affect scan performance? And if it does, what can I do to limit this degradation?
>>
>> On Wed, Feb 13, 2019 at 7:24 PM Faraz Mateen <[email protected]> wrote:
>>
>>> Thanks a lot for the help, Hao.
>>>
>>> Responses inline:
>>>
>>>> You can use the tablet server web UI scans dashboard (/scans) to get a better understanding of ongoing/past queries. The flag 'scan_history_count' is used to configure the size of the buffer. From there, you can get information such as the applied predicates and column stats for the selected columns.
>>>
>>> Thanks. I did not know about this.
>>>
>>>> Did you notice any compactions in Kudu between when you issued the two queries? What is your ingest pattern, are you inserting data in random primary key order?
>>>
>>> The table has hash partitioning on an ID column that can have 15 different values and range partitioning on datetime, split monthly. Both ID and datetime are my primary keys. The data we ingest is in increasing order of time (usually), but the order of IDs is random.
>>>
>>> However, ingestion into Kudu was stopped while I was performing these queries. I did not notice any compaction either.
>>>
>>> On Wed, Feb 13, 2019 at 2:15 AM Hao Hao <[email protected]> wrote:
>>>
>>>> Hi Faraz,
>>>>
>>>> Answered inline below.
>>>>
>>>> Best,
>>>> Hao
>>>>
>>>> On Tue, Feb 12, 2019 at 6:59 AM Faraz Mateen <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am using Spark to pull data from my single-node testing Kudu setup and publish it to Kafka. However, my query time is not consistent.
>>>>>
>>>>> I am querying a table with around *1.1 million* packets.
>>>>> Initially my query was taking *537 seconds to read 51042 records* from Kudu and write them to Kafka. This rate was much lower than what I had expected. I had around 45 tables with little data in them that was not needed anymore. I deleted all those tables, restarted the Spark session, and attempted the same query. Now the query completed in *5.3 seconds*.
>>>>>
>>>>> I increased the number of rows to be fetched and tried the same query. The row count was *118741* but it took *1861 seconds* to complete. During the query, resource utilization of my servers was very low. When I attempted the same query again after a couple of hours, it took only *16 secs*.
>>>>>
>>>>> After this I kept increasing the number of rows to be fetched, and the time kept increasing in a linear fashion.
>>>>>
>>>>> What I want to ask is:
>>>>>
>>>>>    - How can I debug why the time for these queries is varying so much? I am not able to get anything out of the Kudu logs.
>>>>>
>>>> You can use the tablet server web UI scans dashboard (/scans) to get a better understanding of ongoing/past queries. The flag 'scan_history_count' is used to configure the size of the buffer. From there, you can get information such as the applied predicates and column stats for the selected columns.
>>>>
>>>>>    - I am running Kudu with default configurations. Are there any tweaks I should perform to boost the performance of my setup?
>>>>>
>>>> Did you notice any compactions in Kudu between when you issued the two queries? What is your ingest pattern, are you inserting data in random primary key order?
>>>>
>>>>>    - Does having a lot of tables cause performance issues?
>>>>>
>>>> If you are not hitting any resource limitation due to writes/scans to the other tables, they shouldn't affect the performance of your queries. Just FYI, this is the scaling guide <https://kudu.apache.org/docs/scaling_guide.html> with respect to various system resources.
>>>>
>>>>>    - Will having more masters and tservers improve my query time?
>>>>>
>>>> The master is not likely to be the bottleneck, as clients communicate directly with the tservers for queries once they know which tserver to talk to. But separating the master and tserver so they are not on the same node might help. This is the scale limitations <https://kudu.apache.org/docs/known_issues.html#_scale> guide for a rough estimate of the number of tservers required for a given quantity of data.
>>>>
>>>>> *Environment Details:*
>>>>>
>>>>>    - Single-node Kudu 1.7 master and tserver. The server has 4 vCPUs and 16 GB RAM.
>>>>>    - The table I am querying is hash partitioned on the basis of an ID with 3 buckets. It is also range partitioned on the basis of datetime, with a new partition for each month.
>>>>>    - Kafka version 1.1.
>>>>>    - Standalone Spark 2.3.0 deployed on a server with 2 vCPUs and 4 GB RAM.
>>>>>
>>>>> --
>>>>> Faraz Mateen
>>>>>
>>>
>>> --
>>> Faraz Mateen
>>>
>>
>> --
>> Faraz Mateen
>>

--
Faraz Mateen
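For reference, a table with the layout described in this thread (primary key (ID, datetime), hash partitioning on ID with 3 buckets, monthly range partitions on datetime) could be created through the Kudu Java client roughly as in the sketch below. This is only an illustration: the column names, types, the extra value column, and the example month bounds are assumptions, not taken from the actual schema.

    import java.util.Arrays
    import org.apache.kudu.{ColumnSchema, Schema, Type}
    import org.apache.kudu.client.{CreateTableOptions, KuduClient}

    object CreateTableSketch extends App {
      val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()

      // Key order is (id, datetime): id is the leading key column.
      val schema = new Schema(Arrays.asList(
        new ColumnSchema.ColumnSchemaBuilder("id", Type.INT32).key(true).build(),
        new ColumnSchema.ColumnSchemaBuilder("datetime", Type.UNIXTIME_MICROS).key(true).build(),
        new ColumnSchema.ColumnSchemaBuilder("payload", Type.STRING).build()))

      val opts = new CreateTableOptions()
        .addHashPartitions(Arrays.asList("id"), 3)            // 3 hash buckets on id
        .setRangePartitionColumns(Arrays.asList("datetime"))  // range partitions on datetime

      // One range partition per month; January 2019 shown as an example.
      val lower = schema.newPartialRow()
      lower.addLong("datetime", 1546300800000000L)  // 2019-01-01 00:00:00 UTC in micros
      val upper = schema.newPartialRow()
      upper.addLong("datetime", 1548979200000000L)  // 2019-02-01 00:00:00 UTC in micros
      opts.addRangePartition(lower, upper)

      client.createTable("packets", schema, opts)
      client.shutdown()
    }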

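And the Spark job described in the original message (read a slice of the table from Kudu, publish it to Kafka) corresponds roughly to the sketch below. Again, this is only an assumed shape: the table name, topic, broker address, and filter values are made up, and it presumes the kudu-spark and spark-sql-kafka-0-10 packages are on the classpath.

    import java.sql.Timestamp
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, struct, to_json}

    object KuduToKafkaSketch extends App {
      val spark = SparkSession.builder().appName("kudu-to-kafka").getOrCreate()

      // Read the Kudu table; comparison filters are pushed down to Kudu as scan predicates where possible.
      val df = spark.read
        .format("org.apache.kudu.spark.kudu")
        .option("kudu.master", "kudu-master:7051")
        .option("kudu.table", "packets")
        .load()
        .filter(col("id") === 7 &&
                col("datetime") >= Timestamp.valueOf("2019-01-01 00:00:00") &&
                col("datetime") <  Timestamp.valueOf("2019-02-01 00:00:00"))

      // Publish each row as a JSON message to Kafka (batch write, Spark 2.3).
      df.select(to_json(struct(df.columns.map(col): _*)).alias("value"))
        .write
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("topic", "packets-out")
        .save()

      spark.stop()
    }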