Thanks a lot for the help, Hao.

Response Inline:

> You can use the tablet server web UI scans dashboard (/scans) to get a
> better understanding of ongoing/past queries. The flag 'scan_history_count'
> is used to configure the size of the buffer. From there, you can get
> information such as the applied predicates and column stats for the
> selected columns.


Thanks. I did not know about this.
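
In case it is useful to anyone else, here is a minimal sketch of pulling that
page from a script so the scan history can be saved off while I reproduce the
slow query (the host name and the default tserver web UI port 8050 are
assumptions for my setup):

import scala.io.Source

object DumpScans {
  def main(args: Array[String]): Unit = {
    // The tserver would be started with a larger --scan_history_count to keep
    // more scan history in the buffer, per the note above.
    // Host and port below are placeholders; 8050 is the default tserver web UI port.
    val page = Source.fromURL("http://kudu-tserver:8050/scans").mkString
    println(page)
  }
}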

> Did you notice any compactions in Kudu between when you issued the two
> queries? What is your ingest pattern? Are you inserting data in random
> primary key order?


The table has hash partitioning on an ID column that can take 15 different
values, and range partitioning on datetime, split monthly. ID and datetime
together form the primary key. The data we ingest is usually in increasing
order of time, but the order of IDs is random.
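
For reference, the layout is roughly the following (a minimal sketch using the
Kudu Java client from Scala; the table and column names, the master address,
and the single example month are placeholders, and it uses the 3 hash buckets
mentioned in my environment details below):

import java.util.Arrays.asList

import org.apache.kudu.{ColumnSchema, Schema, Type}
import org.apache.kudu.client.{CreateTableOptions, KuduClient}

object CreatePartitionedTable {
  def main(args: Array[String]): Unit = {
    val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()

    // Primary key is (id, datetime), matching the layout described above.
    val schema = new Schema(asList(
      new ColumnSchema.ColumnSchemaBuilder("id", Type.INT32).key(true).build(),
      new ColumnSchema.ColumnSchemaBuilder("datetime", Type.UNIXTIME_MICROS).key(true).build(),
      new ColumnSchema.ColumnSchemaBuilder("payload", Type.STRING).build()))

    val opts = new CreateTableOptions()
      .addHashPartitions(asList("id"), 3)            // hash on the ID column
      .setRangePartitionColumns(asList("datetime"))  // monthly range partitions

    // One range partition per month; January 2019 shown as an example
    // (bounds are epoch microseconds, upper bound exclusive).
    val lower = schema.newPartialRow()
    lower.addLong("datetime", 1546300800L * 1000000L) // 2019-01-01 00:00:00 UTC
    val upper = schema.newPartialRow()
    upper.addLong("datetime", 1548979200L * 1000000L) // 2019-02-01 00:00:00 UTC
    opts.addRangePartition(lower, upper)

    client.createTable("events", schema, opts)
    client.shutdown()
  }
}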

However, ingestion into Kudu was stopped while I was performing these
queries, and I did not notice any compactions either.
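
For completeness, the query path itself is just a kudu-spark scan followed by
a batch write to Kafka, roughly like this (a sketch only; the table name,
master address, Kafka broker, and topic are placeholders, and it assumes the
kudu-spark2 and spark-sql-kafka-0-10 packages are on the classpath):

import org.apache.spark.sql.SparkSession

object KuduToKafka {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kudu-to-kafka").getOrCreate()

    // Read the Kudu table as a DataFrame through the kudu-spark connector.
    val df = spark.read
      .format("org.apache.kudu.spark.kudu")
      .option("kudu.master", "kudu-master:7051")
      .option("kudu.table", "events")
      .load()

    // Example predicate on the range-partition column; whether it actually
    // gets pushed down to Kudu can be checked on the /scans page.
    val window = df.filter("datetime >= '2019-01-01' AND datetime < '2019-02-01'")

    // Serialize each row as JSON and publish it to a Kafka topic
    // (batch write via the Kafka sink, available in Spark 2.3).
    window
      .selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-broker:9092")
      .option("topic", "events")
      .save()

    spark.stop()
  }
}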

On Wed, Feb 13, 2019 at 2:15 AM Hao Hao <hao....@cloudera.com> wrote:

> Hi Faraz,
>
> Answered inline below.
>
> Best,
> Hao
>
> On Tue, Feb 12, 2019 at 6:59 AM Faraz Mateen <fmat...@an10.io> wrote:
>
>> Hi all,
>>
>> I am using Spark to pull data from my single-node test Kudu setup and
>> publish it to Kafka. However, my query time is not consistent.
>>
>> I am querying a table with around *1.1 million* packets. Initially my
>> query was taking *537 seconds to read 51042 records* from Kudu and write
>> them to Kafka. This rate was much lower than what I had expected. I had
>> around 45 tables with little data in them that was not needed anymore. I
>> deleted all those tables, restarted the Spark session, and attempted the
>> same query. Now the query completed in *5.3 seconds*.
>>
>> I increased the number of rows to be fetched and tried the same query.
>> The row count was *118741*, but it took *1861 seconds* to complete. During
>> the query, resource utilization of my servers was very low. When I
>> attempted the same query again after a couple of hours, it took only
>> *16 secs*.
>>
>> After this, I kept increasing the number of rows to be fetched, and the
>> time kept increasing in a linear fashion.
>>
>> What I want to ask is:
>>
>>    - How can I debug why the time for these queries is varying so much?
>>    I am not able to get anything out of Kudu logs.
>>
> You can use the tablet server web UI scans dashboard (/scans) to get a
> better understanding of ongoing/past queries. The flag 'scan_history_count'
> is used to configure the size of the buffer. From there, you can get
> information such as the applied predicates and column stats for the
> selected columns.
>
>
>>
>>    - I am running Kudu with default configurations. Are there any tweaks
>>    I should perform to boost the performance of my setup?
>>
> Did you notice any compactions in Kudu between when you issued the two
> queries? What is your ingest pattern? Are you inserting data in random
> primary key order?
>
>>
>>    - Does having a lot of tables cause performance issues?
>>
> If you are not hitting resource limits due to writes/scans on the other
> tables, they shouldn't affect the performance of your queries. Just FYI,
> this is the scaling guide
> <https://kudu.apache.org/docs/scaling_guide.html> with respect to various
> system resources.
>
>>
>>    - Will having more masters and tservers improve my query time?
>>
> The master is not likely to be the bottleneck, as clients communicate
> directly with the tserver for queries once they know which tserver to talk
> to. But separating the master and tserver so they are not on the same node
> might help. This is the scale limitations
> <https://kudu.apache.org/docs/known_issues.html#_scale> guide for a rough
> estimate of the number of tservers required for a given quantity of data.
>
>> *Environment Details:*
>>
>>    - Single node Kudu 1.7 master and tserver. Server has 4 vCPUs and 16
>>    GB RAM.
>>    - The table that I am querying is hash partitioned on an ID column
>>    with 3 buckets. It is also range partitioned on datetime, with
>>    a new partition for each month.
>>    - Kafka version 1.1.
>>    - Standalone Spark 2.3.0 deployed on a server with 2 vCPUs and 4 GB RAM.
>>
>> --
>> Faraz Mateen
>>
>

-- 
Faraz Mateen
