Hi Faraz,

Answered inline below.
Best,
Hao

On Tue, Feb 12, 2019 at 6:59 AM Faraz Mateen <[email protected]> wrote:

> Hi all,
>
> I am using spark to pull data from my single node testing kudu setup and
> publish it to kafka. However, my query time is not consistent.
>
> I am querying a table with around *1.1 million* packets. Initially my
> query took *537 seconds to read 51042 records* from kudu and write
> them to kafka. This rate was much lower than I had expected. I had
> around 45 tables with little data in them that was no longer needed. I
> deleted all those tables, restarted the spark session and attempted the
> same query. This time the query completed in *5.3 seconds*.
>
> I increased the number of rows to be fetched and tried the same query.
> The row count was *118741*, but it took *1861 seconds* to complete. During
> the query, resource utilization of my servers was very low. When I
> attempted the same query again after a couple of hours, it took only
> *16 seconds*.
>
> After this I kept increasing the number of rows to be fetched, and the
> time kept increasing in a linear fashion.
>
> What I want to ask is:
>
> - How can I debug why the time for these queries varies so much? I am
> not able to get anything out of the Kudu logs.

You can use the tablet server web UI scans dashboard (/scans) to get a better
understanding of ongoing and past queries. The flag 'scan_history_count' is
used to configure the size of the history buffer. From there, you can get
information such as the applied predicates and column stats for the selected
columns.

> - I am running kudu with default configurations. Are there any tweaks I
> should perform to boost the performance of my setup?

Did you notice any compactions in Kudu between issuing the two queries? What
is your ingest pattern? Are you inserting data in random primary key order?

> - Does having a lot of tables cause performance issues?
If no resource limitation is being hit due to writes/scans on the other
tables, they shouldn't affect the performance of your queries. Just FYI, this
is the scaling guide <https://kudu.apache.org/docs/scaling_guide.html> with
respect to various system resources.

> - Will having more masters and tservers improve my query time?

The master is not likely to be the bottleneck, as clients communicate
directly with the tserver for queries once they know which tserver to talk
to. But separating the master and tserver onto different nodes might help.
This is the scale limitation
<https://kudu.apache.org/docs/known_issues.html#_scale> guide for a rough
estimate of the number of tservers required for a given quantity of data.

> *Environment Details:*
>
> - Single node Kudu 1.7 master and tserver. The server has 4 vCPUs and
> 16 GB RAM.
> - The table I am querying is hash partitioned on the basis of an ID
> with 3 buckets. It is also range partitioned on the basis of datetime,
> with a new partition for each month.
> - Kafka version 1.1.
> - Standalone Spark 2.3.0 deployed on a server with 2 vCPUs and 4 GB RAM.
>
> --
> Faraz Mateen
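As a concrete illustration of the /scans dashboard and 'scan_history_count'
flag mentioned above, something like the following could be used. This is a
sketch, not a verified recipe: the host name "tserver-host" is a placeholder,
and the ports shown are the Kudu defaults (8050 for the tserver web UI, 7050
for the tserver RPC endpoint), which may differ in your deployment.

```shell
# Fetch the scans dashboard from the tablet server web UI
# (shows ongoing/recent scans, applied predicates, column stats).
curl http://tserver-host:8050/scans

# Increase the number of completed scans kept in the history buffer.
# Whether this particular flag is runtime-settable may depend on your
# Kudu version; if it is not, the CLI will require --force, or the
# flag can instead be set in the tserver's startup flags.
kudu tserver set_flag tserver-host:7050 scan_history_count 100
```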
