Re: Inconsistent read performance with Spark

2019-02-14 Thread Hao Hao
Hi Faraz, Yes, the order can help with both write and scan performance in your case. When the inserts are random (as you said, the order of IDs is random), there will be many rowsets that overlap in primary key bounds, which the maintenance manager needs to allocate resources to compact. And you will
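[Editor's note: a toy simulation of the rowset-overlap effect described above. This is not Kudu's actual flush/compaction code; it simply models rowsets as fixed-size chunks flushed in arrival order and counts pairs whose primary-key bounds overlap. All names and sizes are made up for illustration.]

```python
import random

def rowset_bounds(keys, flush_size):
    """Flush keys to 'rowsets' of flush_size rows in arrival order,
    recording each rowset's (min, max) primary-key bounds."""
    bounds = []
    for i in range(0, len(keys), flush_size):
        chunk = keys[i:i + flush_size]
        bounds.append((min(chunk), max(chunk)))
    return bounds

def count_overlaps(bounds):
    """Count pairs of rowsets whose key ranges overlap."""
    n = 0
    for i in range(len(bounds)):
        for j in range(i + 1, len(bounds)):
            lo1, hi1 = bounds[i]
            lo2, hi2 = bounds[j]
            if lo1 <= hi2 and lo2 <= hi1:
                n += 1
    return n

random.seed(42)
sequential = list(range(1000))
shuffled = sequential[:]
random.shuffle(shuffled)

# Sorted arrival order -> disjoint rowsets, nothing for compaction to merge.
print(count_overlaps(rowset_bounds(sequential, 100)))  # 0
# Random arrival order -> nearly every rowset overlaps every other one,
# so the maintenance manager has real compaction work to schedule.
print(count_overlaps(rowset_bounds(shuffled, 100)))
```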

Re: Inconsistent read performance with Spark

2019-02-14 Thread Faraz Mateen
Hao, The order of my primary key is (ID, datetime). My query had a 'WHERE' clause on both of these keys. How exactly does the order affect scan performance? I think restarting the tablet server removed all previous records from the scans dashboard. I can't find any query that took too long to complete. On
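[Editor's note: a rough intuition for the question above, sketched with a sorted Python list standing in for a tablet's primary-key-ordered rows. The table shape (50 IDs x 200 timestamps) and the query values are hypothetical; the point is that with key order (ID, datetime), equality on the leading ID makes the predicate one contiguous key range, while with (datetime, ID) the same predicate forces a scan of the whole timestamp range.]

```python
import bisect

# Toy table: 50 IDs x 200 timestamps, stored sorted by primary key.
rows = [(i, t) for i in range(50) for t in range(200)]

by_id_ts = sorted(rows)                      # key order (ID, datetime)
by_ts_id = sorted((t, i) for i, t in rows)   # key order (datetime, ID)

def rows_examined(sorted_rows, lo_key, hi_key):
    """Rows a contiguous key-range scan between lo_key and hi_key reads."""
    lo = bisect.bisect_left(sorted_rows, lo_key)
    hi = bisect.bisect_left(sorted_rows, hi_key)
    return hi - lo

# Query: WHERE id = 7 AND 50 <= ts < 150
# (ID, datetime): equality on the leading column seeks to one tight slice.
print(rows_examined(by_id_ts, (7, 50), (7, 150)))   # 100

# (datetime, ID): the ID is the trailing column, so the scan must read
# every row in the timestamp range and filter out other IDs afterwards.
print(rows_examined(by_ts_id, (50, 0), (150, 0)))   # 5000
```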

Re: Inconsistent read performance with Spark

2019-02-13 Thread Hao Hao
Hi Faraz, What is the order of your primary key? Is it (datetime, ID) or (ID, datetime)? On the contrary, I suspect your scan performance got better for the same query because a compaction happened in between, and thus there were fewer blocks to scan. Also, would you mind sharing a screenshot of

Re: Inconsistent read performance with Spark

2019-02-13 Thread Faraz Mateen
Thanks a lot for the help, Hao. Responses inline: > You can use the tablet server web UI scans dashboard (/scans) to get a better understanding of the ongoing/past queries. The flag 'scan_history_count' is used to configure the size of the buffer. From there, you can get information such as the
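[Editor's note: for reference, the scans dashboard mentioned above is served by the tablet server's embedded web UI, and the history size is a tablet server gflag. A configuration sketch, assuming the default tablet server web UI port 8050 and a hypothetical host name:]

```shell
# Tablet server gflag: keep the last 50 completed scans in the
# /scans dashboard's history buffer (default is smaller).
--scan_history_count=50

# Then inspect ongoing and recent scans from the web UI, e.g.:
curl http://tserver-host:8050/scans
```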

Re: Inconsistent read performance with Spark

2019-02-12 Thread Hao Hao
Hi Faraz, Answered inline below. Best, Hao On Tue, Feb 12, 2019 at 6:59 AM Faraz Mateen wrote: > Hi all, > I am using Spark to pull data from my single-node test Kudu setup and publish it to Kafka. However, my query time is not consistent. > I am querying a table with around *1.1

Inconsistent read performance with Spark

2019-02-12 Thread Faraz Mateen
Hi all, I am using Spark to pull data from my single-node test Kudu setup and publish it to Kafka. However, my query time is not consistent. I am querying a table with around *1.1 million* packets. Initially my query was taking *537 seconds to read 51042 records* from Kudu and write them to
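[Editor's note: a minimal PySpark sketch of the pipeline described above, for orientation only. It assumes the kudu-spark connector jar is on the classpath; the Kudu master address, table name, column names, and Kafka settings are all placeholders, not values from the thread. It requires a live Kudu and Kafka deployment to run.]

```python
from pyspark.sql import SparkSession

# Hypothetical endpoints -- replace with your own.
KUDU_MASTER = "kudu-master:7051"
KUDU_TABLE = "packets"
KAFKA_SERVERS = "kafka-broker:9092"

spark = (SparkSession.builder
         .appName("kudu-to-kafka")
         .getOrCreate())

# Read the Kudu table. Predicates on the leading primary-key column(s)
# are pushed down to Kudu, which is where the key order discussed in
# this thread decides how much of the tablet actually gets scanned.
df = (spark.read.format("org.apache.kudu.spark.kudu")
      .option("kudu.master", KUDU_MASTER)
      .option("kudu.table", KUDU_TABLE)
      .load()
      .filter("id = 7 AND datetime >= '2019-02-01'"))

# Kafka's batch sink expects string/binary 'key' and 'value' columns.
(df.selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
   .write.format("kafka")
   .option("kafka.bootstrap.servers", KAFKA_SERVERS)
   .option("topic", "packets")
   .save())
```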