thanks for your suggestions, J-D, I am sure you are right more often than that! :))
I will report back with our results. So far I am really impressed with Kudu - we have been benchmarking ingest and egress throughput and our typical queries runtime. The biggest pain so far is lack of support for decimals On Wed, Dec 13, 2017 at 5:07 PM, Jean-Daniel Cryans <[email protected]> wrote: > On Wed, Dec 13, 2017 at 11:30 AM, Boris Tyukin <[email protected]> > wrote: > >> thanks J-D! we are going to try that and see how it impacts the runtime. >> >> is there any way to load this metadata upfront? a lot of our queries are >> adhoc in nature but they will be hitting the same tables with different >> predicates and join patterns though. >> > > You could use Impala to compute all the stats of all the tables after each > Kudu restart. Actually, do try that, restart Kudu then compute stats and > see how fast it scans. > > >> >> I am curious why this metadata does not survive restarts though. We are >> going to run our benchmarks again and this time restart Kudu and Impala. >> > > It's in the tserver memory, it can't survive a restart. > > >> >> I just ran another query first time which hits 2 large tables and these >> tables have been scanned by the previous query and this time I do not see >> any difference in query time before the first and second time - I guess >> this confirms your statement about " first time ever scanning the table >> since a Kudu restart" and collecting metadata. >> > > Maybe, I've been known to be right once or twice a year :) > > >> >> >> On Wed, Dec 13, 2017 at 11:18 AM, Jean-Daniel Cryans <[email protected] >> > wrote: >> >>> Hi Boris, >>> >>> Given that we don't have much data we can use here, I'll have to >>> extrapolate. As an aside though, this is yet another example where we need >>> more Kudu-side metrics in the query profile. >>> >>> So, Kudu lazily loads a bunch of metadata and that can really affect >>> scan times. If this was your first time ever scanning the table since a >>> Kudu restart, it's very possible that that's where that time was spent. >>> There's also the page cache in the OS that might now be populated. You >>> could do something like "sync; echo 3 > /proc/sys/vm/drop_caches" on all >>> the machines and run the query 2 times again, without restarting Kudu, to >>> understand the effect of the page cache itself. There's currently now way >>> to purge the cached metadata in Kudu though. >>> >>> Hope this helps a bit, >>> >>> J-D >>> >>> On Wed, Dec 13, 2017 at 8:07 AM, Boris Tyukin <[email protected]> >>> wrote: >>> >>>> Hi guys, >>>> >>>> I am doing some benchmarks with Kudu and Impala/Parquet and hope to >>>> share it soon but there is one thing that bugs me. This is perhaps Impala >>>> question but since I am using Kudu with Impala I am going to try and ask >>>> anyway. >>>> >>>> One of my queries takes 120 seconds to run the very first time. It >>>> joins one large 5B row table with a bunch of smaller tables and then stores >>>> result in Impala/parquet (not Kudu). >>>> >>>> Now if I run it second and third time, it only takes 60 seconds. Can >>>> someone explain why? Is there any settings to decrease this gap? >>>> >>>> I've compared query profiles in CM and the only thing that was very >>>> different is scan against Kudu table (the large one): >>>> >>>> *************************** >>>> first time: >>>> *************************** >>>> KUDU_SCAN_NODE (id=0) (47.68s) >>>> <https://lkmaorabd103.multihosp.net:7183/cmf/impala/queryDetails?queryId=5143f7165be82819%3Ae00a103500000000&serviceName=impala#> >>>> >>>> >>>> >>>> - BytesRead: *0 B* >>>> - InactiveTotalTime: *0ns* >>>> - KuduRemoteScanTokens: *0* >>>> - NumScannerThreadsStarted: *20* >>>> - PeakMemoryUsage: *35.8 MiB* >>>> - RowsRead: *693,502,241* >>>> - RowsReturned: *693,502,241* >>>> - RowsReturnedRate: *14643448 per second* >>>> - ScanRangesComplete: *20* >>>> - ScannerThreadsInvoluntaryContextSwitches: *1,341* >>>> - ScannerThreadsTotalWallClockTime: *36.2m* >>>> - MaterializeTupleTime(*): *47.57s* >>>> - ScannerThreadsSysTime: *31.42s* >>>> - ScannerThreadsUserTime: *1.7m* >>>> - ScannerThreadsVoluntaryContextSwitches: *96,855* >>>> - TotalKuduScanRoundTrips: *52,308* >>>> - TotalReadThroughput: *0 B/s* >>>> - TotalTime: *47.68s* >>>> >>>> >>>> *************************** >>>> second time: >>>> *************************** >>>> KUDU_SCAN_NODE (id=0) (4.28s) >>>> <https://lkmaorabd103.multihosp.net:7183/cmf/impala/queryDetails?queryId=53497a308f860837%3A243772e000000000&serviceName=impala#> >>>> >>>> >>>> >>>> - BytesRead: *0 B* >>>> - InactiveTotalTime: *0ns* >>>> - KuduRemoteScanTokens: *0* >>>> - NumScannerThreadsStarted: *20* >>>> - PeakMemoryUsage: *37.9 MiB* >>>> - RowsRead: *693,502,241* >>>> - RowsReturned: *693,502,241* >>>> - RowsReturnedRate: *173481534 per second* >>>> - ScanRangesComplete: *20* >>>> - ScannerThreadsInvoluntaryContextSwitches: *1,451* >>>> - ScannerThreadsTotalWallClockTime: *19.5m* >>>> - MaterializeTupleTime(*): *4.20s* >>>> - ScannerThreadsSysTime: *38.22s* >>>> - ScannerThreadsUserTime: *1.7m* >>>> - ScannerThreadsVoluntaryContextSwitches: *480,870* >>>> - TotalKuduScanRoundTrips: *52,142* >>>> - TotalReadThroughput: *0 B/s* >>>> - TotalTime: *4.28s* >>>> >>>> >>>> >>>> >>>> >>> >> >
