I'm more thinking in terms of the startup IO having some impact on the co-located services, but we really need to know what "went down" means.
On Sat, Dec 16, 2017 at 12:50 PM, Boris Tyukin <bo...@boristyukin.com> wrote:

yep, it is really weird since Kudu uses neither one. I'll get with him on Monday to gather more details.

On Sat, Dec 16, 2017 at 3:28 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:

Hi Boris,

How exactly did HDFS and ZK go down? A Kudu restart is fairly IO-intensive, but I don't know how that could cause things like DataNodes to fail.

J-D

On Sat, Dec 16, 2017 at 11:45 AM, Boris Tyukin <bo...@boristyukin.com> wrote:

well, our admin had fun for two days - it was the first time we restarted Kudu on our DEV cluster and it did not go well. He is still troubleshooting what happened, but after the Kudu restart, ZooKeeper and HDFS went down within 3-4 minutes. If we disable Kudu, all is well. No errors in the Kudu logs... I will have more details next week, so I'm not asking for help yet as I do not know all the details. What is obvious, though, is that it has something to do with Kudu :)

On Thu, Dec 14, 2017 at 9:40 AM, Boris Tyukin <bo...@boristyukin.com> wrote:

thanks for your suggestions, J-D, I am sure you are right more often than that! :))

I will report back with our results. So far I am really impressed with Kudu - we have been benchmarking ingest and egress throughput and our typical query runtimes. The biggest pain so far is the lack of support for decimals.

On Wed, Dec 13, 2017 at 5:07 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:

On Wed, Dec 13, 2017 at 11:30 AM, Boris Tyukin <bo...@boristyukin.com> wrote:

> thanks J-D! we are going to try that and see how it impacts the runtime.
>
> is there any way to load this metadata upfront? a lot of our queries are
> adhoc in nature, but they will be hitting the same tables with different
> predicates and join patterns though.
You could use Impala to compute all the stats of all the tables after each Kudu restart. Actually, do try that: restart Kudu, then compute stats and see how fast it scans.

> I am curious why this metadata does not survive restarts though. We are
> going to run our benchmarks again and this time restart Kudu and Impala.

It's in the tserver memory, it can't survive a restart.

> I just ran another query for the first time which hits 2 large tables, and
> these tables had been scanned by the previous query. This time I do not see
> any difference in query time between the first and second runs - I guess
> this confirms your statement about "first time ever scanning the table
> since a Kudu restart" and collecting metadata.

Maybe, I've been known to be right once or twice a year :)

On Wed, Dec 13, 2017 at 11:18 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:

Hi Boris,

Given that we don't have much data we can use here, I'll have to extrapolate. As an aside though, this is yet another example where we need more Kudu-side metrics in the query profile.

So, Kudu lazily loads a bunch of metadata and that can really affect scan times. If this was your first time ever scanning the table since a Kudu restart, it's very possible that that's where that time was spent. There's also the page cache in the OS that might now be populated. You could do something like "sync; echo 3 > /proc/sys/vm/drop_caches" on all the machines and run the query 2 times again, without restarting Kudu, to understand the effect of the page cache itself. There's currently no way to purge the cached metadata in Kudu, though.
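The compute-stats suggestion above could be scripted so it runs after every Kudu restart. The sketch below is an assumption, not something from the thread: the impalad coordinator address and table names are placeholders, and it only prints the impala-shell commands (a dry run) so nothing executes by accident.

```shell
# Sketch: after a Kudu restart, recompute Impala stats on each Kudu-backed
# table; the full scan COMPUTE STATS triggers also re-loads Kudu's lazily
# loaded metadata. Host and table names below are placeholders.
IMPALAD="impalad-host:21000"          # hypothetical coordinator
TABLES="mydb.big_fact mydb.small_dim" # hypothetical Kudu-backed tables

for tbl in $TABLES; do
  # Dry run: print each command; remove 'echo' to actually execute it.
  echo impala-shell -i "$IMPALAD" -q "\"COMPUTE STATS $tbl\""
done
```

Removing the `echo` turns the dry run into the real warm-up; the loop could be hooked into whatever script restarts the Kudu services.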
Hope this helps a bit,

J-D

On Wed, Dec 13, 2017 at 8:07 AM, Boris Tyukin <bo...@boristyukin.com> wrote:

Hi guys,

I am doing some benchmarks with Kudu and Impala/Parquet and hope to share them soon, but there is one thing that bugs me. This is perhaps an Impala question, but since I am using Kudu with Impala I am going to try and ask anyway.

One of my queries takes 120 seconds to run the very first time. It joins one large 5B-row table with a bunch of smaller tables and then stores the result in Impala/Parquet (not Kudu).

Now if I run it a second and third time, it only takes 60 seconds. Can someone explain why? Are there any settings to decrease this gap?

I've compared the query profiles in CM and the only thing that was very different is the scan against the Kudu table (the large one):

***************************
first time:
***************************
KUDU_SCAN_NODE (id=0) (47.68s)
<https://lkmaorabd103.multihosp.net:7183/cmf/impala/queryDetails?queryId=5143f7165be82819%3Ae00a103500000000&serviceName=impala#>

- BytesRead: 0 B
- InactiveTotalTime: 0ns
- KuduRemoteScanTokens: 0
- NumScannerThreadsStarted: 20
- PeakMemoryUsage: 35.8 MiB
- RowsRead: 693,502,241
- RowsReturned: 693,502,241
- RowsReturnedRate: 14,643,448 per second
- ScanRangesComplete: 20
- ScannerThreadsInvoluntaryContextSwitches: 1,341
- ScannerThreadsTotalWallClockTime: 36.2m
  - MaterializeTupleTime(*): 47.57s
  - ScannerThreadsSysTime: 31.42s
  - ScannerThreadsUserTime: 1.7m
- ScannerThreadsVoluntaryContextSwitches: 96,855
- TotalKuduScanRoundTrips: 52,308
- TotalReadThroughput: 0 B/s
- TotalTime: 47.68s

***************************
second time:
***************************
KUDU_SCAN_NODE (id=0) (4.28s)
<https://lkmaorabd103.multihosp.net:7183/cmf/impala/queryDetails?queryId=53497a308f860837%3A243772e000000000&serviceName=impala#>

- BytesRead: 0 B
- InactiveTotalTime: 0ns
- KuduRemoteScanTokens: 0
- NumScannerThreadsStarted: 20
- PeakMemoryUsage: 37.9 MiB
- RowsRead: 693,502,241
- RowsReturned: 693,502,241
- RowsReturnedRate: 173,481,534 per second
- ScanRangesComplete: 20
- ScannerThreadsInvoluntaryContextSwitches: 1,451
- ScannerThreadsTotalWallClockTime: 19.5m
  - MaterializeTupleTime(*): 4.20s
  - ScannerThreadsSysTime: 38.22s
  - ScannerThreadsUserTime: 1.7m
- ScannerThreadsVoluntaryContextSwitches: 480,870
- TotalKuduScanRoundTrips: 52,142
- TotalReadThroughput: 0 B/s
- TotalTime: 4.28s
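The page-cache experiment J-D describes in the thread ("sync; echo 3 > /proc/sys/vm/drop_caches" on all the machines) could be wrapped in a small loop like the sketch below. The hostnames are placeholders, dropping caches requires root on each node, and the loop only prints the ssh commands (a dry run) rather than running them.

```shell
# Sketch: flush dirty pages and drop the OS page cache on every node, so a
# re-run of the query measures Kudu's cached metadata without help from the
# OS page cache. Hostnames are placeholders; remove 'echo' to execute.
HOSTS="node1 node2 node3"
DROP='sync; echo 3 > /proc/sys/vm/drop_caches'

for host in $HOSTS; do
  echo ssh "$host" "sudo sh -c '$DROP'"
done
```

Running the query twice after this, without restarting Kudu, separates the page-cache effect from the lazily loaded Kudu metadata, which (per the thread) only goes away on a tserver restart.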