Re: performance issue involving "insert as select"

Todd Lipcon Sun, 11 Dec 2016 21:52:33 -0800

Hi Rotem,

On Thu, Dec 8, 2016 at 3:25 AM, Rotem Gabay <[email protected]> wrote:


> Hi, I have a small cluster on which I tried to run some performance tests
> on Kudu. In order to populate some data I ran a simple "insert as
> select" from a simple HDFS table, which took 10 minutes to finish. I then tried
> to duplicate the same data by doing another insert as select from the Kudu
> table to itself (insert into kudu_tbl select * from kudu_tbl); this insert
> took more than 2 hours to complete. Is there any reasonable explanation?
>

One interesting aspect of current releases of Kudu is that Impala queries
don't operate with snapshot consistency. In the case that you are writing
into the same table that you are reading from, it's actually possible that
the query reads its own results.

Put another way, one fragment of the query may be writing into a tablet
while another fragment is still reading that tablet. Without snapshot
consistency, it's actually possible for this to create a sort of "infinite
loop" of inserts. While usually not infinite, it can end up producing far
more rows than you expected, which would also make the query run far longer
than a single copy of the data should take.
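
As a rough illustration of that effect, here is a toy Python model. It is not
Kudu's or Impala's actual scan or write path; the function name self_insert,
the key-ordered list standing in for a tablet, and the random keys given to
copied rows are all assumptions made for the sketch. A "scan" walks the table
in key order and inserts one copy per row it reads. Without a snapshot, copies
that land ahead of the scan cursor are read and copied again.

```python
import bisect
import random


def self_insert(keys, snapshot, seed=0):
    """Toy model of `insert into t select * from t`.

    Returns the number of rows inserted. `keys` are the (distinct) keys of
    the rows present before the query starts. Hypothetical helper, not a
    Kudu or Impala API.
    """
    rng = random.Random(seed)
    table = sorted(keys)  # the live table, ordered by key
    # With snapshot consistency the scan reads a frozen copy of the table;
    # without it, the scan reads the same list the writer is modifying.
    source = list(table) if snapshot else table

    inserted = 0
    pos = 0
    while pos < len(source):
        key = source[pos]
        # Assumption: each copied row gets a fresh key in [0, 1).
        bisect.insort(table, rng.random())
        inserted += 1
        # Advance the cursor by key, so rows inserted behind it are not
        # re-read, but rows inserted ahead of it will be.
        pos = bisect.bisect_right(source, key)
    return inserted


if __name__ == "__main__":
    n = 1000
    keys = [i / n for i in range(n)]
    print("snapshot scan inserted:    ", self_insert(keys, snapshot=True))
    print("non-snapshot scan inserted:", self_insert(keys, snapshot=False))
```

In this model the snapshot scan inserts exactly one copy per original row,
while the non-snapshot scan inserts more, because some copies are themselves
scanned and copied. How many extra rows appear in a real cluster would depend
on key distribution, partitioning and timing, none of which this sketch tries
to capture.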

We're working on addressing this in upcoming releases. In the meantime,
it's probably best to generate your data in a different fashion rather than
inserting into the same table that you're reading from.

Hope that helps. Let us know if the explanation doesn't seem to match up
with what you're seeing.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera
