Hi Rotem,

On Thu, Dec 8, 2016 at 3:25 AM, Rotem Gabay <[email protected]> wrote:
> Hi,
> I have a small cluster on which I tried to run some performance tests
> on Kudu. In order to populate some data I made a simple "insert as
> select" from a simple HDFS table, which took 10 minutes to finish. I then tried
> to duplicate the same data by doing another insert as select from the Kudu
> table to itself (insert into kudu_tbl select * from kudu_tbl); this insert
> took more than 2 hours to complete. Is there any reasonable explanation?
>

One interesting aspect of current releases of Kudu is that Impala queries
don't operate with snapshot consistency. If you are writing into the same
table that you are reading from, it's possible for the query to read its own
results. Put another way, one fragment of the query may be writing into a
tablet while another fragment is still reading that tablet.

Without snapshot consistency, this can create a sort of "infinite loop" of
inserts. While usually not infinite, it can end up producing far more rows
than you expected.

We're working on addressing this in upcoming releases. In the meantime, it's
probably best to generate your data in a different fashion rather than
inserting into the same table that you're reading from.

Hope that helps. Let us know if the explanation doesn't seem to match up with
what you're seeing.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera
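
To make the feedback effect described above concrete, here is a toy model in Python. It is only a sketch under simplifying assumptions and is not how Kudu or Impala are implemented: the scan is assumed to walk the table in key order, each copied row is assumed to land at a random key, and a row is re-read only if it lands ahead of the scan position. The function name `self_insert` and its parameters are made up for illustration.

```python
import heapq
import random


def self_insert(initial_rows, snapshot, seed=0):
    """Toy model of `insert into t select * from t`; returns the final row count.

    Assumptions (illustrative only, not Kudu internals):
      * the scanner visits rows in ascending key order,
      * every row it reads is copied to a new, random key,
      * without a snapshot, a copy that lands ahead of the scanner is
        visible to the same scan and gets copied again.
    """
    rng = random.Random(seed)

    # Keys the scanner has not reached yet, kept as a min-heap.
    unread = [rng.random() for _ in range(initial_rows)]
    heapq.heapify(unread)

    total_rows = initial_rows
    while unread:
        cursor = heapq.heappop(unread)  # next row in key order
        new_key = rng.random()          # where the copy lands
        total_rows += 1
        # A snapshot read only ever sees the rows that existed at scan start.
        if not snapshot and new_key > cursor:
            heapq.heappush(unread, new_key)
    return total_rows


if __name__ == "__main__":
    print("with snapshot:   ", self_insert(1000, snapshot=True))
    print("without snapshot:", self_insert(1000, snapshot=False))
```

In this model the snapshot case exactly doubles the table, while the non-snapshot case still terminates but ends with more rows than a plain doubling, because some copies are read and copied again. How large the effect is on a real cluster depends on timing and data layout that this sketch does not attempt to capture.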
