[ https://issues.apache.org/jira/browse/SPARK-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon reopened SPARK-21177:
----------------------------------

I am reopening this as I can reproduce it:

{code}
def printTimeTaken(str: String, f: () => Unit) {
  val start = System.nanoTime()
  f()
  val end = System.nanoTime()
  val timetaken = end - start
  import scala.concurrent.duration._
  println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
}

for(i <- 1 to 1000) {
  printTimeTaken("time to append to hive:", () => {
    Seq(1, 2).toDF().write.mode("append").saveAsTable("t1");
  })
}
{code}

{code}
...
Time taken for time to append to hive: is 166
Time taken for time to append to hive: is 155
Time taken for time to append to hive: is 164
...
Time taken for time to append to hive: is 374
Time taken for time to append to hive: is 360
Time taken for time to append to hive: is 377
{code}

{code}
scala> sql("SELECT count(*) from t1").show()
+--------+
|count(1)|
+--------+
|    2000|
+--------+
{code}

This is what I did with {{format("hive")}}:

{code}
hive> create table t1 (value bigint);
{code}

Output:

{code}
...
Time taken for time to append to hive: is 593
Time taken for time to append to hive: is 587
Time taken for time to append to hive: is 580
...
Time taken for time to append to hive: is 506
Time taken for time to append to hive: is 511
Time taken for time to append to hive: is 507
{code}

{code}
scala> sql("SELECT count(*) from t1").show()
+--------+
|count(1)|
+--------+
|    2000|
+--------+
{code}

> df.saveAsTable slows down linearly, with number of appends
> ----------------------------------------------------------
>
>                 Key: SPARK-21177
>                 URL: https://issues.apache.org/jira/browse/SPARK-21177
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Prashant Sharma
>
> In short, please use the following shell transcript for the reproducer.
> {code:java}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>       /_/
>
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> def printTimeTaken(str: String, f: () => Unit) {
>     val start = System.nanoTime()
>     f()
>     val end = System.nanoTime()
>     val timetaken = end - start
>     import scala.concurrent.duration._
>     println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
>   }
>      |      |      |      |      |      |      | printTimeTaken: (str: String, f: () => Unit)Unit
> scala> for(i <- 1 to 100000) {printTimeTaken("time to append to hive:", () => { Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"); })}
> Time taken for time to append to hive: is 284
> Time taken for time to append to hive: is 211
> ...
> ...
> Time taken for time to append to hive: is 2615
> ...
> Time taken for time to append to hive: is 3055
> ...
> Time taken for time to append to hive: is 22425
> ....
> {code}
> Why does it matter?
> In a streaming job it is not possible to append to hive using this dataframe operation.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
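
One way to put a number on "slows down linearly" is to collect the per-append timings instead of only printing them, and then fit a straight line through them. The following is a minimal sketch under stated assumptions, not part of the original report: {{timeMillis}} and {{slopePerIteration}} are hypothetical helper names, and the fit is a plain least-squares slope of elapsed milliseconds against the 1-based iteration index. A slope near zero would suggest flat append cost; a clearly positive slope would match the growth seen in the transcripts above.

```scala
import scala.concurrent.duration._

object AppendTimingSketch {
  // Hypothetical helper: runs `f` once and returns the elapsed wall-clock
  // time in milliseconds (same measurement as printTimeTaken, but returned).
  def timeMillis(f: () => Unit): Long = {
    val start = System.nanoTime()
    f()
    (System.nanoTime() - start).nanos.toMillis
  }

  // Hypothetical helper: ordinary least-squares slope of the timings against
  // the 1-based iteration index, i.e. extra milliseconds per additional append.
  def slopePerIteration(timings: Seq[Long]): Double = {
    require(timings.size >= 2, "need at least two samples to fit a slope")
    val n = timings.size.toDouble
    val xs = (1 to timings.size).map(_.toDouble)
    val ys = timings.map(_.toDouble)
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = xs.map(x => (x - meanX) * (x - meanX)).sum
    cov / varX
  }
}

// Possible use in spark-shell (assumes the same table t1 as in the reproducer):
//   val timings = (1 to 1000).map { _ =>
//     AppendTimingSketch.timeMillis(() =>
//       Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"))
//   }
//   println(AppendTimingSketch.slopePerIteration(timings))
```

The slope alone does not say where the time goes; it only makes the two runs in this thread comparable with a single figure.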