Hi,

Does limit work for DataFrames, Spark SQL, and HiveContext without a full scan of Parquet files in Spark 1.6?
I just used it to create a small Parquet file from a large number of Parquet files, and found that it does a full scan of all the data instead of reading only the limited number of rows. All of the commands below do a full scan:

    val results = sqlContext.read.load("/largenumberofparquetfiles/")
    results.limit(1).write.parquet("/tmp/smallresults1")

    results.registerTempTable("resultTemp")
    val select = sqlContext.sql("select * from resultTemp limit 1")
    select.write.parquet("/tmp/smallresults2")

The same happens when I create an external table named results in a HiveContext over the same data:

    hiveContext.sql("select * from results limit 1").write.parquet("/tmp/results/one3")

Thanks,
Arkadiusz Bicz
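P.S. For anyone reproducing this, one way to see what the planner is actually doing is to print the query plans with the standard DataFrame explain method. This is just a minimal diagnostic sketch, assuming the same results DataFrame and resultTemp table defined above:

    // Print the logical and physical plans. If the limit is not pushed
    // down to the data source, the physical plan still shows a scan over
    // all of the Parquet files rather than a single file.
    results.limit(1).explain(true)

    // The same check for the SQL path through the temp table.
    sqlContext.sql("select * from resultTemp limit 1").explain(true)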