Hi,

Does limit work for DataFrames, Spark SQL, and HiveContext without a full scan of Parquet files in Spark 1.6?
I just used it to create a small Parquet file from a large number of Parquet files, and found that it does a full scan of all the data instead of reading only the limited number of rows. All of the commands below do a full scan:

    val results = sqlContext.read.load("/largenumberofparquetfiles/")
    results.limit(1).write.parquet("/tmp/smallresults1")

    results.registerTempTable("resultTemp")
    val select = sqlContext.sql("select * from resultTemp limit 1")
    select.write.parquet("/tmp/smallresults2")

The same happens when I create an external table named results in a HiveContext over the same data:

    hiveContext.sql("select * from results limit 1").write.parquet("/tmp/results/one3")

Thanks,
Arkadiusz Bicz
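P.S. For anyone reproducing this, one way to see what the planner is actually doing is to print the query plans with the standard DataFrame explain method. This is just a minimal diagnostic sketch, assuming the same results DataFrame and resultTemp table defined above:

    // Print the logical and physical plans. If the limit is not pushed
    // down to the data source, the physical plan still shows a scan over
    // all of the Parquet files rather than a single file.
    results.limit(1).explain(true)

    // The same check for the SQL path through the temp table.
    sqlContext.sql("select * from resultTemp limit 1").explain(true)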