For the following code:
val df = sqlContext.parquetFile(path)
`df` remains columnar (actually it just reads from the columnar Parquet
file on disk).
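As an aside, a minimal sketch of what that means at the row level
(assuming the first column of the schema is a string): operations on
`df` hand you Row objects, because the columnar layout lives in the
file, not in the scanned RDD.

// Each element the map sees is an org.apache.spark.sql.Row,
// assembled from the columnar Parquet file at scan time.
df.map(row => row.getString(0)).take(5)

For the following code: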
val cdf = df.cache()
`cdf` is also columnar, but in a format different from Parquet's: when a
DataFrame is cached, Spark SQL converts it into its own private in-memory
columnar format.
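As a minimal sketch of both ways to trigger that conversion (the table
name "people" is just a placeholder; the batch size of the in-memory
format is controlled by spark.sql.inMemoryColumnarStorage.batchSize):

val cdf = df.cache()   // marks the DataFrame for caching
cdf.count()            // first action materializes the columnar batches
// Equivalently, via the catalog:
df.registerTempTable("people")
sqlContext.cacheTable("people")
sqlContext.isCached("people")  // true once the table is marked as cached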
So for your last question, the answer is: yes.
Cheng
On 6/3/15 10:58 PM, lonikar wrote:
When Spark reads Parquet files (sqlContext.parquetFile), it creates a
DataFrame RDD. I would like to know whether the resulting DataFrame has a
columnar structure (many rows of a column coalesced together in memory) or
the row-wise structure of a regular Spark RDD. The Caching Data in Memory
section of the Spark SQL and DataFrames guide
<http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory>
says you need to call sqlContext.cacheTable("tableName") or df.cache() to
make it columnar. What exactly is this columnar structure?
To be precise: what does `row` represent in the expression
df.cache().map{row => ...}?
Is it a logical row that maintains an array of columns, where each column
in turn is an array of values for batchSize rows?