Great, thank you very much. I was confused because this is in the docs:
https://spark.apache.org/docs/1.2.0/sql-programming-guide.html, and on the
branch-1.2 branch,
https://github.com/apache/spark/blob/branch-1.2/docs/sql-programming-guide.md
Note that if you call schemaRDD.cache() rather than
Thanks for your response. So AFAICT, calling
parallelize(1 to 1024).map(i => KV(i, i.toString)).toSchemaRDD.cache().count()
will let me see the size of the SchemaRDD in memory, and
parallelize(1 to 1024).map(i => KV(i, i.toString)).cache().count() will show
me the size of a regular RDD.
But
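For reference, here is a minimal sketch of the two calls being compared,
assuming a Spark 1.2-era spark-shell (with sc and sqlContext predefined) and
that KV is a simple key/value case class as in the snippet above. The per-RDD
in-memory sizes can then be read from the Storage tab of the web UI or from
sc.getRDDStorageInfo.

// Sketch only: assumes a Spark 1.2 spark-shell with sc and sqlContext in scope.
import sqlContext.createSchemaRDD   // implicit RDD -> SchemaRDD conversion

case class KV(key: Int, value: String)

// Cache the SchemaRDD (row objects, not the columnar format) and materialize it.
val schemaRdd = sc.parallelize(1 to 1024).map(i => KV(i, i.toString)).toSchemaRDD
schemaRdd.cache().count()

// Cache a plain RDD of the same data for comparison.
val plainRdd = sc.parallelize(1 to 1024).map(i => KV(i, i.toString))
plainRdd.cache().count()

// Per-RDD in-memory sizes (also visible on the web UI's Storage tab).
sc.getRDDStorageInfo.foreach(info =>
  println(s"${info.name}: ${info.memSize} bytes in memory"))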
Hi,
I want to benchmark the memory savings of using the in-memory columnar
storage for SchemaRDDs (using cacheTable) vs. caching the SchemaRDD directly.
It would be really helpful to be able to query this from the spark-shell or
from jobs directly. Could a dev point me to how to do this? From what
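For what it's worth, a rough sketch of the comparison being described, again
assuming a Spark 1.2-era spark-shell and a simple KV case class; the table
name and the size check are illustrative assumptions, not an established
benchmarking recipe.

import sqlContext.createSchemaRDD

case class KV(key: Int, value: String)
val data = sc.parallelize(1 to 1024).map(i => KV(i, i.toString))

// (a) In-memory columnar storage via cacheTable.
data.registerTempTable("kv_columnar")
sqlContext.cacheTable("kv_columnar")
sqlContext.sql("SELECT COUNT(*) FROM kv_columnar").collect()  // force materialization

// (b) Caching the SchemaRDD directly (generic row objects, not columnar).
val schemaRdd = data.toSchemaRDD
schemaRdd.cache().count()

// Compare the reported in-memory sizes, or look at the web UI's Storage tab.
sc.getRDDStorageInfo.foreach(info =>
  println(s"${info.name}: ${info.memSize} bytes"))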
I frequently encounter problems building Java projects with Spark as a
dependency because of version conflicts with other dependencies. Usually
there will be two different versions of a library on the classpath and we'll
see an AbstractMethodError, an invalid signature, etc.
So far, I've seen it happen with jackson,
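One common workaround (sketched below for sbt; the same idea applies to Maven
exclusions) is to exclude the conflicting transitive artifact from the Spark
dependency and pin a single version explicitly. The Jackson artifact and
version numbers here are illustrative assumptions, not a recommendation for
any specific project.

// build.sbt -- sketch only; artifact names and versions are placeholders.
libraryDependencies ++= Seq(
  // Keep Spark's transitive copy of the conflicting library off the classpath...
  ("org.apache.spark" %% "spark-core" % "1.2.0")
    .exclude("com.fasterxml.jackson.core", "jackson-databind"),
  // ...and pin the one version the rest of the project is compiled against.
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4"
)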
Hi,
Did you ever figure this one out? I'm seeing the same behavior:
Calling cache() after a repartition() makes Spark cache the version of the
RDD BEFORE the repartition, which means a shuffle every time it is accessed.
However, calling cache() before the repartition() seems to work fine, the
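To make the two orderings being discussed concrete, a minimal sketch (data
stands in for whichever RDD is being repartitioned); which RDD actually ends
up cached can be checked on the web UI's Storage tab, or by watching whether
the shuffle reruns on the second action.

val data = sc.parallelize(1 to 1000000)

// cache() after repartition(): the call is made on the repartitioned RDD,
// so that is the RDD one would expect to land in the cache.
val cachedAfter = data.repartition(8).cache()
cachedAfter.count()  // first action materializes (and should cache) it
cachedAfter.count()  // second action: check whether the shuffle runs again

// cache() before repartition(): only the pre-shuffle RDD is cached;
// the repartitioned result itself is not.
val cachedBefore = data.cache().repartition(8)
cachedBefore.count()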