Re: Cache'ing performance

2016-08-27 Thread Kazuaki Ishizaki
Hi, Good point. I have just measured performance with "spark.sql.inMemoryColumnarStorage.compressed=false". It improved performance compared to the default, but it is still slower than the RDD version in my environment. This seems to be consistent with the PR
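A minimal sketch of the measurement being discussed. The config key `spark.sql.inMemoryColumnarStorage.compressed` is the real one from the thread; the local-mode session, app name, and the smaller row count are assumptions for illustration (the thread benchmarked `Int.MaxValue` rows on a cluster).

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local-mode session; the thread ran on a 3-node cluster.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cache-compression-check")
  // Disable in-memory columnar compression (it is enabled by default).
  .config("spark.sql.inMemoryColumnarStorage.compressed", "false")
  .getOrCreate()

// A smaller range than Int.MaxValue keeps this sketch quick.
val ds = spark.range(1000000L)
ds.cache()
val n = ds.count() // the first count materializes the in-memory cache
println(n)
spark.stop()
```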

Re: Cache'ing performance

2016-08-27 Thread linguin . m . s
Hi, How does the performance difference change when compression is turned off? It is enabled by default. // maropu Sent from iPhone On 2016/08/28 10:13, Kazuaki Ishizaki wrote: > Hi > I think that it is a performance issue in both DataFrame and Dataset cache. > It is not due

Re: Cache'ing performance

2016-08-27 Thread Kazuaki Ishizaki
Hi, I think this is a performance issue in both the DataFrame and Dataset caches; it is not due only to Encoders. The DataFrame version, "spark.range(Int.MaxValue).toDF.cache().count()", is also slow. While a cache for a DataFrame or Dataset is stored as a columnar format with some compressed data

Re: Structured Streaming with Kafka sources/sinks

2016-08-27 Thread Koert Kuipers
That's great. Is this effort happening anywhere that is publicly visible? GitHub? On Tue, Aug 16, 2016 at 2:04 AM, Reynold Xin wrote: > We (the team at Databricks) are working on one currently. > > > On Mon, Aug 15, 2016 at 7:26 PM, Cody Koeninger >

Cache'ing performance

2016-08-27 Thread Maciej Bryński
Hi, I did some benchmarking of the cache function today. *RDD* sc.parallelize(0 until Int.MaxValue).cache().count() *Datasets* spark.range(Int.MaxValue).cache().count() For me, the Dataset version was 2 times slower. Results (3 nodes, 20 cores and 48GB RAM each): *RDD - 6 s* *Datasets - 13.5 s* Is that expected
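The two cached counts from the message can be sketched side by side as below. This is a local-mode sketch, not the original cluster setup: the row count, the `time` helper, and the app name are assumptions added for illustration, and a second `count()` is included so the cached read is measured separately from cache materialization.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local-mode session; the thread's numbers came from a
// 3-node cluster counting Int.MaxValue rows.
val spark = SparkSession.builder().master("local[*]").appName("cache-bench").getOrCreate()
val sc = spark.sparkContext
val n = 1000000

// Tiny timing helper (illustrative; not from the thread).
def time[T](label: String)(f: => T): T = {
  val t0 = System.nanoTime()
  val r = f
  println(f"$label: ${(System.nanoTime() - t0) / 1e9}%.2f s")
  r
}

// RDD version: first count() materializes the cache, second reads from it.
val rddCount = time("RDD") {
  val rdd = sc.parallelize(0 until n).cache()
  rdd.count()
  rdd.count()
}

// Dataset version: cached in Spark SQL's in-memory columnar format.
val dsCount = time("Dataset") {
  val ds = spark.range(n.toLong).cache()
  ds.count()
  ds.count()
}
spark.stop()
```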

Re: Performance of loading parquet files into case classes in Spark

2016-08-27 Thread Maciej Bryński
2016-08-27 15:27 GMT+02:00 Julien Dumazert : > df.map(row => row.getAs[Long]("fieldToSum")).reduce(_ + _) I think reduce and sum have very different performance. Did you try sql.functions.sum? Or if you want to benchmark access to the Row object, then the count() function
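The two approaches being contrasted can be sketched as follows. The column name `fieldToSum` comes from the quoted snippet; the toy data and local-mode session are assumptions standing in for the Parquet-backed DataFrame from the thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Hypothetical session and data; the thread read case classes from Parquet.
val spark = SparkSession.builder().master("local[*]").appName("sum-vs-reduce").getOrCreate()
import spark.implicits._

val df = (1L to 100L).toDF("fieldToSum")

// Row-at-a-time: each Row is deserialized and the values are reduced in Scala.
val viaReduce = df.map(row => row.getAs[Long]("fieldToSum")).reduce(_ + _)

// Column-wise aggregate: stays inside the SQL engine, typically much faster.
val viaSum = df.agg(sum($"fieldToSum")).first().getLong(0)

println(s"reduce=$viaReduce sum=$viaSum") // both are 5050 for 1..100
spark.stop()
```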

Performance of loading parquet files into case classes in Spark

2016-08-27 Thread Julien Dumazert
Hi all, I'm forwarding a question I recently asked on Stack Overflow about benchmarking Spark performance when working with case classes stored in Parquet files. I am assessing the

Re: Assembly build on spark 2.0.0

2016-08-27 Thread Radoslaw Gruchalski
Ah, an uberjar. Normally one would build the uberjar with the Maven Shade plugin. I haven't looked into the Spark code much recently; it wouldn't make much sense to have a separate Maven command to build an uberjar while building a distribution because, from memory, if you open the tgz file, the uberjar

Re: Assembly build on spark 2.0.0

2016-08-27 Thread Srikanth Sampath
Found the answer. This is the reason: https://issues.apache.org/jira/browse/SPARK-11157 -Srikanth On Sat, Aug 27, 2016 at 8:54 AM, Srikanth Sampath wrote: > Hi, > Thanks Radek. However, mvn package does not build the uber jar. I am > looking for an uber jar and not
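For context, the linked issue (SPARK-11157) removed the single assembly jar in Spark 2.0; a distribution carrying a `jars/` directory is built instead. A sketch of the build invocation, run from a Spark source checkout (the name and profile flags below are illustrative, not from the thread):

```shell
# Build a runnable Spark 2.0 distribution tarball from a source checkout.
# Since SPARK-11157 there is no single assembly jar; the tarball ships a
# jars/ directory instead. Choose profiles to match your environment.
./dev/make-distribution.sh --name custom --tgz -Phadoop-2.7 -Phive -Pyarn
```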