Spark performance tests

2017-01-09 Thread Prasun Ratn
Hi, Are there performance tests or microbenchmarks for Spark, especially directed towards the CPU-specific parts? I looked at spark-perf, but that doesn't seem to have been updated recently. Thanks Prasun

Re: How to hint Spark to use HashAggregate() for UDAF

2017-01-09 Thread Liang-Chi Hsieh
Hi Andy, Because hash-based aggregation uses UnsafeRow for its aggregation states, the aggregation buffer schema must contain only types that are mutable in UnsafeRow. If you can use TypedImperativeAggregate to implement your aggregation function, SparkSQL has ObjectHashAggregateExec, which supports hash-based
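[Editor's illustration of the constraint above; the schema names are hypothetical, not from the thread. Fixed-width primitive buffer fields are mutable in UnsafeRow and keep a UDAF eligible for hash-based aggregation, while variable-width fields such as arrays force the sort-based fallback.]

import org.apache.spark.sql.types._

// Hypothetical buffer schemas, for illustration only. Fixed-width primitive
// fields are mutable in UnsafeRow, so this buffer keeps a UDAF eligible for
// hash-based aggregation:
val hashFriendlyBuffer = StructType(Seq(
  StructField("count", LongType),
  StructField("sum", DoubleType)))

// Variable-width fields such as arrays are not mutable in UnsafeRow, so this
// buffer forces the sort-based fallback:
val sortOnlyBuffer = StructType(Seq(
  StructField("rows", ArrayType(StringType))))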

Re: Parquet patch release

2017-01-09 Thread Cheng Lian
Finished reviewing the list and it LGTM now (left comments in the spreadsheet and Ryan already made corresponding changes). Ryan - Thanks a lot for pushing this and making it happen! Cheng On 1/6/17 3:46 PM, Ryan Blue wrote: Last month, there was interest in a Parquet patch release on PR

Re: How to hint Spark to use HashAggregate() for UDAF

2017-01-09 Thread Andy Dang
Hi Takeshi, Thanks for the answer. My UDAF aggregates data into an array of rows. Apparently this makes it ineligible for the hash-based aggregate, based on the logic at:

Re: How to hint Spark to use HashAggregate() for UDAF

2017-01-09 Thread Takeshi Yamamuro
Hi, Spark always uses hash-based aggregation when the types of the aggregated data support it; otherwise it falls back to sort-based aggregation. See:
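[Editor's note: to see which operator the planner actually chose, explain() on the aggregated Dataset shows it. A minimal sketch, assuming a running SparkSession named spark:]

// Inspect the physical plan for the chosen aggregate operator.
val df = spark.range(100).groupBy("id").count()
df.explain()
// Look for "HashAggregate" (hash-based) vs. "SortAggregate" (sort-based
// fallback) in the printed physical plan.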

How to hint Spark to use HashAggregate() for UDAF

2017-01-09 Thread Andy Dang
Hi all, It appears to me that Dataset.groupBy().agg(udaf) requires a full sort, which is very inefficient for certain aggregations. The code is very simple: - I have a UDAF - What I want to do is: dataset.groupBy(cols).agg(udaf).count() The physical plan I got was: *HashAggregate(keys=[],
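[Editor's sketch of a UDAF of the shape being described — entirely hypothetical, not the poster's actual code. Its buffer is an array, which is exactly what disqualifies it from hash-based aggregation on the UnsafeRow path:]

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Collects input strings into an array. The ArrayType buffer field is
// variable-width, so the planner falls back to sort-based aggregation.
class CollectValues extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("values", ArrayType(StringType)) :: Nil)
  def dataType: DataType = ArrayType(StringType)
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = Seq.empty[String]
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    buffer(0) = buffer.getSeq[String](0) :+ input.getString(0)
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0)
  def evaluate(buffer: Row): Any = buffer.getSeq[String](0)
}

With a buffer like this, explain() shows SortAggregate; with a fixed-width buffer it would show HashAggregate.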

Re: A note about MLlib's StandardScaler

2017-01-09 Thread Sean Owen
This could be true if you knew you were just going to scale the input to StandardScaler and nothing else. It's probably more typical you'd scale some other data. The current behavior is therefore the sensible default, because the input is a sample of some unknown larger population. I think it
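[Editor's note: the distinction at issue, as a plain-Scala sketch (not MLlib code). The thread's point is that treating the input as a sample of a larger population means using the corrected sample standard deviation:]

val xs = Seq(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)
val n = xs.size
val mean = xs.sum / n
// Corrected sample standard deviation (Bessel's correction, divide by n - 1):
// appropriate when the data is a sample of some larger population.
val sampleStd = math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / (n - 1))
// Population standard deviation (divide by n): appropriate only when the
// data is the entire population.
val popStd = math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / n)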

Re: Spark checkpointing

2017-01-09 Thread Steve Loughran
On 7 Jan 2017, at 08:29, Felix Cheung wrote: Thanks Steve. As you have pointed out, we have seen some issues related to cloud storage as a "file system". I've been looking at checkpointing recently. What do you think would be the
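[Editor's sketch of pointing checkpointing at an object store; the bucket and path are hypothetical, and the s3a connector must be on the classpath:]

spark.sparkContext.setCheckpointDir("s3a://my-bucket/spark-checkpoints")

val rdd = spark.sparkContext.parallelize(1 to 100)
rdd.checkpoint()  // marks the RDD; it is written out on the next action
rdd.count()       // triggers the write to the checkpoint directory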

Re: scala.MatchError: scala.collection.immutable.Range.Inclusive from catalyst.ScalaReflection.serializerFor?

2017-01-09 Thread Holden Karau
If you want to check whether it's your modifications or something already in mainline, you can always check out mainline, or stash your current changes and rebuild (this is something I do pretty often when I run into bugs I don't think I would have introduced). On Mon, Jan 9, 2017 at 1:01 AM Liang-Chi Hsieh

Re: scala.MatchError: scala.collection.immutable.Range.Inclusive from catalyst.ScalaReflection.serializerFor?

2017-01-09 Thread Liang-Chi Hsieh
Hi, Seq(0 to 8) is: scala> Seq(0 to 8) res1: Seq[scala.collection.immutable.Range.Inclusive] = List(Range(0, 1, 2, 3, 4, 5, 6, 7, 8)) Do you actually want to create a Dataset of Range? If so, I think ScalaReflection, which the encoder relies on, doesn't currently support Range. Jacek
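[Editor's note: if the intent was a Dataset of the Range's elements rather than of the Range itself, either of these works — a sketch, assuming spark.implicits._ is in scope:]

import spark.implicits._

// Range is a Seq[Int], so Encoder[Int] applies to its elements:
(0 to 8).toDF("value")
// ...or flatten the outer Seq first:
Seq(0 to 8).flatten.toDS()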

Re: handling of empty partitions

2017-01-09 Thread Georg Heiler
Hi Liang-Chi Hsieh, Strange: you said the "toCarry" returned was the following when you tested my code: Map(1 -> Some(FooBar(Some(2016-01-04),lastAssumingSameDate)), 0 -> Some(FooBar(Some(2016-01-02),second))) For me it always looked like: ## carry Map(2 -> None, 5 -> None, 4 ->

scala.MatchError: scala.collection.immutable.Range.Inclusive from catalyst.ScalaReflection.serializerFor?

2017-01-09 Thread Jacek Laskowski
Hi, Just got this this morning using the fresh build of Spark 2.2.0-SNAPSHOT (with a few local modifications): scala> Seq(0 to 8).toDF scala.MatchError: scala.collection.immutable.Range.Inclusive (of class scala.reflect.internal.Types$ClassNoArgsTypeRef) at

Re: handling of empty partitions

2017-01-09 Thread Liang-Chi Hsieh
The map "toCarry" will return you (partitionIndex, None) for empty partition. So I think line 51 won't fail. Line 58 can fail if "lastNotNullRow" is None. You of course should check if an Option has value or not before you access it. As the "toCarry" returned is the following when I tested your