Time taken to merge Spark PRs?

2014-11-24 Thread York, Brennon
All, I just finished the SPARK-3182 feature and, for me, it’s raised a larger
question of how to ensure patches that are awaiting review get noted / tagged
upstream. Since I don’t have access rights to assign the above issue to myself,
I can’t tag it as “In Progress” like Matei mentioned, so, at this rate, it’s
just going to sit in the queue. Did I miss something on the “Contributing to
Spark” page? Is there a ‘tribal-knowledge’ way to let a set of committers know
that patches are ready, or is it that everyone is already too slammed and we’re
all waiting diligently? :) Just trying to get some clarity on this topic, thanks!




Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-24 Thread Debasish Das
Actually +1 from me...

This is a recommendAll feature we are testing, which is really compute
intensive...

For the ranking metric calculation, I was trying to run through the Netflix
matrix and generate a ranked list of recommendations for all 17K products, and
perhaps it needs more compute than I gave it. I was running 6 nodes, 120
cores, 240 GB... It needed to shuffle around 100 GB over the 6 nodes...

A version with topK runs fine, where K is some multiplier on the number of
movies each user saw (and we cross-validate on that).

Running the following JIRA on the Netflix dataset (the dataset is distributed
with the Jellyfish code, http://i.stanford.edu/hazy/victor/Hogwild/) will
reproduce the failure...

https://issues.apache.org/jira/browse/SPARK-4231

I will debug the failed job further and figure out the real cause. If needed
I will open new JIRAs.

On Sun, Nov 23, 2014 at 9:50 AM, Debasish Das debasish.da...@gmail.com
wrote:

 -1 from me... same FetchFailed issue as the one Hector saw...

 I am running the Netflix dataset and dumping out recommendations for all
 users. It shuffles around 100 GB of data on disk to run a reduceByKey per
 user on utils.BoundedPriorityQueue... The code runs fine with the MovieLens
 1M dataset...
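
 A minimal sketch of that per-user top-K reduce (illustrative only: the input
 RDD name and layout here are assumptions, and it uses aggregateByKey with a
 plain sorted list in place of Spark's private utils.BoundedPriorityQueue):

   // (in Spark 1.1, pair-RDD functions need: import org.apache.spark.SparkContext._)
   // scored: RDD[(Int, (Int, Double))] of (userId, (productId, score)),
   // assumed to be computed upstream; keep the k highest-scored per user.
   val k = 50
   val topK = scored.aggregateByKey(List.empty[(Int, Double)])(
     (acc, rec) => (rec :: acc).sortBy(-_._2).take(k), // fold in one record
     (a, b)     => (a ++ b).sortBy(-_._2).take(k)      // merge partial lists
   )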

 I gave Spark 10 nodes, 8 cores, 160 GB of memory.

 Fails with the following FetchFailed errors.

 14/11/23 11:51:22 WARN TaskSetManager: Lost task 28.0 in stage 188.0 (TID
 2818, tblpmidn08adv-hdp.tdc.vzwcorp.com): FetchFailed(BlockManagerId(1,
 tblpmidn03adv-hdp.tdc.vzwcorp.com, 52528, 0), shuffleId=35, mapId=28,
 reduceId=28)

 It's a consistent behavior on master as well.

 I tested it both on YARN and standalone. I compiled the spark-1.1 branch
 (assuming it has all the fixes from the RC2 tag).

 I am now compiling the spark-1.0 branch to see if this issue shows up there
 as well. If it is related to hash/sort-based shuffle, most likely it won't
 show up on 1.0.

 Thanks.

 Deb

 On Thu, Nov 20, 2014 at 12:16 PM, Hector Yee hector@gmail.com wrote:

 Whoops, I must have used the 1.2 preview and mixed them up.

 spark-shell -version shows version 1.2.0

 Will update the bug https://issues.apache.org/jira/browse/SPARK-4516 to
 1.2

 On Thu, Nov 20, 2014 at 11:59 AM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

  Ah, I see. But the spark.shuffle.blockTransferService property doesn't
  exist in 1.1 (AFAIK) -- what exactly are you doing to get this problem?
 
  Matei
 
  On Nov 20, 2014, at 11:50 AM, Hector Yee hector@gmail.com wrote:
 
  This is whatever was in http://people.apache.org/~andrewor14/spark-1.1.1-rc2/
 
  On Thu, Nov 20, 2014 at 11:48 AM, Matei Zaharia 
 matei.zaha...@gmail.com
  wrote:
 
  Hector, is this a comment on 1.1.1 or on the 1.2 preview?
 
  Matei
 
   On Nov 20, 2014, at 11:39 AM, Hector Yee hector@gmail.com
 wrote:
  
    I think it is a race condition caused by Netty deactivating a channel
    while it is active.
    Switched to nio and it works fine:
    --conf spark.shuffle.blockTransferService=nio
  
   On Thu, Nov 20, 2014 at 10:44 AM, Hector Yee hector@gmail.com
  wrote:
  
   I'm still seeing the fetch failed error and updated
   https://issues.apache.org/jira/browse/SPARK-3633
  
   On Thu, Nov 20, 2014 at 10:21 AM, Marcelo Vanzin 
 van...@cloudera.com
   wrote:
  
   +1 (non-binding)
  
    . ran simple things on spark-shell
    . ran jobs in yarn client & cluster modes, and standalone cluster mode
  
   On Wed, Nov 19, 2014 at 2:51 PM, Andrew Or and...@databricks.com
  wrote:
   Please vote on releasing the following candidate as Apache Spark
  version
   1.1.1.
  
    This release fixes a number of bugs in Spark 1.1.0. Some of the notable
    ones are
    - [SPARK-3426] Sort-based shuffle compression settings are incompatible
    - [SPARK-3948] Stream corruption issues in sort-based shuffle
    - [SPARK-4107] Incorrect handling of Channel.read() led to data truncation
    The full list is at http://s.apache.org/z9h and in the CHANGES.txt
    attached.
  
    Additionally, this candidate fixes two blockers from the previous RC:
    - [SPARK-4434] Cluster mode jar URLs are broken
    - [SPARK-4480][SPARK-4467] Too many open files exception from shuffle spills
  
    The tag to be voted on is v1.1.1-rc2 (commit 3693ae5d):
    http://s.apache.org/p8

    The release files, including signatures, digests, etc can be found at:
    http://people.apache.org/~andrewor14/spark-1.1.1-rc2/

    Release artifacts are signed with the following key:
    https://people.apache.org/keys/committer/andrewor14.asc

    The staging repository for this release can be found at:
    https://repository.apache.org/content/repositories/orgapachespark-1043/

    The documentation corresponding to this release can be found at:
    http://people.apache.org/~andrewor14/spark-1.1.1-rc2-docs/

    Please vote on releasing this package as Apache Spark 1.1.1!

    The vote is open until Saturday, November 22, at 23:00 UTC and passes if
    a majority of at least 3 +1 PMC votes are cast.
    [ ] +1 Release this package as Apache Spark 1.1.1

Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-24 Thread vaquar khan
+1 Release this package as Apache Spark 1.1.1
On 20 Nov 2014 04:22, Andrew Or and...@databricks.com wrote:

 I will start with a +1

 2014-11-19 14:51 GMT-08:00 Andrew Or and...@databricks.com:

   Please vote on releasing the following candidate as Apache Spark version 1.1.1.
 
  This release fixes a number of bugs in Spark 1.1.0. Some of the notable
  ones are
  - [SPARK-3426] Sort-based shuffle compression settings are incompatible
  - [SPARK-3948] Stream corruption issues in sort-based shuffle
  - [SPARK-4107] Incorrect handling of Channel.read() led to data
 truncation
  The full list is at http://s.apache.org/z9h and in the CHANGES.txt
  attached.
 
  Additionally, this candidate fixes two blockers from the previous RC:
  - [SPARK-4434] Cluster mode jar URLs are broken
  - [SPARK-4480][SPARK-4467] Too many open files exception from shuffle
  spills
 
  The tag to be voted on is v1.1.1-rc2 (commit 3693ae5d):
  http://s.apache.org/p8
 
  The release files, including signatures, digests, etc can be found at:
  http://people.apache.org/~andrewor14/spark-1.1.1-rc2/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/andrewor14.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1043/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~andrewor14/spark-1.1.1-rc2-docs/
 
  Please vote on releasing this package as Apache Spark 1.1.1!
 
  The vote is open until Saturday, November 22, at 23:00 UTC and passes if
  a majority of at least 3 +1 PMC votes are cast.
  [ ] +1 Release this package as Apache Spark 1.1.1
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  Cheers,
  Andrew
 



[VOTE][RESULT] Release Apache Spark 1.1.1 (RC2)

2014-11-24 Thread Andrew Or
The vote passes unanimously with 4 binding +1 votes, 5 non-binding +1
votes, and no +0 or -1 votes. The final release will be posted in the next
48 hours. Thanks to everyone who voted.

-Andrew

+1:

Andrew Or*
Xiangrui Meng*
Krishna Sankar
Matei Zaharia*
Sean Owen
Anant Asthana
Marcelo Vanzin
Patrick Wendell*
Debasish Das

+0:

-1:

*binding


Re: Notes on writing complex spark applications

2014-11-24 Thread Evan R. Sparks
Thanks Patrick,

You raise a good point: for this to be useful, it's imperative that it is
updated with new versions of Spark.

My thought with putting it on the wiki was that it's lower friction for
community members to edit, but it likely won't have the same level of
quality control as the existing documentation.

At a higher level, some of these tips are best practices for writing
applications that depend on Spark. I'm wondering if a new document is in
order for things like "this is how you set up a project skeleton to link
against Spark," "this is how you handle external libraries," etc. I know that
in the past I've run into stumbling blocks on things like getting classpaths
correct and trying to link against a different version of Akka, and it would
be useful to have those in such a document, in addition to some of the
application architecture suggestions we propose in *this* document.
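
To make the project-skeleton point concrete, here is a minimal sketch of what
such a document might show (the names and versions are illustrative, not a
blessed template): mark Spark as a provided dependency so the application
compiles against it but defers to the cluster's copy at runtime.

  // Hypothetical build.sbt for an application linking against Spark
  name := "my-spark-app"
  scalaVersion := "2.10.4"
  // "provided" keeps spark-core out of the assembly jar; the cluster supplies it
  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.1" % "provided"

The same document could then cover shading or excluding conflicting transitive
dependencies such as Akka.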

- Evan

On Sun, Nov 23, 2014 at 9:02 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hey Evan,

 It might be nice to merge this into existing documentation. In
 particular, a lot of this could serve to update the current tuning
 section and programming guides.

 It could also work to paste this wholesale as a reference for Spark
 users, but in that case it's less likely to get updated when other
 things change, or to be found by users reading through the Spark docs.

 - Patrick

 On Sun, Nov 23, 2014 at 8:27 PM, Inkyu Lee gof...@gmail.com wrote:
  Very helpful!!
 
  Thank you very much!
 
  2014-11-24 2:17 GMT+09:00 Sam Bessalah samkiller@gmail.com:
 
  Thanks Evan, this is great.
  On Nov 23, 2014 5:58 PM, Evan R. Sparks evan.spa...@gmail.com
 wrote:
 
   Hi all,
  
   Shivaram Venkataraman, Joseph Gonzalez, Tomer Kaftan, and I have been
    working on a short document about writing high-performance Spark
   applications based on our experience developing MLlib, GraphX,
 ml-matrix,
   pipelines, etc. It may be a useful document both for users and new
 Spark
   developers - perhaps it should go on the wiki?
  
   The document itself is here:
  
  
 
 https://docs.google.com/document/d/1gEIawzRsOwksV_bq4je3ofnd-7Xu-u409mdW-RXTDnQ/edit?usp=sharing
   and I've created SPARK-4565
   https://issues.apache.org/jira/browse/SPARK-4565 to track this.
  
   - Evan
  
 



[SparkSQL] Why is this AttributeReference.exprId not set?

2014-11-24 Thread EarthsonLu
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Aggregate.scala#L85

I can't understand this code; it seems to be a bug, but SparkSQL's GROUP BY
works fine.

With the code below, some expressions are mapped to AttributeReferences, and
then the bindReference method will find their references. But resultAttribute's
exprId is new, so I don't think it can find the true reference. (But it does.
Why? How?)


  private[this] val resultExpressions = aggregateExpressions.map { agg =>
    agg.transform {
      case e: Expression if resultMap.contains(e) => resultMap(e)
    }
  }

I'm trying to write a DSL with Catalyst








Re: [SparkSQL][Solved] Why is this AttributeReference.exprId not set?

2014-11-24 Thread EarthsonLu
Got it.

Only NamedExpressions have exprIds, so we have to make a new Attribute here:

  private[this] val computedSchema = computedAggregates.map(_.resultAttribute)
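
To illustrate the point, a small sketch against Catalyst internals (package
layout and constructor shape as of the Spark 1.1-era code; these are internal
APIs and may change): every freshly constructed AttributeReference gets its
own exprId, so two attributes with the same name are still distinct as far as
binding is concerned.

  import org.apache.spark.sql.catalyst.expressions.AttributeReference
  import org.apache.spark.sql.catalyst.types.IntegerType

  val a = AttributeReference("x", IntegerType, nullable = true)()
  val b = AttributeReference("x", IntegerType, nullable = true)()
  // Same name and type, but distinct auto-generated exprIds.
  println(a.exprId == b.exprId) // false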


