Time taken to merge Spark PRs?
All, I just finished the SPARK-3182 feature and, for me, it's raised a larger question of how to ensure patches that are awaiting review get noted / tagged upstream. Since I don't have write access to assign the above issue to myself, I can't tag it as “In Progress” like Matei mentioned, so, at this rate, it's just going to sit in the queue. Did I miss something on the “Contributing to Spark” page? Is there a 'tribal-knowledge' way to let a set of committers know that patches are ready, or is it that everyone is already too slammed and we're all waiting diligently? :) Just trying to get some clarity on this topic, thanks!
Re: [VOTE] Release Apache Spark 1.1.1 (RC2)
Actually +1 from me... This is a recommendAll feature we are testing which is really compute intensive. For ranking-metric calculation I was trying to run through the Netflix matrix and generate a ranked list of recommendations for all 17K products, and perhaps it needs more compute than I gave it. I was running 6 nodes, 120 cores, 240 GB; it needed to shuffle around 100 GB over 6 nodes. A version with topK runs fine, where K is some multiplier on the number of movies each user saw and we cross-validate on that; a sketch of the pattern follows this thread. Running the following JIRA on the Netflix dataset (the dataset is distributed with the Jellyfish code, http://i.stanford.edu/hazy/victor/Hogwild/) will reproduce the failure: https://issues.apache.org/jira/browse/SPARK-4231 The failed job I will debug more to figure out the real cause. If needed I will open new JIRAs.

On Sun, Nov 23, 2014 at 9:50 AM, Debasish Das debasish.da...@gmail.com wrote: -1 from me... same FetchFailed issue as what Hector saw. I am running the Netflix dataset and dumping out recommendations for all users. It shuffles around 100 GB of data on disk to run a reduceByKey per user on utils.BoundedPriorityQueue. The code runs fine with the MovieLens1m dataset. I gave Spark 10 nodes, 8 cores, 160 GB of memory. It fails with the following FetchFailed errors: 14/11/23 11:51:22 WARN TaskSetManager: Lost task 28.0 in stage 188.0 (TID 2818, tblpmidn08adv-hdp.tdc.vzwcorp.com): FetchFailed(BlockManagerId(1, tblpmidn03adv-hdp.tdc.vzwcorp.com, 52528, 0), shuffleId=35, mapId=28, reduceId=28) It's consistent behavior on master as well. I tested it both on YARN and Standalone. I compiled the spark-1.1 branch (assuming it has all the fixes from the RC2 tag). I am now compiling the spark-1.0 branch to see if this issue shows up there as well. If it is related to hash/sort-based shuffle, most likely it won't show up on 1.0. Thanks. Deb

On Thu, Nov 20, 2014 at 12:16 PM, Hector Yee hector@gmail.com wrote: Whoops, I must have used the 1.2 preview and mixed them up. spark-shell -version shows version 1.2.0. Will update the bug https://issues.apache.org/jira/browse/SPARK-4516 to 1.2.

On Thu, Nov 20, 2014 at 11:59 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Ah, I see. But the spark.shuffle.blockTransferService property doesn't exist in 1.1 (AFAIK) -- what exactly are you doing to get this problem? Matei

On Nov 20, 2014, at 11:50 AM, Hector Yee hector@gmail.com wrote: This is whatever was in http://people.apache.org/~andrewor14/spark-1.1.1-rc2/

On Thu, Nov 20, 2014 at 11:48 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hector, is this a comment on 1.1.1 or on the 1.2 preview? Matei

On Nov 20, 2014, at 11:39 AM, Hector Yee hector@gmail.com wrote: I think it is a race condition caused by netty deactivating a channel while it is active. Switched to nio and it works fine: --conf spark.shuffle.blockTransferService=nio

On Thu, Nov 20, 2014 at 10:44 AM, Hector Yee hector@gmail.com wrote: I'm still seeing the fetch failed error and updated https://issues.apache.org/jira/browse/SPARK-3633

On Thu, Nov 20, 2014 at 10:21 AM, Marcelo Vanzin van...@cloudera.com wrote: +1 (non-binding)
. ran simple things on spark-shell
. ran jobs in yarn client and cluster modes, and standalone cluster mode

On Wed, Nov 19, 2014 at 2:51 PM, Andrew Or and...@databricks.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.1.1. This release fixes a number of bugs in Spark 1.1.0.
Some of the notable ones are:
- [SPARK-3426] Sort-based shuffle compression settings are incompatible
- [SPARK-3948] Stream corruption issues in sort-based shuffle
- [SPARK-4107] Incorrect handling of Channel.read() led to data truncation

The full list is at http://s.apache.org/z9h and in the CHANGES.txt attached. Additionally, this candidate fixes two blockers from the previous RC:
- [SPARK-4434] Cluster mode jar URLs are broken
- [SPARK-4480][SPARK-4467] Too many open files exception from shuffle spills

The tag to be voted on is v1.1.1-rc2 (commit 3693ae5d): http://s.apache.org/p8

The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~andrewor14/spark-1.1.1-rc2/

Release artifacts are signed with the following key: https://people.apache.org/keys/committer/andrewor14.asc

The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1043/

The documentation corresponding to this release can be found at: http://people.apache.org/~andrewor14/spark-1.1.1-rc2-docs/

Please vote on releasing this package as Apache Spark 1.1.1! The vote is open until Saturday, November 22, at 23:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.1.1
[ ] -1 Do not release this package because ...
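A note on the recommendAll discussion earlier in this thread: below is a minimal sketch of the per-user top-K pattern Deb describes. His job used a reduceByKey per user over Spark's internal utils.BoundedPriorityQueue; since that class may not be accessible from application code, the sketch keeps a plain bounded array instead. The RDD name, the key/value types, and the choice of K are illustrative assumptions, not the code from the actual job.

import org.apache.spark.SparkContext._   // pair-RDD implicits (reduceByKey) for Spark 1.1-era code
import org.apache.spark.rdd.RDD

object TopKRecommend {
  // scored: (userId, (productId, score)) pairs, assumed to come from scoring the
  // factor model against the full product catalog.
  def topKPerUser(scored: RDD[(Int, (Int, Double))], k: Int): RDD[(Int, Array[(Int, Double)])] = {
    scored
      // Seed each record as a single-element "top-K so far" list.
      .mapValues(rec => Array(rec))
      // Merge partial lists per user, keeping at most k entries at every step,
      // so the shuffle carries k records per user instead of the full ranked list.
      .reduceByKey((a, b) => (a ++ b).sortBy(-_._2).take(k))
  }
}

Whether this avoids the FetchFailed errors reported above is a separate question (those look like a shuffle/transport problem rather than an application one), but it bounds the shuffle volume the way the topK variant mentioned in the thread does.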
Re: [VOTE] Release Apache Spark 1.1.1 (RC2)
+1 Release this package as Apache Spark 1.1.1

On 20 Nov 2014 04:22, Andrew Or and...@databricks.com wrote: I will start with a +1

2014-11-19 14:51 GMT-08:00 Andrew Or and...@databricks.com: Please vote on releasing the following candidate as Apache Spark version 1.1.1. This release fixes a number of bugs in Spark 1.1.0. Some of the notable ones are:
- [SPARK-3426] Sort-based shuffle compression settings are incompatible
- [SPARK-3948] Stream corruption issues in sort-based shuffle
- [SPARK-4107] Incorrect handling of Channel.read() led to data truncation

The full list is at http://s.apache.org/z9h and in the CHANGES.txt attached. Additionally, this candidate fixes two blockers from the previous RC:
- [SPARK-4434] Cluster mode jar URLs are broken
- [SPARK-4480][SPARK-4467] Too many open files exception from shuffle spills

The tag to be voted on is v1.1.1-rc2 (commit 3693ae5d): http://s.apache.org/p8

The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~andrewor14/spark-1.1.1-rc2/

Release artifacts are signed with the following key: https://people.apache.org/keys/committer/andrewor14.asc

The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1043/

The documentation corresponding to this release can be found at: http://people.apache.org/~andrewor14/spark-1.1.1-rc2-docs/

Please vote on releasing this package as Apache Spark 1.1.1! The vote is open until Saturday, November 22, at 23:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.1.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

Cheers, Andrew
[VOTE][RESULT] Release Apache Spark 1.1.1 (RC2)
The vote passes unanimously with 4 binding +1 votes, 5 non-binding +1 votes, and no +0 or -1 votes. The final release will be posted in the next 48 hours. Thanks to everyone who voted. -Andrew

+1:
Andrew Or*
Xiangrui Meng*
Krishna Sankar
Matei Zaharia*
Sean Owen
Anant Asthana
Marcelo Vanzin
Patrick Wendell*
Debasish Das

+0:

-1:

*binding
Re: Notes on writing complex spark applications
Thanks Patrick, You raise a good point: for this to be useful it's imperative that it is updated with new versions of Spark. My thought with putting it on the wiki was that it's lower friction for community members to edit, but it likely won't have the same level of quality control as the existing documentation. At a higher level, some of these tips are best practices for writing applications that depend on Spark. I'm wondering if a new document is in order for things like "this is how you set up a project skeleton to link against Spark", "this is how you handle external libraries", etc. I know that in the past I've run into stumbling blocks on things like getting classpaths correct, trying to link against a different version of akka, and so on, which would be useful to have in such a document, in addition to some of the application-architecture suggestions we propose in *this* document (a sketch of one such project setup follows this thread). - Evan

On Sun, Nov 23, 2014 at 9:02 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Evan, It might be nice to merge this into existing documentation. In particular, a lot of this could serve to update the current tuning section and programming guides. It could also work to paste this wholesale as a reference for Spark users, but in that case it's less likely to get updated when other things change, or be found by users reading through the Spark docs. - Patrick

On Sun, Nov 23, 2014 at 8:27 PM, Inkyu Lee gof...@gmail.com wrote: Very helpful!! Thank you very much!

2014-11-24 2:17 GMT+09:00 Sam Bessalah samkiller@gmail.com: Thanks Evan, this is great.

On Nov 23, 2014 5:58 PM, Evan R. Sparks evan.spa...@gmail.com wrote: Hi all, Shivaram Venkataraman, Joseph Gonzalez, Tomer Kaftan, and I have been working on a short document about writing high-performance Spark applications based on our experience developing MLlib, GraphX, ml-matrix, pipelines, etc. It may be a useful document both for users and new Spark developers; perhaps it should go on the wiki? The document itself is here: https://docs.google.com/document/d/1gEIawzRsOwksV_bq4je3ofnd-7Xu-u409mdW-RXTDnQ/edit?usp=sharing and I've created SPARK-4565 https://issues.apache.org/jira/browse/SPARK-4565 to track this. - Evan
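As one concrete example of the "project skeleton that links against Spark" question raised above, here is a minimal sbt sketch for a Spark 1.1.x application. Marking Spark as a "provided" dependency means you compile against it but do not bundle it, so the cluster's own Spark (and, transitively, its own akka) is used at runtime, which sidesteps the classpath and akka-version conflicts mentioned. The project name and the extra library are placeholders, not recommendations.

// build.sbt -- minimal application skeleton (versions are assumptions; match your cluster)
name := "my-spark-app"                                        // placeholder project name

scalaVersion := "2.10.4"                                      // Spark 1.1.x is built against Scala 2.10

libraryDependencies ++= Seq(
  // Compile against Spark, but let the cluster provide it at runtime.
  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided",
  // Application-only libraries are ordinary compile dependencies.
  "org.scalanlp" %% "breeze" % "0.9"                          // illustrative extra library
)

From there, sbt-assembly (or an equivalent Maven shade setup) can produce a single application jar to hand to spark-submit; with Spark marked as provided it is typically left out of that jar.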
[SparkSQL] Why is this AttributeReference.exprId not set?
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Aggregate.scala#L85

I can't understand this code; it looks like a bug, yet GROUP BY in SparkSQL works fine. In the code below, some expressions are mapped to AttributeReferences, and then the bindReference method is supposed to find their references. But resultAttribute's exprId is new, so I don't think it can find the true reference (but it does; why, and how?).

private[this] val resultExpressions = aggregateExpressions.map { agg =>
  agg.transform {
    case e: Expression if resultMap.contains(e) => resultMap(e)
  }
}

I'm trying to write a DSL with Catalyst.
Re: [SparkSQL][Solved] Why is this AttributeReference.exprId not set?
Got it. Only NamedExpressions have an exprId, so we have to make a new Attribute here.

private[this] val computedSchema = computedAggregates.map(_.resultAttribute)
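To make the resolution above concrete, here is a deliberately simplified, self-contained sketch (these are stand-in types, not Catalyst's real classes) of why a freshly generated exprId still binds. The same resultAttribute instance, carrying its new exprId, is put both into resultMap (so the rewritten result expressions refer to it) and into computedSchema (the operator's output), so lookup by exprId succeeds even though that exprId never appears in the child plan.

// Stand-ins for NamedExpression / AttributeReference; not Catalyst's API.
case class ExprId(id: Long)
case class Attr(name: String, exprId: ExprId)

object ExprId {
  private var counter = 0L
  def newId(): ExprId = { counter += 1; ExprId(counter) }
}

object ExprIdBindingSketch extends App {
  // "resultAttribute": a brand-new attribute (fresh exprId) for a computed aggregate.
  val sumSales = Attr("SUM(sales)", ExprId.newId())

  // The operator's output schema uses that attribute ...
  val computedSchema = Seq(sumSales)

  // ... and the rewritten result expressions reference the very same attribute,
  // so binding (finding the attribute's ordinal in the input schema by exprId) succeeds.
  def bindReference(attr: Attr, input: Seq[Attr]): Int =
    input.indexWhere(_.exprId == attr.exprId)

  assert(bindReference(sumSales, computedSchema) == 0)
}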