[GitHub] spark issue #16241: [SPARK-18812] [MLLIB] explain "Spark ML"

2016-12-09 Thread mateiz
Github user mateiz commented on the issue: https://github.com/apache/spark/pull/16241 Looks good to me. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so

[GitHub] spark pull request #12913: [SPARK-928][CORE] Add support for Unsafe-based se...

2016-10-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/12913#discussion_r84570152 --- Diff: core/src/test/scala/org/apache/spark/serializer/UnsafeKryoSerializerSuite.scala --- @@ -0,0 +1,28 @@ +/* + * Licensed to the Apache

[GitHub] spark issue #12913: [SPARK-928][CORE] Add support for Unsafe-based serialize...

2016-10-20 Thread mateiz
Github user mateiz commented on the issue: https://github.com/apache/spark/pull/12913 @techaddict Cool, thanks! Just remembered a couple more things: - Can you edit KryoSerializerSuite to set the flag to false? Otherwise we might silently end up with both suites testing on true

[GitHub] spark pull request #12913: [SPARK-928][CORE] Add support for Unsafe-based se...

2016-10-18 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/12913#discussion_r83972911 --- Diff: core/src/test/scala/org/apache/spark/serializer/KryoSerializerSuite.scala --- @@ -75,9 +75,11 @@ class KryoSerializerSuite extends SparkFunSuite

[GitHub] spark issue #12913: [SPARK-928][CORE] Add support for Unsafe-based serialize...

2016-10-18 Thread mateiz
Github user mateiz commented on the issue: https://github.com/apache/spark/pull/12913 Looks pretty good overall! I made two small comments but it seems worthwhile to add in and it's not a huge change.

[GitHub] spark pull request #12913: [SPARK-928][CORE] Add support for Unsafe-based se...

2016-10-18 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/12913#discussion_r83971706 --- Diff: core/src/test/scala/org/apache/spark/serializer/KryoSerializerSuite.scala --- @@ -75,9 +75,11 @@ class KryoSerializerSuite extends SparkFunSuite

[GitHub] spark pull request #12913: [SPARK-928][CORE] Add support for Unsafe-based se...

2016-10-18 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/12913#discussion_r83971396 --- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala --- @@ -78,8 +79,14 @@ class KryoSerializer(conf: SparkConf) .filter

[GitHub] spark issue #8318: [SPARK-1267][PYSPARK] Adds pip installer for pyspark

2016-10-18 Thread mateiz
Github user mateiz commented on the issue: https://github.com/apache/spark/pull/8318 Probably switching from the PySpark in PyPI to a version you installed locally by downloading Spark.

[GitHub] spark issue #8318: [SPARK-1267][PYSPARK] Adds pip installer for pyspark

2016-10-16 Thread mateiz
Github user mateiz commented on the issue: https://github.com/apache/spark/pull/8318 Yes, it would be great to get this done. Just make sure that we have a good way to test it. Can you also document how a user is supposed to switch to a different pyspark (if they do have Spark

[GitHub] spark issue #8318: [SPARK-1267][PYSPARK] Adds pip installer for pyspark

2016-10-08 Thread mateiz
Github user mateiz commented on the issue: https://github.com/apache/spark/pull/8318 Cool, good to know that there's another ASF project that does it. We should go for it then.

[GitHub] spark issue #8318: [SPARK-1267][PYSPARK] Adds pip installer for pyspark

2016-10-07 Thread mateiz
Github user mateiz commented on the issue: https://github.com/apache/spark/pull/8318 BTW the other change now is that we don't make an assembly JAR by default anymore, though we could build one for this. We just need a build script for this that's solid, produces a release-policy

[GitHub] spark issue #8318: [SPARK-1267][PYSPARK] Adds pip installer for pyspark

2016-10-07 Thread mateiz
Github user mateiz commented on the issue: https://github.com/apache/spark/pull/8318 Something like this would be great IMO. A few questions though: * How will it work if users want to run a different version of PySpark from a different version of Spark (maybe something

[GitHub] spark issue #14956: [SPARK-17389] [ML] [MLLIB] KMeans speedup with better ch...

2016-09-10 Thread mateiz
Github user mateiz commented on the issue: https://github.com/apache/spark/pull/14956 Cool, thanks for improving the PIC test.

[GitHub] spark pull request #14956: [SPARK-17389] [ML] [MLLIB] KMeans speedup with be...

2016-09-09 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/14956#discussion_r78270573 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala --- @@ -395,7 +395,7 @@ object PowerIterationClustering

[GitHub] spark issue #14956: [SPARK-17389] [ML] [MLLIB] KMeans speedup with better ch...

2016-09-09 Thread mateiz
Github user mateiz commented on the issue: https://github.com/apache/spark/pull/14956 Cool, then it does make sense to change it.

[GitHub] spark issue #14956: [SPARK-17389] [ML] [MLLIB] KMeans speedup with better ch...

2016-09-07 Thread mateiz
Github user mateiz commented on the issue: https://github.com/apache/spark/pull/14956 I think the number 5 is indeed from that paper (I think from figure 5.1 actually), but have you tested the effect of using R=2 empirically? It would be good to check that they match what's

[GitHub] spark pull request #14956: [SPARK-17389] [ML] [MLLIB] KMeans speedup with be...

2016-09-07 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/14956#discussion_r77874667 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala --- @@ -395,7 +395,7 @@ object PowerIterationClustering

[GitHub] spark pull request #14931: [SPARK-17370] Shuffle service files not invalidat...

2016-09-01 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/14931#discussion_r77294974 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala --- @@ -346,15 +346,16 @@ private[spark] class TaskSchedulerImpl

[GitHub] spark pull request #13748: [SPARK-16031] Add debug-only socket source in Str...

2016-06-17 Thread mateiz
GitHub user mateiz opened a pull request: https://github.com/apache/spark/pull/13748 [SPARK-16031] Add debug-only socket source in Structured Streaming ## What changes were proposed in this pull request? This patch adds a text-based socket source similar to the one in Spark

[GitHub] spark issue #13609: [SPARK-15879] [DOCS] [UI] Update logo in UI and docs to ...

2016-06-10 Thread mateiz
Github user mateiz commented on the issue: https://github.com/apache/spark/pull/13609 Looks good to me.

[GitHub] spark pull request: [SPARK-15346] [MLlib] Reduce duplicate computa...

2016-05-16 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/13133#issuecomment-219463999 Cool, the change does look right to me, but as Sean said there are some style issues. It should definitely help speed up initialization!

[GitHub] spark pull request: [SPARK-14356] Update spark.sql.execution.debug...

2016-04-03 Thread mateiz
GitHub user mateiz opened a pull request: https://github.com/apache/spark/pull/12140 [SPARK-14356] Update spark.sql.execution.debug to work on Datasets ## What changes were proposed in this pull request? Update DebugQuery to work on Datasets of any type, not just DataFrames

[GitHub] spark pull request: [SPARK-12091] [PYSPARK] [Minor] Default storag...

2015-12-02 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/10092#issuecomment-161365442 It might be nice to only expose a smaller # of storage levels in Python, i.e. call them memory_only and memory_and_disk, but always use the serialized ones underneath

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-11-09 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9256#discussion_r44303122 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/StateSpec.scala --- @@ -0,0 +1,196 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-11-09 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9256#discussion_r44299243 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/State.scala --- @@ -0,0 +1,199 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-11-09 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9256#discussion_r44299415 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/EmittedRecordsDStream.scala --- @@ -0,0 +1,114 @@ +/* + * Licensed

[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

2015-11-08 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9555#discussion_r44238871 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/SumOf.scala --- @@ -0,0 +1,31 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

2015-11-08 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9555#discussion_r44238545 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala --- @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

2015-11-08 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/9555#issuecomment-154909428 The user-facing API looks good to me! I added some comments on the internal interfaces though.

[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

2015-11-08 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9555#discussion_r44238765 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/SumOf.scala --- @@ -0,0 +1,31 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-11578] [SQL] User API for Typed Aggrega...

2015-11-08 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9555#discussion_r44238638 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Column.scala --- @@ -39,10 +39,10 @@ private[sql] object Column

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-11-07 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9256#discussion_r44213675 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala --- @@ -351,6 +351,50 @@ class PairDStreamFunctions[K, V

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-11-07 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9256#discussion_r44213622 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/EmittedRecordsDStream.scala --- @@ -0,0 +1,114 @@ +/* + * Licensed

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-11-07 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9256#discussion_r44214012 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/StateSpec.scala --- @@ -0,0 +1,196 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-11-07 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9256#discussion_r44213899 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/State.scala --- @@ -0,0 +1,199 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-11-07 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9256#discussion_r44213890 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/State.scala --- @@ -0,0 +1,199 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-11-07 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9256#discussion_r44213932 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/State.scala --- @@ -0,0 +1,199 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-11-07 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9256#discussion_r44213921 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/State.scala --- @@ -0,0 +1,199 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-11-07 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9256#discussion_r44214006 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/StateSpec.scala --- @@ -0,0 +1,181 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-11-07 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9256#discussion_r44214008 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/StateSpec.scala --- @@ -0,0 +1,196 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-11-07 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9256#discussion_r44214064 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/rdd/TrackStateRDD.scala --- @@ -0,0 +1,190 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-11-07 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9256#discussion_r44214022 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/EmittedRecordsDStream.scala --- @@ -0,0 +1,114 @@ +/* + * Licensed

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-11-07 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9256#discussion_r44214016 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/EmittedRecordsDStream.scala --- @@ -0,0 +1,114 @@ +/* + * Licensed

[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

2015-11-02 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9415#discussion_r43712005 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/DatasetWordCount.scala --- @@ -0,0 +1,42 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-11458][SQL] add word count example for ...

2015-11-02 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9415#discussion_r43712012 --- Diff: examples/src/main/scala/org/apache/spark/examples/sql/DatasetWordCount.scala --- @@ -0,0 +1,42 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-11112] DAG visualization: display RDD c...

2015-11-02 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/9398#issuecomment-153226208 Thanks for adding this! The UI itself looks good to me.

[GitHub] spark pull request: [SPARK-8029][core][wip] first successful shuff...

2015-10-26 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/9214#issuecomment-151209206 Maybe just go for version 2) above then, it seems like the simplest one. Regarding re-engineering vs not, the problem is that if you're trying to do a bug fix

[GitHub] spark pull request: [SPARK-8029][core][wip] first successful shuff...

2015-10-23 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/9214#issuecomment-150708764 Hey so I'm curious about two things here: 1) If we just always replaced the output with a new one using a file rename, would we actually have a problem? I think

[GitHub] spark pull request: [SPARK-11256] Mark all Stage/ResultStage/Shuff...

2015-10-22 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9219#discussion_r42764498 --- Diff: core/src/main/scala/org/apache/spark/scheduler/ShuffleMapStage.scala --- @@ -43,35 +43,53 @@ private[spark] class ShuffleMapStage( val

[GitHub] spark pull request: [SPARK-11116] [SQL] First Draft of Dataset API

2015-10-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9190#discussion_r42702840 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala --- @@ -0,0 +1,132 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-11116] [SQL] First Draft of Dataset API

2015-10-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9190#discussion_r42702889 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/Encoder.scala --- @@ -31,6 +31,7 @@ import org.apache.spark.sql.types.StructType

[GitHub] spark pull request: [SPARK-11116] [SQL] First Draft of Dataset API

2015-10-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9190#discussion_r42702980 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/Encoder.scala --- @@ -46,13 +47,27 @@ trait Encoder[T

[GitHub] spark pull request: [SPARK-11116] [SQL] First Draft of Dataset API

2015-10-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9190#discussion_r42703300 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/Encoder.scala --- @@ -46,13 +47,27 @@ trait Encoder[T

[GitHub] spark pull request: Minor cleanup of ShuffleMapStage.outputLocs co...

2015-10-20 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/9175#discussion_r42573604 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -353,10 +353,15 @@ class DAGScheduler

[GitHub] spark pull request: [SPARK-8029][core] shuffleoutput per attempt

2015-10-12 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/6648#issuecomment-147553320 BTW, with that design, I also wouldn't even implement the delete message in the first patch, unless we've actually seen block corruptions happen; but it sounds like we

[GitHub] spark pull request: [SPARK-8029][core] shuffleoutput per attempt

2015-10-12 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/6648#issuecomment-147552582 Hey Imran, Given the number of changes required for this approach, I wonder whether an atomic rename design wouldn't be simpler (in particular, the "

[GitHub] spark pull request: [SPARK-9852] Let reduce tasks fetch multiple m...

2015-09-24 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8844#issuecomment-143113277 Alright, merged this, thanks.

[GitHub] spark pull request: [SPARK-9852] Let reduce tasks fetch multiple m...

2015-09-23 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8844#issuecomment-142703433 Alright, I made the suggested changes. I don't think we need to make those classes `private[spark]` because they are in `src/test`, right?

[GitHub] spark pull request: [SPARK-9852] Let reduce tasks fetch multiple m...

2015-09-23 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8844#issuecomment-142802670 Alright, let me know if you guys have any other comments.

[GitHub] spark pull request: [SPARK-9852] Let reduce tasks fetch multiple m...

2015-09-22 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/8844#discussion_r40170084 --- Diff: core/src/test/scala/org/apache/spark/scheduler/CustomShuffledRDD.scala --- @@ -0,0 +1,111 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-9852] Let reduce tasks fetch multiple m...

2015-09-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/8844#discussion_r40026880 --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala --- @@ -323,6 +351,30 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf

[GitHub] spark pull request: [SPARK-9852] Let reduce tasks fetch multiple m...

2015-09-20 Thread mateiz
GitHub user mateiz opened a pull request: https://github.com/apache/spark/pull/8844 [SPARK-9852] Let reduce tasks fetch multiple map output partitions This makes two changes: - Allow reduce tasks to fetch multiple map output partitions -- this is a pretty small change

[GitHub] spark pull request: [SPARK-9852] Let reduce tasks fetch multiple m...

2015-09-20 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8844#issuecomment-141841616 @shivaram, @JoshRosen, @zsxwing this may be relevant to you

[GitHub] spark pull request: [SPARK-9852] Let reduce tasks fetch multiple m...

2015-09-20 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/8844#discussion_r39935722 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -474,9 +495,9 @@ class DAGSchedulerSuite test("

[GitHub] spark pull request: [SPARK-10704] Consolidate HashShuffleReader an...

2015-09-19 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8825#issuecomment-141693510 BTW another thing you should consider is just renaming HashShuffleReader to BlockStoreShuffleReader and still leaving in the abstract interface. The interface

[GitHub] spark pull request: [SPARK-10704] Consolidate HashShuffleReader an...

2015-09-19 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8825#issuecomment-141688454 By the way, the reason it took contiguous partition IDs was to make them cheap to read from disk in one read. So I'd like to try keeping it like that before we decide

[GitHub] spark pull request: [SPARK-10704] Consolidate HashShuffleReader an...

2015-09-19 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8825#issuecomment-141688039 I already have a patch for the range of partitions, so please leave that in. https://github.com/mateiz/spark/tree/spark-9852

[GitHub] spark pull request: [SPARK-9851] Support submitting map stages ind...

2015-09-13 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/8180#discussion_r39352275 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -720,31 +843,82 @@ class DAGScheduler( try { // New

[GitHub] spark pull request: [SPARK-9851] Support submitting map stages ind...

2015-09-13 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8180#issuecomment-139918503 Thanks for the comments; I've made the fixes. Let me know if anyone else has other comments.

[GitHub] spark pull request: [SPARK-9851] Support submitting map stages ind...

2015-09-07 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8180#issuecomment-138311887 @zsxwing / @squito can you take a second look at this?

[GitHub] spark pull request: [SPARK-10192] [core] simple test w/ failure in...

2015-09-04 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8402#issuecomment-137825173 LGTM

[GitHub] spark pull request: [SPARK-10192] [core] simple test w/ failure in...

2015-09-04 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8402#issuecomment-137825995 Although this seems to have failed another test?

[GitHub] spark pull request: [SPARK-9851] Support submitting map stages ind...

2015-09-04 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8180#issuecomment-137838489 Alright, I think this is ready to review now. Changes made: - Added more docs to DAGScheduler about how stages may be re-attempted - Added tests on: - More

[GitHub] spark pull request: [SPARK-10192] [core] simple test w/ failure in...

2015-09-04 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8402#issuecomment-137882457 retest this please

[GitHub] spark pull request: [SPARK-9851] Support submitting map stages ind...

2015-09-04 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8180#issuecomment-137887534 retest this please

[GitHub] spark pull request: [SPARK-9851] Support submitting map stages ind...

2015-09-03 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8180#issuecomment-137460863 BTW, I'm working on updating this with a few more tests as suggested as well.

[GitHub] spark pull request: [SPARK-9851] Support submitting map stages ind...

2015-09-03 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8180#issuecomment-137459384 Before deciding whether it's a big change, do also take a look at the change. As I said, it's only about 100-200 lines of actual changes, the rest is comments

[GitHub] spark pull request: [SPARK-10248] [core] track exceptions in dagsc...

2015-09-02 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8466#issuecomment-137197353 Hey, so is the conclusion that the DAGScheduler actually did pass the exception to JobListeners, but we weren't listening for it in our test suite? I thought the initial

[GitHub] spark pull request: [SPARK-5259][CORE] don't submit stage until it...

2015-09-02 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/7699#issuecomment-137196052 Thanks, this makes sense. Anyway this PR looks good to me.

[GitHub] spark pull request: [SPARK-9851] Support submitting map stages ind...

2015-09-02 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/8180#discussion_r38587515 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -746,6 +848,63 @@ class DAGScheduler( submitWaitingStages

[GitHub] spark pull request: [SPARK-5259][CORE] don't submit stage until it...

2015-08-28 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/7699#issuecomment-135802458 Regarding the wider class of problems, I just meant that the core problem here seems to be that tasks don't get identified correctly. This also seems to affect other

[GitHub] spark pull request: [SPARK-5259][CORE] don't submit stage until it...

2015-08-25 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/7699#issuecomment-134739986 This looks good to me too. I agree it's better to use .length instead of .size now that IntelliJ complains about it (it used not to).

[GitHub] spark pull request: [SPARK-5259][CORE] don't submit stage until it...

2015-08-25 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/7699#issuecomment-134740504 BTW I'd rename this JIRA or at least expand the PR description to say track pending tasks by partition ID instead of Task objects. Otherwise it really doesn't explain

[GitHub] spark pull request: [SPARK-5259][CORE] don't submit stage until it...

2015-08-25 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/7699#discussion_r37919314 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -695,6 +696,115 @@ class DAGSchedulerSuite

[GitHub] spark pull request: [SPARK-5259][CORE] don't submit stage until it...

2015-08-25 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/7699#discussion_r37919381 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -695,6 +696,115 @@ class DAGSchedulerSuite

[GitHub] spark pull request: [SPARK-9851] Support submitting map stages ind...

2015-08-21 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8180#issuecomment-133581945 Hey Imran, I'm curious, have you actually worked on stuff in the scheduler? I don't know what you mean about inability to deal with complexity in it, but it has gotten

[GitHub] spark pull request: [SPARK-9851] Support submitting map stages ind...

2015-08-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/8180#discussion_r37688430 --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala --- @@ -132,13 +133,46 @@ private[spark] abstract class MapOutputTracker(conf: SparkConf

[GitHub] spark pull request: [SPARK-10087] [CORE] Disable spark.shuffle.red...

2015-08-19 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8280#issuecomment-132799456 @shivaram did you create a JIRA for making this affect only ShuffledRDD? I might do it as part of https://issues.apache.org/jira/browse/SPARK-9852, which I'm working

[GitHub] spark pull request: [SPARK-10087] [CORE] Disable spark.shuffle.red...

2015-08-18 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8280#issuecomment-132394991 It does sound good to turn it off if there are multiple dependencies. However, an even better solution may be to move this into ShuffledRDD, so that we control where

[GitHub] spark pull request: [SPARK-10087] [CORE] Disable spark.shuffle.red...

2015-08-18 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8280#issuecomment-132395677 BTW it may also be fine to turn it off by default for 1.5, but in general, with these things, there's not much point having them in the code if they're off by default

[GitHub] spark pull request: [SPARK-10008] Ensure shuffle locality doesn't ...

2015-08-15 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8220#issuecomment-131422078 Sounds good.. I'll merge it once tests pass.

[GitHub] spark pull request: [SPARK-10008] Ensure shuffle locality doesn't ...

2015-08-14 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8220#issuecomment-131281158 @shivaram here it is.. we should merge this into branch-1.5 too if it's good.

[GitHub] spark pull request: [SPARK-10008] Ensure shuffle locality doesn't ...

2015-08-14 Thread mateiz
GitHub user mateiz opened a pull request: https://github.com/apache/spark/pull/8220 [SPARK-10008] Ensure shuffle locality doesn't take precedence over narrow deps The shuffle locality patch made the DAGScheduler aware of shuffle data, but for RDDs that have both narrow

[GitHub] spark pull request: [SPARK-9851] Support submitting map stages ind...

2015-08-13 Thread mateiz
GitHub user mateiz opened a pull request: https://github.com/apache/spark/pull/8180 [SPARK-9851] Support submitting map stages individually in DAGScheduler This patch adds support for submitting map stages in a DAG individually so that we can make downstream decisions after seeing

[GitHub] spark pull request: [SPARK-9851] Support submitting map stages ind...

2015-08-13 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8180#issuecomment-130871144 retest this please

[GitHub] spark pull request: [SPARK-9851] Support submitting map stages ind...

2015-08-13 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8180#issuecomment-130968746 retest this please

[GitHub] spark pull request: [SPARK-9303] Decimal should use java.math.Deci...

2015-08-10 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/8018#discussion_r36688369 --- Diff: unsafe/src/main/java/org/apache/spark/unsafe/PlatformDependent.java --- @@ -145,21 +147,27 @@ public static void freeMemory(long address

[GitHub] spark pull request: [SPARK-9303] Decimal should use java.math.Deci...

2015-08-10 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/8018#issuecomment-129619741 I didn't realize that Java's BigDecimal already has a shortcut for things that fit in a Long. That definitely simplifies it. In terms of this change, the biggest thing

[GitHub] spark pull request: [SPARK-9394][SQL] Handle parentheses in CodeFo...

2015-07-28 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/7712#issuecomment-125482355 LGTM

[GitHub] spark pull request: [SPARK-9394][SQL] Handle parentheses in CodeFo...

2015-07-27 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/7712#discussion_r35614193 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeFormatter.scala --- @@ -35,11 +34,12 @@ private class
