[jira] [Resolved] (SPARK-4108) Fix uses of @deprecated in catalyst dataTypes
[ https://issues.apache.org/jira/browse/SPARK-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-4108.
-------------------------------------
Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2970: https://github.com/apache/spark/pull/2970

Fix uses of @deprecated in catalyst dataTypes
---------------------------------------------
Key: SPARK-4108
URL: https://issues.apache.org/jira/browse/SPARK-4108
Project: Spark
Issue Type: Task
Components: SQL
Reporter: Anant Daksh Asthana
Priority: Trivial
Fix For: 1.2.0

@deprecated takes two parameters, message and version, but sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala has a usage of @deprecated with just one parameter.
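For reference, a minimal sketch of the two-argument form of Scala's @deprecated annotation (method names here are hypothetical, not from dataTypes.scala): the first argument is the deprecation message, the second the version the deprecation applies from.

{code}
object Example {
  // Two-argument form: message plus the version the symbol was deprecated in.
  @deprecated("use newMethod instead", "1.2.0")
  def oldMethod(): Unit = newMethod()

  def newMethod(): Unit = ()
}
{code}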
[jira] [Commented] (SPARK-4079) Snappy bundled with Spark does not work on older Linux distributions
[ https://issues.apache.org/jira/browse/SPARK-4079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191457#comment-14191457 ]

Patrick Wendell commented on SPARK-4079:
----------------------------------------

Yeah, that sounds like a good call. Did you want to do this?

Snappy bundled with Spark does not work on older Linux distributions
--------------------------------------------------------------------
Key: SPARK-4079
URL: https://issues.apache.org/jira/browse/SPARK-4079
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.0.0
Reporter: Marcelo Vanzin

This issue has existed at least since 1.0, but has been made worse by 1.1, since Snappy is now the default compression algorithm. When trying to use it on a CentOS 5 machine, for example, you'll get something like this:

{noformat}
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:319)
        at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:226)
        at org.xerial.snappy.Snappy.<clinit>(Snappy.java:48)
        at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:79)
        at org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:125)
        at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:207)
        ...
Caused by: java.lang.UnsatisfiedLinkError: /tmp/snappy-1.0.5.3-af72bf3c-9dab-43af-a662-f9af657f06b1-libsnappyjava.so: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.9' not found (required by /tmp/snappy-1.0.5.3-af72bf3c-9dab-43af-a662-f9af657f06b1-libsnappyjava.so)
        at java.lang.ClassLoader$NativeLibrary.load(Native Method)
        at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1957)
        at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1882)
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1843)
        at java.lang.Runtime.load0(Runtime.java:795)
        at java.lang.System.load(System.java:1061)
        at org.xerial.snappy.SnappyNativeLoader.load(SnappyNativeLoader.java:39)
        ... 29 more
{noformat}

There are two approaches I can see here (well, 3):

* Declare CentOS 5 (and similar OSes) not supported, although that would suck for the people who are still on it and already use Spark
* Fall back to another compression codec if Snappy cannot be loaded
* Ask the Snappy guys to compile the library on an older OS...

I think the second would be the best compromise.
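A minimal sketch of the second option, illustrative only and not Spark's actual CompressionCodec code (the helper name `chooseCodec` is hypothetical): probe whether Snappy's native library loads, and fall back to the pure-JVM LZF codec if it does not.

{code}
import org.apache.spark.SparkConf

object CodecFallback {
  // Hypothetical helper: pick a codec name, falling back if the Snappy
  // native library cannot be extracted/loaded on this platform.
  def chooseCodec(conf: SparkConf): String =
    try {
      // Forcing class initialization triggers the native library load.
      Class.forName("org.xerial.snappy.Snappy")
      conf.get("spark.io.compression.codec", "snappy")
    } catch {
      case _: UnsatisfiedLinkError | _: ExceptionInInitializerError |
           _: ClassNotFoundException =>
        "lzf" // pure-JVM codec that works on older distributions
    }
}
{code}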
[jira] [Commented] (SPARK-3987) NNLS generates incorrect result
[ https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14191459#comment-14191459 ] Xiangrui Meng commented on SPARK-3987: -- Please check the condition number of the matrix you sent. Did you run ALS with a very small lambda? NNLS generates incorrect result --- Key: SPARK-3987 URL: https://issues.apache.org/jira/browse/SPARK-3987 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Debasish Das Assignee: Shuo Xiang Fix For: 1.1.1, 1.2.0 Hi, Please see the example gram matrix and linear term: val P2 = new DoubleMatrix(20, 20, 333907.312770, -60814.043975, 207935.829941, -162881.367739, -43730.396770, 17511.428983, -243340.496449, -225245.957922, 104700.445881, 32430.845099, 336378.693135, -373497.970207, -41147.159621, 53928.060360, -293517.883778, 53105.278068, 0.00, -85257.781696, 84913.970469, -10584.080103, -60814.043975, 13826.806664, -38032.612640, 33475.833875, 10791.916809, -1040.950810, 48106.552472, 45390.073380, -16310.282190, -2861.455903, -60790.833191, 73109.516544, 9826.614644, -8283.992464, 56991.742991, -6171.366034, 0.00, 19152.382499, -13218.721710, 2793.734234, 207935.829941, -38032.612640, 129661.677608, -101682.098412, -27401.299347, 10787.713362, -151803.006149, -140563.601672, 65067.935324, 20031.263383, 209521.268600, -232958.054688, -25764.179034, 33507.951918, -183046.845592, 32884.782835, 0.00, -53315.811196, 52770.762546, -6642.187643, -162881.367739, 33475.833875, -101682.098412, 85094.407608, 25422.850782, -5437.646141, 124197.166330, 116206.265909, -47093.484134, -11420.168521, -163429.436848, 189574.783900, 23447.172314, -24087.375367, 148311.355507, -20848.385466, 0.00, 46835.814559, -38180.352878, 6415.873901, -43730.396770, 10791.916809, -27401.299347, 25422.850782, 8882.869799, 15.638084, 35933.473986, 34186.371325, -10745.330690, -974.314375, -43537.709621, 54371.010558, 7894.453004, -5408.929644, 42231.381747, -3192.010574, 0.00, 15058.753110, -8704.757256, 2316.581535, 17511.428983, -1040.950810, 10787.713362, -5437.646141, 15.638084, 2794.949847, -9681.950987, -8258.171646, 7754.358930, 4193.359412, 18052.143842, -15456.096769, -253.356253, 4089.672804, -12524.380088, 5651.579348, 0.00, -1513.302547, 6296.461898, 152.427321, -243340.496449, 48106.552472, -151803.006149, 124197.166330, 35933.473986, -9681.950987, 182931.600236, 170454.352953, -72361.174145, -19270.461728, -244518.179729, 279551.060579, 33340.452802, -37103.267653, 219025.288975, -33687.141423, 0.00, 67347.950443, -58673.009647, 8957.800259, -225245.957922, 45390.073380, -140563.601672, 116206.265909, 34186.371325, -8258.171646, 170454.352953, 159322.942894, -66074.960534, -16839.743193, -226173.967766, 260421.044094, 31624.194003, -33839.612565, 203889.695169, -30034.828909, 0.00, 63525.040745, -53572.741748, 8575.071847, 104700.445881, -16310.282190, 65067.935324, -47093.484134, -10745.330690, 7754.358930, -72361.174145, -66074.960534, 35869.598076, 13378.653317, 106033.647837, -111831.682883, -10455.465743, 18537.392481, -88370.612394, 20344.288488, 0.00, -22935.482766, 29004.543704, -2409.461759, 32430.845099, -2861.455903, 20031.263383, -11420.168521, -974.314375, 4193.359412, -19270.461728, -16839.743193, 13378.653317, 6802.081898, 33256.395091, -30421.985199, -1296.785870, 7026.518692, -24443.378205, 9221.982599, 0.00, -4088.076871, 10861.014242, -25.092938, 336378.693135, -60790.833191, 209521.268600, -163429.436848, -43537.709621, 18052.143842, -244518.179729, 
-226173.967766, 106033.647837, 33256.395091, 339200.268106, -375442.716811, -41027.594509, 54636.778527, -295133.248586, 54177.278365, 0.00, -85237.666701, 85996.957056, -10503.209968, -373497.970207, 73109.516544, -232958.054688, 189574.783900, 54371.010558, -15456.096769, 279551.060579, 260421.044094, -111831.682883, -30421.985199, -375442.716811, 427793.208465, 50528.074431, -57375.986301, 335203.382015, -52676.385869, 0.00, 102368.307670, -90679.792485, 13509.390393, -41147.159621, 9826.614644, -25764.179034, 23447.172314, 7894.453004, -253.356253, 33340.452802, 31624.194003, -10455.465743, -1296.785870, -41027.594509, 50528.074431, 7255.977434, -5281.636812, 39298.355527, -3440.450858, 0.00, 13717.870243, -8471.405582, 2071.812204, 53928.060360, -8283.992464, 33507.951918, -24087.375367, -5408.929644, 4089.672804, -37103.267653, -33839.612565, 18537.392481, 7026.518692, 54636.778527, -57375.986301, -5281.636812, 9735.061160, -45360.674033, 10634.633559, 0.00, -11652.364691, 15039.566630, -1202.539106, -293517.883778, 56991.742991, -183046.845592, 148311.355507,
[jira] [Updated] (SPARK-4164) spark.kryo.registrator shall use comma separated value to support multiple registrator
[ https://issues.apache.org/jira/browse/SPARK-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jarred Li updated SPARK-4164:
-----------------------------
Remaining Estimate: 2h
Original Estimate: 2h

spark.kryo.registrator shall use comma separated value to support multiple registrator
---------------------------------------------------------------------------------------
Key: SPARK-4164
URL: https://issues.apache.org/jira/browse/SPARK-4164
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 1.1.0
Reporter: Jarred Li
Original Estimate: 2h
Remaining Estimate: 2h

Currently, spark.kryo.registrator supports only one registrator class, for example: conf.set("spark.kryo.registrator", "org.apache.spark.graphx.GraphRegistrator"). If there is also a user-defined registrator class, it cannot be registered. To improve this, the code in KryoSerializer can be changed to support multiple classes joined by a separator (for example, a comma).
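A minimal sketch of the proposed change (the helper `registerAll` is hypothetical; only KryoRegistrator.registerClasses is from the Spark API): split the property value on commas and apply each registrator in turn.

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

object MultiRegistrator {
  // Hypothetical helper sketching what KryoSerializer could do with a
  // comma-separated spark.kryo.registrator value.
  def registerAll(kryo: Kryo, registratorNames: String): Unit =
    registratorNames.split(',').map(_.trim).filter(_.nonEmpty).foreach { name =>
      val registrator =
        Class.forName(name).newInstance().asInstanceOf[KryoRegistrator]
      registrator.registerClasses(kryo)
    }
}
{code}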
[jira] [Commented] (SPARK-4164) spark.kryo.registrator shall use comma separated value to support multiple registrator
[ https://issues.apache.org/jira/browse/SPARK-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191475#comment-14191475 ]

Jarred Li commented on SPARK-4164:
----------------------------------

I can work on this issue. Could somebody assign this issue to me? Thanks!

spark.kryo.registrator shall use comma separated value to support multiple registrator
---------------------------------------------------------------------------------------
Key: SPARK-4164
URL: https://issues.apache.org/jira/browse/SPARK-4164
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 1.1.0
Reporter: Jarred Li
[jira] [Created] (SPARK-4165) Actor with Companion throws ambiguous reference error in REPL
Shiti Saxena created SPARK-4165:
-----------------------------------

Summary: Actor with Companion throws ambiguous reference error in REPL
Key: SPARK-4165
URL: https://issues.apache.org/jira/browse/SPARK-4165
Project: Spark
Issue Type: Bug
Reporter: Shiti Saxena

Tried the following in the master branch REPL.

{noformat}
Spark context available as sc.

scala> import akka.actor.{Actor,Props}
import akka.actor.{Actor, Props}

scala> :pas
// Entering paste mode (ctrl-D to finish)

class EchoActor extends Actor {
  override def receive = {
    case message => sender ! message
  }
}

object EchoActor {
  def props: Props = Props(new EchoActor())
}

// Exiting paste mode, now interpreting.

defined class EchoActor
defined module EchoActor

scala> EchoActor.props
<console>:15: error: reference to EchoActor is ambiguous;
it is imported twice in the same scope by
import $VAL1.EchoActor
and import INSTANCE.EchoActor
       EchoActor.props
{noformat}
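A commonly suggested workaround, sketched here on a hedged basis (it is not from the ticket): nest the class and its companion inside an enclosing object, so the REPL imports only the wrapper and the companion pair is never imported twice.

{code}
import akka.actor.{Actor, Props}

// Pasting this single wrapper may avoid the ambiguous-reference error,
// at the cost of referring to Wrapper.EchoActor everywhere.
object Wrapper {
  class EchoActor extends Actor {
    override def receive = {
      case message => sender ! message
    }
  }

  object EchoActor {
    def props: Props = Props(new EchoActor())
  }
}
{code}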
[jira] [Commented] (SPARK-3987) NNLS generates incorrect result
[ https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191502#comment-14191502 ]

Debasish Das commented on SPARK-3987:
--------------------------------------

Nope...standard ALS...same as netflix params...0.065 as L2...My ratings are not within 1-5 but more like 1-10... Also, what's a good condition number for NNLS?

NNLS generates incorrect result
-------------------------------
Key: SPARK-3987
URL: https://issues.apache.org/jira/browse/SPARK-3987
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 1.1.0
Reporter: Debasish Das
Assignee: Shuo Xiang
Fix For: 1.1.1, 1.2.0
[jira] [Created] (SPARK-4166) Display the executor ID in the Web UI when ExecutorLostFailure happens
Shixiong Zhu created SPARK-4166:
-----------------------------------

Summary: Display the executor ID in the Web UI when ExecutorLostFailure happens
Key: SPARK-4166
URL: https://issues.apache.org/jira/browse/SPARK-4166
Project: Spark
Issue Type: Improvement
Components: Spark Core, Web UI
Affects Versions: 1.1.0
Reporter: Shixiong Zhu
Priority: Minor

Now when ExecutorLostFailure happens, the UI only displays "ExecutorLostFailure (executor lost)", without identifying which executor was lost.
[jira] [Resolved] (SPARK-4143) Move inner class DeferredObjectAdapter to top level
[ https://issues.apache.org/jira/browse/SPARK-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-4143.
-------------------------------------
Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 3007: https://github.com/apache/spark/pull/3007

Move inner class DeferredObjectAdapter to top level
---------------------------------------------------
Key: SPARK-4143
URL: https://issues.apache.org/jira/browse/SPARK-4143
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Trivial
Fix For: 1.2.0

DeferredObjectAdapter is an inner class of HiveGenericUdf, which may cause some overhead in closure ser/de-ser. Move it to top level.
[jira] [Created] (SPARK-4167) Schedule task on Executor will be Imbalance while task run less than local-wait time
SuYan created SPARK-4167:
----------------------------

Summary: Schedule task on Executor will be Imbalance while task run less than local-wait time
Key: SPARK-4167
URL: https://issues.apache.org/jira/browse/SPARK-4167
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 1.1.0
Reporter: SuYan

Recently, a Spark-on-YARN job of ours showed imbalanced executor scheduling. The sequence of events:

1. Due to a user mistake, the job's input contained 0-byte empty splits:
   1.1: tasks 0-99 are no-preference tasks (0 bytes); tasks 100-800 are node-local tasks
   1.2: the user runs the tasks for 500 loops
   1.3: there are 60 executors

2. Executor A got only 2 node-local tasks in the first loop, so it finished its node-local tasks first and then ran the no-preference tasks, which in our situation have smaller input splits than the node-local tasks. So executor A finished all the no-preference tasks while the others were still running node-local tasks.

In the second loop, every task is at process-local level and finishes within 3 seconds, so the other executors finish all their process-local tasks while executor A is still running its own. Because every task executor A runs also finishes within 3 seconds, the locality level always stays process-local, and the other executors all end up waiting for executor A. The same happens in the remaining loops.

To resolve this situation we had the user delete the empty input splits, but an implicit imbalance remains: if in some loop one executor gets more process-local tasks than the others, and those tasks all finish in under 3 seconds, then in the remaining loops the other executors will wait for that executor to finish all of its process-local tasks.
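One knob relevant to the behavior described above, sketched on a hedged basis (this is not a fix proposed in the ticket, and the values are illustrative): lowering spark.locality.wait shortens how long the scheduler holds tasks for a better-locality slot before relaxing the level, which can reduce this kind of waiting.

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Default is 3000 (ms); tasks that finish faster than this keep the
  // scheduler pinned at process-local, as described above.
  .set("spark.locality.wait", "500")
  .set("spark.locality.wait.process", "500")
{code}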
[jira] [Commented] (SPARK-4166) Display the executor ID in the Web UI when ExecutorLostFailure happens
[ https://issues.apache.org/jira/browse/SPARK-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191519#comment-14191519 ]

Apache Spark commented on SPARK-4166:
--------------------------------------

User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/3033

Display the executor ID in the Web UI when ExecutorLostFailure happens
----------------------------------------------------------------------
Key: SPARK-4166
URL: https://issues.apache.org/jira/browse/SPARK-4166
Project: Spark
Issue Type: Improvement
Components: Spark Core, Web UI
Affects Versions: 1.1.0
Reporter: Shixiong Zhu
Priority: Minor
[jira] [Updated] (SPARK-4167) Schedule task on Executor will be Imbalance while task run less than local-wait time
[ https://issues.apache.org/jira/browse/SPARK-4167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

SuYan updated SPARK-4167:
-------------------------

Schedule task on Executor will be Imbalance while task run less than local-wait time
-------------------------------------------------------------------------------------
Key: SPARK-4167
URL: https://issues.apache.org/jira/browse/SPARK-4167
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 1.1.0
Reporter: SuYan
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191551#comment-14191551 ]

Debasish Das commented on SPARK-2426:
--------------------------------------

[~mengxr] The matlab comparison scripts are open sourced over here:
https://github.com/debasish83/ecos/blob/master/matlab/admm/qprandom.m
https://github.com/debasish83/ecos/blob/master/matlab/pdco4/code/pdcotestQP.m

The detailed comparisons are in the README.md; please look at the section on Matlab comparisons. In a nutshell: for bounds, MOSEK and ADMM are similar; for elastic net, Proximal is 10X faster compared to MOSEK; for equality, MOSEK is 2-3X faster than Proximal, but both PDCO and ECOS produce much worse results compared to ADMM. Accelerated ADMM also did not work as well as default ADMM. Increasing the over-relaxation parameter helped accelerated ADMM, but I have not explored it yet. ADMM and PDCO are in Matlab, but ECOS and MOSEK both use mex files, so they are expected to be more efficient.

Next I will add the performance results of running positivity, box, sparse coding / regularized lsi and robust-plsa on the MovieLens dataset and validate product recommendation using the MAP measure...In terms of RMSE, default positive sparse coding...

What's the largest dataset the LDA PRs are running? I would like to try that on sparse coding as well...From these papers, sparse coding/RLSI should give results at par with LDA:
https://www.cs.cmu.edu/~xichen/images/SLSA-sdm11-final.pdf
http://web.stanford.edu/group/mmds/slides2012/s-hli.pdf

The same randomized matrices can be generated and run in the PR as follows:

./bin/spark-class org.apache.spark.mllib.optimization.QuadraticMinimizer 1000 1 1.0 0.99

rank=1000, equality=1.0, lambda=1.0, beta=0.99
L1regularization = lambda*beta
L2regularization = lambda*(1-beta)

Generating randomized QPs with rank 1000 equalities 1
sparseQp 88.423 ms iterations 45 converged true
posQp 181.369 ms iterations 121 converged true
boundsQp 175.733 ms iterations 121 converged true
Qp Equality 2805.564 ms iterations 2230 converged true

Quadratic Minimization for MLlib ALS
------------------------------------
Key: SPARK-2426
URL: https://issues.apache.org/jira/browse/SPARK-2426
Project: Spark
Issue Type: New Feature
Components: MLlib
Affects Versions: 1.0.0
Reporter: Debasish Das
Assignee: Debasish Das
Original Estimate: 504h
Remaining Estimate: 504h

Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems:
1. ALS with bounds
2. ALS with L1 regularization
3. ALS with equality constraint and bounds

Initial runtime comparisons are presented at Spark Summit:
http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark

Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results.

For integration the detailed plan is as follows:
1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
2. Integrate QuadraticMinimizer in mllib ALS
[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191551#comment-14191551 ]

Debasish Das edited comment on SPARK-2426 at 10/31/14 8:04 AM:
---------------------------------------------------------------

[~mengxr] The matlab comparison scripts are open sourced over here:
https://github.com/debasish83/ecos/blob/master/matlab/admm/qprandom.m
https://github.com/debasish83/ecos/blob/master/matlab/pdco4/code/pdcotestQP.m

The detailed comparisons are in the README.md; please look at the section on Matlab comparisons. In a nutshell: for bounds, MOSEK and ADMM are similar; for elastic net, Proximal is 10X faster compared to MOSEK; for equality, MOSEK is 2-3X faster than Proximal, but both PDCO and ECOS produce much worse results compared to ADMM. Accelerated ADMM also did not work as well as default ADMM. Increasing the over-relaxation parameter helps ADMM, but I have not explored it yet. ADMM and PDCO are in Matlab, but ECOS and MOSEK both use mex files, so they are expected to be more efficient.

Next I will add the performance results of running positivity, box, sparse coding / regularized lsi and robust-plsa on the MovieLens dataset and validate product recommendation using the MAP measure...In terms of RMSE, default positive sparse coding...

What's the largest dataset the LDA PRs are running? I would like to try that on sparse coding as well...From these papers, sparse coding/RLSI should give results at par with LDA:
https://www.cs.cmu.edu/~xichen/images/SLSA-sdm11-final.pdf
http://web.stanford.edu/group/mmds/slides2012/s-hli.pdf

The same randomized matrices can be generated and run in the PR as follows:

./bin/spark-class org.apache.spark.mllib.optimization.QuadraticMinimizer 1000 1 1.0 0.99

rank=1000, equality=1.0, lambda=1.0, beta=0.99
L1regularization = lambda*beta
L2regularization = lambda*(1-beta)

Generating randomized QPs with rank 1000 equalities 1
sparseQp 88.423 ms iterations 45 converged true
posQp 181.369 ms iterations 121 converged true
boundsQp 175.733 ms iterations 121 converged true
Qp Equality 2805.564 ms iterations 2230 converged true

Quadratic Minimization for MLlib ALS
------------------------------------
Key: SPARK-2426
URL: https://issues.apache.org/jira/browse/SPARK-2426
Project: Spark
Issue Type: New Feature
Components: MLlib
Affects Versions: 1.0.0
Reporter: Debasish Das
Assignee: Debasish Das
Original Estimate: 504h
Remaining Estimate: 504h
[jira] [Resolved] (SPARK-4162) Make scripts symlinkable
[ https://issues.apache.org/jira/browse/SPARK-4162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-4162.
------------------------------
Resolution: Duplicate

Duplicate of https://issues.apache.org/jira/browse/SPARK-3482 and https://issues.apache.org/jira/browse/SPARK-2960. Have a look at the PR for 3482 and suggest changes. This has come up several times, so it would be good to get it fixed.

Make scripts symlinkable
------------------------
Key: SPARK-4162
URL: https://issues.apache.org/jira/browse/SPARK-4162
Project: Spark
Issue Type: Improvement
Components: Deploy, EC2, Spark Shell
Affects Versions: 1.1.0
Environment: Mac, linux
Reporter: Shay Seng

Scripts are not symlink-able because they all use:

FWDIR=$(cd `dirname $0`/..; pwd)

to detect the parent Spark dir, which doesn't take symlinks into account. Instead, replace the above line with:

SOURCE=$0
SCRIPT=`basename "$SOURCE"`
while [ -h "$SOURCE" ]; do
  SCRIPT=`basename "$SOURCE"`
  LOOKUP=`ls -ld "$SOURCE"`
  TARGET=`expr "$LOOKUP" : '.*-> \(.*\)$'`
  if expr "${TARGET:-.}/" : '/.*/$' > /dev/null; then
    SOURCE=${TARGET:-.}
  else
    SOURCE=`dirname "$SOURCE"`/${TARGET:-.}
  fi
done
FWDIR=$(cd `dirname "$SOURCE"`/..; pwd)
[jira] [Closed] (SPARK-4167) Schedule task on Executor will be Imbalance while task run less than local-wait time
[ https://issues.apache.org/jira/browse/SPARK-4167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

SuYan closed SPARK-4167.
------------------------
Resolution: Not a Problem

Schedule task on Executor will be Imbalance while task run less than local-wait time
-------------------------------------------------------------------------------------
Key: SPARK-4167
URL: https://issues.apache.org/jira/browse/SPARK-4167
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 1.1.0
Reporter: SuYan
[jira] [Created] (SPARK-4168) Completed Stages Number are misleading webUI when stages are more than 1000
Zhang, Liye created SPARK-4168:
----------------------------------

Summary: Completed Stages Number are misleading webUI when stages are more than 1000
Key: SPARK-4168
URL: https://issues.apache.org/jira/browse/SPARK-4168
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 1.1.0
Reporter: Zhang, Liye

The number of completed stages and failed stages shown on the webUI will always be less than 1000. This is really misleading when thousands of stages have already completed or failed. The number should be correct even when only part of all stages are listed on the webUI (stage info is removed when the number grows too large).
[jira] [Commented] (SPARK-4168) Completed Stages Number are misleading webUI when stages are more than 1000
[ https://issues.apache.org/jira/browse/SPARK-4168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191643#comment-14191643 ]

Apache Spark commented on SPARK-4168:
--------------------------------------

User 'liyezhang556520' has created a pull request for this issue: https://github.com/apache/spark/pull/3035

Completed Stages Number are misleading webUI when stages are more than 1000
----------------------------------------------------------------------------
Key: SPARK-4168
URL: https://issues.apache.org/jira/browse/SPARK-4168
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 1.1.0
Reporter: Zhang, Liye
[jira] [Created] (SPARK-4169) [Core] Locale dependent code
Niklas Wilcke created SPARK-4169:
------------------------------------

Summary: [Core] Locale dependent code
Key: SPARK-4169
URL: https://issues.apache.org/jira/browse/SPARK-4169
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.1.0
Environment: Debian, Locale: de_DE
Reporter: Niklas Wilcke
Fix For: 1.2.0

With a non-English locale, the method isBindCollision in core/src/main/scala/org/apache/spark/util/Utils.scala doesn't work because it checks the exception message, which is locale dependent. The test suite core/src/test/scala/org/apache/spark/util/UtilsSuite.scala also contains a locale-dependent test: the string formatting of time durations uses a decimal separator, which is locale dependent.
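A hedged sketch of a locale-independent variant (illustrative, not the actual patch): test the exception's type along the cause chain rather than its localized message.

{code}
import java.net.BindException

object BindCheck {
  // Walk the cause chain; a BindException indicates a bind collision
  // regardless of the locale the message was rendered in.
  def isBindCollision(exception: Throwable): Boolean = exception match {
    case null => false
    case _: BindException => true
    case e if e.getCause != null => isBindCollision(e.getCause)
    case _ => false
  }
}
{code}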
[jira] [Created] (SPARK-4170) Closure problems when running Scala app that extends App
Sean Owen created SPARK-4170:
--------------------------------

Summary: Closure problems when running Scala app that extends App
Key: SPARK-4170
URL: https://issues.apache.org/jira/browse/SPARK-4170
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.1.0
Reporter: Sean Owen
Priority: Minor

Michael Albert noted this problem on the mailing list (http://apache-spark-user-list.1001560.n3.nabble.com/BUG-when-running-as-quot-extends-App-quot-closures-don-t-capture-variables-td17675.html):

{code}
object DemoBug extends App {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(List("A", "B", "C", "D"))
    val str1 = "A"

    val rslt1 = rdd.filter(x => { x != "A" }).count
    val rslt2 = rdd.filter(x => { str1 != null && x != "A" }).count

    println("DemoBug: rslt1 = " + rslt1 + " rslt2 = " + rslt2)
}
{code}

This produces the output:
{code}
DemoBug: rslt1 = 3 rslt2 = 0
{code}

If instead there is a proper main(), it works as expected.

I also noticed this week that in a program which extends App, some values were inexplicably null in a closure. When changing to use main(), it was fine. I assume there is a problem with variables not being added to the closure when main() doesn't appear in the standard way.
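For contrast, a sketch of the main()-based variant that the report says behaves as expected (imports assumed; the expected counts follow from the description above):

{code}
import org.apache.spark.{SparkConf, SparkContext}

object DemoFixed {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(List("A", "B", "C", "D"))
    val str1 = "A"

    // With a proper main(), str1 is captured correctly, so both counts are 3.
    val rslt1 = rdd.filter(x => x != "A").count
    val rslt2 = rdd.filter(x => str1 != null && x != "A").count

    println("DemoFixed: rslt1 = " + rslt1 + " rslt2 = " + rslt2)
  }
}
{code}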
[jira] [Updated] (SPARK-4169) [Core] Locale dependent code
[ https://issues.apache.org/jira/browse/SPARK-4169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niklas Wilcke updated SPARK-4169:
---------------------------------
Description:
With a non-English locale, the method isBindCollision in core/src/main/scala/org/apache/spark/util/Utils.scala doesn't work because it checks the exception message, which is locale dependent. The test suite core/src/test/scala/org/apache/spark/util/UtilsSuite.scala also contains a locale-dependent test: the string formatting of time durations uses a decimal separator, which is locale dependent.

I created a pull request on github to solve this issue.

was:
With a non-English locale, the method isBindCollision in core/src/main/scala/org/apache/spark/util/Utils.scala doesn't work because it checks the exception message, which is locale dependent. The test suite core/src/test/scala/org/apache/spark/util/UtilsSuite.scala also contains a locale-dependent test: the string formatting of time durations uses a decimal separator, which is locale dependent.

[Core] Locale dependent code
----------------------------
Key: SPARK-4169
URL: https://issues.apache.org/jira/browse/SPARK-4169
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.1.0
Environment: Debian, Locale: de_DE
Reporter: Niklas Wilcke
Fix For: 1.2.0
Original Estimate: 0.25h
Remaining Estimate: 0.25h
[jira] [Updated] (SPARK-4169) [Core] Locale dependent code
[ https://issues.apache.org/jira/browse/SPARK-4169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niklas Wilcke updated SPARK-4169:
---------------------------------
Labels: patch test (was: )

[Core] Locale dependent code
----------------------------
Key: SPARK-4169
URL: https://issues.apache.org/jira/browse/SPARK-4169
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.1.0
Environment: Debian, Locale: de_DE
Reporter: Niklas Wilcke
Labels: patch, test
Fix For: 1.2.0
Original Estimate: 0.25h
Remaining Estimate: 0.25h
[jira] [Commented] (SPARK-4169) [Core] Locale dependent code
[ https://issues.apache.org/jira/browse/SPARK-4169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191700#comment-14191700 ]

Apache Spark commented on SPARK-4169:
--------------------------------------

User 'numbnut' has created a pull request for this issue: https://github.com/apache/spark/pull/3036

[Core] Locale dependent code
----------------------------
Key: SPARK-4169
URL: https://issues.apache.org/jira/browse/SPARK-4169
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.1.0
Environment: Debian, Locale: de_DE
Reporter: Niklas Wilcke
Labels: patch, test
Fix For: 1.2.0
[jira] [Updated] (SPARK-4165) Actor with Companion throws ambiguous reference error in REPL
[ https://issues.apache.org/jira/browse/SPARK-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shiti Saxena updated SPARK-4165:
--------------------------------
Affects Version/s: 1.2.0
                   1.0.1
                   1.1.0

Actor with Companion throws ambiguous reference error in REPL
-------------------------------------------------------------
Key: SPARK-4165
URL: https://issues.apache.org/jira/browse/SPARK-4165
Project: Spark
Issue Type: Bug
Affects Versions: 1.0.1, 1.1.0, 1.2.0
Reporter: Shiti Saxena
[jira] [Created] (SPARK-4171) StreamingContext.actorStream throws serializationError
Shiti Saxena created SPARK-4171:
-----------------------------------

Summary: StreamingContext.actorStream throws serializationError
Key: SPARK-4171
URL: https://issues.apache.org/jira/browse/SPARK-4171
Project: Spark
Issue Type: Bug
Affects Versions: 1.1.0, 1.2.0
Reporter: Shiti Saxena
[jira] [Updated] (SPARK-4171) StreamingContext.actorStream throws serializationError
[ https://issues.apache.org/jira/browse/SPARK-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shiti Saxena updated SPARK-4171:
--------------------------------
Description: I encountered this issue when

StreamingContext.actorStream throws serializationError
------------------------------------------------------
Key: SPARK-4171
URL: https://issues.apache.org/jira/browse/SPARK-4171
Project: Spark
Issue Type: Bug
Affects Versions: 1.1.0, 1.2.0
Reporter: Shiti Saxena

I encountered this issue when
[jira] [Updated] (SPARK-4171) StreamingContext.actorStream throws serializationError
[ https://issues.apache.org/jira/browse/SPARK-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shiti Saxena updated SPARK-4171:
--------------------------------
Description:
I encountered this issue when I was working on https://issues.apache.org/jira/browse/SPARK-3872. Running the following test case on v1.1.0 and the master branch (v1.2.0-SNAPSHOT) throws a serialization error.

{noformat}
test("actor input stream") {
  // Set up the streaming context and input streams
  val ssc = new StreamingContext(conf, batchDuration)
  val networkStream = ssc.actorStream[String](EchoActor.props, "TestActor",
    // Had to pass the local value of port to prevent from closing over entire scope
    StorageLevel.MEMORY_AND_DISK)
  println("created actor")
  networkStream.print()
  ssc.start()
  Thread.sleep(3 * 1000)
  println("started stream")
  Thread.sleep(3 * 1000)
  logInfo("Stopping server")
  logInfo("Stopping context")
  ssc.stop()
}

class EchoActor extends Actor with ActorHelper {
  override def receive = {
    case message => sender ! message
  }
}

object EchoActor {
  def props: Props = Props(new EchoActor())
}
{noformat}

was: I encountered this issue when

StreamingContext.actorStream throws serializationError
------------------------------------------------------
Key: SPARK-4171
URL: https://issues.apache.org/jira/browse/SPARK-4171
Project: Spark
Issue Type: Bug
Affects Versions: 1.1.0, 1.2.0
Reporter: Shiti Saxena
[jira] [Updated] (SPARK-4171) StreamingContext.actorStream throws serializationError
[ https://issues.apache.org/jira/browse/SPARK-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shiti Saxena updated SPARK-4171:
--------------------------------
Description:
I encountered this issue when I was working on https://issues.apache.org/jira/browse/SPARK-3872. Running the following test case on v1.1.0 and the master branch (v1.2.0-SNAPSHOT) throws a serialization error.

{noformat}
test("actor input stream") {
  // Set up the streaming context and input streams
  val ssc = new StreamingContext(conf, batchDuration)
  val networkStream = ssc.actorStream[String](EchoActor.props, "TestActor",
    // Had to pass the local value of port to prevent from closing over entire scope
    StorageLevel.MEMORY_AND_DISK)
  println("created actor")
  networkStream.print()
  ssc.start()
  Thread.sleep(3 * 1000)
  println("started stream")
  Thread.sleep(3 * 1000)
  logInfo("Stopping server")
  logInfo("Stopping context")
  ssc.stop()
}
{noformat}

where EchoActor is defined as

{noformat}
class EchoActor extends Actor with ActorHelper {
  override def receive = {
    case message => sender ! message
  }
}

object EchoActor {
  def props: Props = Props(new EchoActor())
}
{noformat}

StreamingContext.actorStream throws serializationError
------------------------------------------------------
Key: SPARK-4171
URL: https://issues.apache.org/jira/browse/SPARK-4171
Project: Spark
Issue Type: Bug
Affects Versions: 1.1.0, 1.2.0
Reporter: Shiti Saxena
[jira] [Commented] (SPARK-4171) StreamingContext.actorStream throws serializationError
[ https://issues.apache.org/jira/browse/SPARK-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14191736#comment-14191736 ] Shiti Saxena commented on SPARK-4171: - After applying the patch from https://github.com/apache/spark/pull/2158, I was able to replicate the issue in the REPL as well:
{noformat}
Spark context available as sc.

scala> import org.apache.spark.streaming.receiver.{ActorHelper, Receiver}
import org.apache.spark.streaming.receiver.{ActorHelper, Receiver}

scala> import akka.actor.{Actor,Props}
import akka.actor.{Actor, Props}

scala> import org.apache.spark.streaming._
import org.apache.spark.streaming._

scala> Seconds(1)
res0: org.apache.spark.streaming.Duration = 1000 ms

scala> val ssc= new StreamingContext(sc,res0)
ssc: org.apache.spark.streaming.StreamingContext = org.apache.spark.streaming.StreamingContext@1b1bca6c

scala> :pas
// Entering paste mode (ctrl-D to finish)

class EchoActor extends Actor with ActorHelper {
  override def receive = {
    case message => sender ! message
  }
}

object EchoActor {
  def props: Props = Props(new EchoActor())
}

defined class EchoActor
defined module EchoActor

scala> ssc.actorStream[String](EchoActor.props, "TestActor")
res1: org.apache.spark.streaming.dstream.ReceiverInputDStream[String] = org.apache.spark.streaming.dstream.PluggableInputDStream@56a620b4

scala> res1.print()

scala> ssc.start()
14/10/31 16:52:48 INFO ReceiverTracker: ReceiverTracker started
14/10/31 16:52:48 INFO ForEachDStream: metadataCleanupDelay = -1
14/10/31 16:52:48 INFO PluggableInputDStream: metadataCleanupDelay = -1
14/10/31 16:52:48 INFO PluggableInputDStream: Slide time = 1000 ms
14/10/31 16:52:48 INFO PluggableInputDStream: Storage level = StorageLevel(false, false, false, false, 1)
14/10/31 16:52:48 INFO PluggableInputDStream: Checkpoint interval = null
14/10/31 16:52:48 INFO PluggableInputDStream: Remember duration = 1000 ms
14/10/31 16:52:48 INFO PluggableInputDStream: Initialized and validated org.apache.spark.streaming.dstream.PluggableInputDStream@56a620b4
14/10/31 16:52:48 INFO ForEachDStream: Slide time = 1000 ms
14/10/31 16:52:48 INFO ForEachDStream: Storage level = StorageLevel(false, false, false, false, 1)
14/10/31 16:52:48 INFO ForEachDStream: Checkpoint interval = null
14/10/31 16:52:48 INFO ForEachDStream: Remember duration = 1000 ms
14/10/31 16:52:48 INFO ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream@4a5a796
14/10/31 16:52:48 INFO ReceiverTracker: Starting 1 receivers
14/10/31 16:52:48 INFO SparkContext: Starting job: runJob at ReceiverTracker.scala:275
14/10/31 16:52:48 INFO DAGScheduler: Got job 0 (runJob at ReceiverTracker.scala:275) with 1 output partitions (allowLocal=false)
14/10/31 16:52:48 INFO DAGScheduler: Final stage: Stage 0(runJob at ReceiverTracker.scala:275)
14/10/31 16:52:48 INFO DAGScheduler: Parents of final stage: List()
14/10/31 16:52:48 INFO DAGScheduler: Missing parents: List()
14/10/31 16:52:48 INFO DAGScheduler: Submitting Stage 0 (ParallelCollectionRDD[0] at makeRDD at ReceiverTracker.scala:253), which has no missing parents
14/10/31 16:52:48 INFO RecurringTimer: Started timer for JobGenerator at time 1414754569000
14/10/31 16:52:48 INFO JobGenerator: Started JobGenerator at 1414754569000 ms
14/10/31 16:52:48 INFO JobScheduler: Started JobScheduler

scala> 14/10/31 16:52:48 INFO MemoryStore: ensureFreeSpace(1216) called with curMem=0, maxMem=278302556
14/10/31 16:52:48 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1216.0 B, free 265.4 MB)
14/10/31 16:52:48 INFO TaskSchedulerImpl: Cancelling stage 0
14/10/31 16:52:48 INFO DAGScheduler: Failed to run runJob at ReceiverTracker.scala:275
Exception in thread "Thread-38" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: $line19.$read$$iwC$$iwC$EchoActor$
- field (class "$line19.$read$$iwC$$iwC$EchoActor$$anonfun$props$1", name: "$outer", type: "class $line19.$read$$iwC$$iwC$EchoActor$")
- object (class "$line19.$read$$iwC$$iwC$EchoActor$$anonfun$props$1", <function0>)
- element of array (index: 1)
- array (class "[Ljava.lang.Object;", size: 32)
- field (class "scala.collection.immutable.Vector", name: "display0", type: "class [Ljava.lang.Object;")
- object (class "scala.collection.immutable.Vector", Vector(class $line19.$read$$iwC$$iwC$EchoActor, <function0>))
- field (class "akka.actor.Props", name: "args", type: "interface scala.collection.immutable.Seq")
- object (class "akka.actor.Props", Props(Deploy(,Config(SimpleConfigObject({})),NoRouter,NoScopeGiven,,),class akka.actor.TypedCreatorFunctionConsumer,Vector(class $line19.$read$$iwC$$iwC$EchoActor, <function0>)))
- field (class "org.apache.spark.streaming.receiver.ActorReceiver", name:
[jira] [Updated] (SPARK-4171) StreamingContext.actorStream throws serializationError
[ https://issues.apache.org/jira/browse/SPARK-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiti Saxena updated SPARK-4171: Description: I encountered this issue when I was working on https://issues.apache.org/jira/browse/SPARK-3872. Running the following test case on v1.1.0 and the master branch (v1.2.0-SNAPSHOT) throws a serialization error.
{noformat}
test("actor input stream") {
  // Set up the streaming context and input streams
  val ssc = new StreamingContext(conf, batchDuration)
  val networkStream = ssc.actorStream[String](EchoActor.props, "TestActor",
    // Had to pass the local value of port to prevent from closing over entire scope
    StorageLevel.MEMORY_AND_DISK)
  println("created actor")
  networkStream.print()
  ssc.start()
  Thread.sleep(3 * 1000)
  println("started stream")
  Thread.sleep(3 * 1000)
  logInfo("Stopping server")
  logInfo("Stopping context")
  ssc.stop()
}
{noformat}
where EchoActor is defined as
{noformat}
class EchoActor extends Actor with ActorHelper {
  override def receive = {
    case message => sender ! message
  }
}

object EchoActor {
  def props: Props = Props(new EchoActor())
}
{noformat}
The same code works with v1.0.1
was: I encountered this issue when I was working on https://issues.apache.org/jira/browse/SPARK-3872. Running the following test case on v1.1.0 and the master branch (v1.2.0-SNAPSHOT) throws a serialization error.
{noformat}
test("actor input stream") {
  // Set up the streaming context and input streams
  val ssc = new StreamingContext(conf, batchDuration)
  val networkStream = ssc.actorStream[String](EchoActor.props, "TestActor",
    // Had to pass the local value of port to prevent from closing over entire scope
    StorageLevel.MEMORY_AND_DISK)
  println("created actor")
  networkStream.print()
  ssc.start()
  Thread.sleep(3 * 1000)
  println("started stream")
  Thread.sleep(3 * 1000)
  logInfo("Stopping server")
  logInfo("Stopping context")
  ssc.stop()
}
{noformat}
where EchoActor is defined as
{noformat}
class EchoActor extends Actor with ActorHelper {
  override def receive = {
    case message => sender ! message
  }
}

object EchoActor {
  def props: Props = Props(new EchoActor())
}
{noformat}
StreamingContext.actorStream throws serializationError -- Key: SPARK-4171 URL: https://issues.apache.org/jira/browse/SPARK-4171 Project: Spark Issue Type: Bug Affects Versions: 1.1.0, 1.2.0 Reporter: Shiti Saxena I encountered this issue when I was working on https://issues.apache.org/jira/browse/SPARK-3872. Running the following test case on v1.1.0 and the master branch (v1.2.0-SNAPSHOT) throws a serialization error.
{noformat}
test("actor input stream") {
  // Set up the streaming context and input streams
  val ssc = new StreamingContext(conf, batchDuration)
  val networkStream = ssc.actorStream[String](EchoActor.props, "TestActor",
    // Had to pass the local value of port to prevent from closing over entire scope
    StorageLevel.MEMORY_AND_DISK)
  println("created actor")
  networkStream.print()
  ssc.start()
  Thread.sleep(3 * 1000)
  println("started stream")
  Thread.sleep(3 * 1000)
  logInfo("Stopping server")
  logInfo("Stopping context")
  ssc.stop()
}
{noformat}
where EchoActor is defined as
{noformat}
class EchoActor extends Actor with ActorHelper {
  override def receive = {
    case message => sender ! message
  }
}

object EchoActor {
  def props: Props = Props(new EchoActor())
}
{noformat}
The same code works with v1.0.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
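The serialization trace in the earlier comment shows the {{Props(new EchoActor())}} creator closing over its enclosing scope (the {{$outer}} field on the anonymous {{<function0>}}), which is what fails to serialize. A minimal sketch of a workaround under that assumption, using Akka's class-based {{Props}} factory so no creator closure is captured ({{ssc}} is the streaming context from the test above; this is illustrative only, not the fix that eventually shipped):
{code}
import akka.actor.{Actor, Props}
import org.apache.spark.streaming.receiver.ActorHelper

// Top-level class: Props[EchoActor] instantiates it reflectively by class,
// so no enclosing object reference ends up inside the Props.
class EchoActor extends Actor with ActorHelper {
  override def receive = {
    case message => sender ! message
  }
}

// No EchoActor.props factory closing over an outer instance:
val networkStream = ssc.actorStream[String](Props[EchoActor], "TestActor")
{code}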
[jira] [Commented] (SPARK-3183) Add option for requesting full YARN cluster
[ https://issues.apache.org/jira/browse/SPARK-3183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14191781#comment-14191781 ] Gen TANG commented on SPARK-3183: - The same workaround works for --num-executors. For memory, I am thinking of using _yarn.scheduler.maximum-allocation-mb_ as --executor-memory Add option for requesting full YARN cluster --- Key: SPARK-3183 URL: https://issues.apache.org/jira/browse/SPARK-3183 Project: Spark Issue Type: Improvement Components: YARN Reporter: Sandy Ryza This could possibly be in the form of --executor-cores ALL --executor-memory ALL --num-executors ALL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
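For reference, the ceiling mentioned in the comment above is readable from the YARN configuration; a minimal sketch using the standard Hadoop YARN client API (illustrative, not proposed Spark code):
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration

val yarnConf = new YarnConfiguration()
// yarn.scheduler.maximum-allocation-mb: the largest container the scheduler
// will grant, i.e. the effective ceiling for --executor-memory.
val maxAllocMb = yarnConf.getInt(
  YarnConfiguration.RM_SCHEDULER_MAXIMUM_ALLOCATION_MB,
  YarnConfiguration.DEFAULT_RM_SCHEDULER_MAXIMUM_ALLOCATION_MB)
{code}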
[jira] [Resolved] (SPARK-3780) YarnAllocator should look at the container completed diagnostic message
[ https://issues.apache.org/jira/browse/SPARK-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-3780. -- Resolution: Fixed Fix Version/s: 1.2.0 YarnAllocator should look at the container completed diagnostic message --- Key: SPARK-3780 URL: https://issues.apache.org/jira/browse/SPARK-3780 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.1.0 Reporter: Thomas Graves Assignee: Sandy Ryza Fix For: 1.2.0 YARN will give us a diagnostic message along with a container-completed notification. We should print that diagnostic message for the Spark user. For instance, I believe that if the container gets shot for being over its memory limit, YARN would give us a useful diagnostic saying that. This would be really useful for the user to see. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
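YARN hands that message back via {{ContainerStatus.getDiagnostics}}; a minimal sketch of the idea (a hedged illustration of the YARN API, not the patch that was merged for this ticket):
{code}
import org.apache.hadoop.yarn.api.records.ContainerStatus

def logContainerCompletion(status: ContainerStatus): Unit = {
  // Surface YARN's diagnostic (e.g. "running beyond physical memory limits")
  // to the Spark user instead of dropping it.
  val diag = Option(status.getDiagnostics).getOrElse("")
  println(s"Container ${status.getContainerId} exited with status " +
    s"${status.getExitStatus}: $diag")
}
{code}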
[jira] [Commented] (SPARK-2220) Fix remaining Hive Commands
[ https://issues.apache.org/jira/browse/SPARK-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14191841#comment-14191841 ] Cheng Lian commented on SPARK-2220: --- It turned out that {{ShellCommand}} and {{SourceCommand}} were previously interpreted incorrectly in Spark SQL. These two classes correspond to the {{\!}} and {{SOURCE}} syntaxes respectively in Spark SQL. However, back in Hive, {{\!}} is interpreted in different ways by Hive CLI and Beeline, and {{SOURCE}} is only supported by Hive CLI. In Hive CLI, {{\!}} starts a shell command (e.g. {{\!ls;}} and {{\!cat foo;}}), while in Beeline {{\!}} starts a Beeline command (e.g. {{\!connect jdbc:hive://localhost:1}} and {{\!run script.sql}}). The {{SOURCE file}} command in Hive CLI is equivalent to the {{\!run file}} command in Beeline. In short, the functionality of these two commands should not be implemented in {{sql/core}} and/or {{sql/hive}}; it is already implemented as part of the Spark SQL CLI and Hive Beeline. Fix remaining Hive Commands --- Key: SPARK-2220 URL: https://issues.apache.org/jira/browse/SPARK-2220 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian None of the following have an execution plan:
{code}
private[hive] case class ShellCommand(cmd: String) extends Command
private[hive] case class SourceCommand(filePath: String) extends Command
private[hive] case class AddFile(filePath: String) extends Command
{code}
dfs is being fixed in a related PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2220) Fix remaining Hive Commands
[ https://issues.apache.org/jira/browse/SPARK-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14191873#comment-14191873 ] Apache Spark commented on SPARK-2220: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/3038 Fix remaining Hive Commands --- Key: SPARK-2220 URL: https://issues.apache.org/jira/browse/SPARK-2220 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian None of the following have an execution plan:
{code}
private[hive] case class ShellCommand(cmd: String) extends Command
private[hive] case class SourceCommand(filePath: String) extends Command
private[hive] case class AddFile(filePath: String) extends Command
{code}
dfs is being fixed in a related PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095232#comment-14095232 ] Debasish Das edited comment on SPARK-2426 at 10/31/14 4:20 PM: --- Hi Xiangrui, The branch is ready for an initial review. I will do a lot of clean-up this week. I need some advice on whether we should bring in the additional ALS features first or integrate NNLS with QuadraticMinimizer so that we can handle large ranks as well. https://github.com/debasish83/spark/commits/qp-als
optimization/QuadraticMinimizer.scala is the placeholder for all quadratic minimization. Right now we support 5 features:
1. Least square
2. Quadratic minimization with positivity
3. Quadratic minimization with box: a generalization of positivity
4. Quadratic minimization with elastic net: L1 is at 0.99, elastic net control is not given to users
5. Quadratic minimization with affine constraints and bounds
There are many regularizers in Proximal.scala that can be re-used in the mllib updater... L1Updater in mllib is an example of a proximal algorithm... QuadraticMinimizer is optimized for direct solves right now (Cholesky / LU based on the problem we are solving). The CG core from Breeze will be used for iterative solves when ranks are high... I need a different variant of CG for formulation 5, so Breeze CG is not sufficient for all the formulations this branch supports and needs to be extended.
Right now I am experimenting with ADMM rho and lambda values so that the NNLS iterations are on par with least square with positivity. The ideas for rho and lambda tuning are the following:
1. Derive an optimal value of lambda for quadratic problems, similar to the idea of Nesterov's acceleration being used in algorithms like FISTA and accelerated ADMM from UCLA
2. Derive rho from approximate min and max eigenvalues of the gram matrix
For Matlab-based experiments within PDCO, ECOS (IPM), MOSEK and ADMM variants, ADMM is faster, producing result quality within 1e-4 of MOSEK. I will publish the numbers and the Matlab script through the ECOS jnilib open source (GPL licensed). I did not add any of the ECOS code here so that everything stays Apache.
For the topic modeling use-case, I expect to produce sparse coding results (L1 on product factors, L2 on user factors). Example runs:
NMF:
./bin/spark-submit --total-executor-cores 4 --master spark://localhost:7077 --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --class org.apache.spark.examples.mllib.MovieLensALS ./examples/target/spark-examples_2.10-1.1.0-SNAPSHOT.jar --rank 20 --numIterations 10 --userConstraint POSITIVE --lambdaUser 0.065 --productConstraint POSITIVE --lambdaProduct 0.065 --kryo hdfs://localhost:8020/sandbox/movielens/
Sparse coding:
./bin/spark-submit --total-executor-cores 4 --master spark://localhost:7077 --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --class org.apache.spark.examples.mllib.MovieLensALS ./examples/target/spark-examples_2.10-1.1.0-SNAPSHOT.jar --delimiter --rank 20 --numIterations 10 --userConstraint SMOOTH --lambdaUser 0.065 --productConstraint SPARSE --lambdaProduct 0.065 --kryo hdfs://localhost:8020/sandbox/movielens
Robust PLSA with least square loss:
./bin/spark-submit --total-executor-cores 4 --master spark://localhost:7077 --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --class org.apache.spark.examples.mllib.MovieLensALS ./examples/target/spark-examples_2.10-1.1.0-SNAPSHOT.jar --delimiter --rank 20 --numIterations 10 --userConstraint EQUALITY --lambdaUser 0.065 --productConstraint EQUALITY --lambdaProduct 0.065 --kryo hdfs://localhost:8020/sandbox/movielens
With this change, users can choose to apply user- and product-specific constraints: basically, positive factors for products (interpretability) and smooth factors for users to get more RMSE improvement. Thanks. Deb
was (Author: debasish83): Hi Xiangrui, The branch is ready for an initial review. I will do a lot of clean-up this week. I need some advice on whether we should bring in the additional ALS features first or integrate NNLS with QuadraticMinimizer so that we can handle large ranks as well. https://github.com/debasish83/spark/commits/qp-als optimization/QuadraticMinimizer.scala is the placeholder for all quadratic minimization. Right now we support 5 features: 1. Least square 2. Least square with positivity 3. Least square with bounds: a generalization of positivity 4. Least square with equality and positivity/bounds for LDA/PLSA 5. Least square + L1 constraint for sparse NMF There are many regularizers in Proximal.scala that can be re-used in the mllib updater... L1Updater in mllib is an example of a proximal algorithm... QuadraticMinimizer is optimized for direct solves right now (Cholesky / LU based on the problem we are solving)
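To make the proximal-operator vocabulary above concrete: the proximal operator of the L1 penalty is coordinate-wise soft-thresholding, which is what makes an L1 update a proximal step. A minimal self-contained sketch (generic math, not code from the qp-als branch):
{code}
// prox_{lambda*||.||_1}(x): shrink each coordinate toward zero by lambda.
def proxL1(x: Array[Double], lambda: Double): Array[Double] =
  x.map(v => math.signum(v) * math.max(math.abs(v) - lambda, 0.0))

proxL1(Array(1.5, -0.3, 0.7), 0.5)  // ~ Array(1.0, 0.0, 0.2)
{code}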
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14191997#comment-14191997 ] Debasish Das commented on SPARK-2426: - Matlab comparisons of MOSEK, ECOS, PDCO and ADMM are over here: https://github.com/debasish83/ecos/blob/master/README.md MOSEK is available for research purposes. Let me know if there are issues in running the matlab scripts. Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4080) IOException: unexpected exception type while deserializing tasks
[ https://issues.apache.org/jira/browse/SPARK-4080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192005#comment-14192005 ] Josh Rosen commented on SPARK-4080: --- Hi [~kul], Thanks for trying this out! I'm glad to see that my patch improved the error reporting here. What do you mean by creating more than one SparkContext? Are you creating multiple concurrently-running SparkContexts in the same driver JVM? IOException: unexpected exception type while deserializing tasks -- Key: SPARK-4080 URL: https://issues.apache.org/jira/browse/SPARK-4080 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Critical Fix For: 1.1.1, 1.2.0 When deserializing tasks on executors, we sometimes see {{IOException: unexpected exception type}}: {code} java.io.IOException: unexpected exception type java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1538) java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1025) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:163) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} Here are some occurrences of this bug reported on the mailing list and GitHub: - https://www.mail-archive.com/user@spark.apache.org/msg12129.html - http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201409.mbox/%3ccaeawm8uop9tgarm5sceppzey5qxo+h8hu8ujzah5s-ajyzz...@mail.gmail.com%3E - https://github.com/yieldbot/flambo/issues/13 - https://www.mail-archive.com/user@spark.apache.org/msg13283.html This is probably caused by throwing exceptions other than IOException from our custom {{readExternal}} methods (see http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/java/io/ObjectStreamClass.java#1022). [~davies] spotted an instance of this in TorrentBroadcast, where a failed {{require}} throws a different exception, but this issue has been reported in Spark 1.1.0 as well. To fix this, I'm going to add try-catch blocks around all of our {{readExternal}} and {{writeExternal}} methods to re-throw caught exceptions as IOException. This fix should allow us to determine the actual exceptions that are causing deserialization failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
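The fix sketched in the last paragraph of the description amounts to a small wrapper around each {{readExternal}}/{{writeExternal}} body; a hedged sketch of that idea (the helper name here is illustrative):
{code}
import java.io.IOException

// Run a (de)serialization block, re-throwing anything that is not already an
// IOException as an IOException, so ObjectInputStream surfaces the real cause
// instead of "unexpected exception type".
def tryOrIOException[T](block: => T): T = {
  try {
    block
  } catch {
    case e: IOException => throw e
    case e: Throwable => throw new IOException(e)
  }
}

// Usage inside a custom Externalizable:
//   override def readExternal(in: ObjectInput): Unit = tryOrIOException { ... }
{code}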
[jira] [Commented] (SPARK-2189) Method for removing temp tables created by registerAsTable
[ https://issues.apache.org/jira/browse/SPARK-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192084#comment-14192084 ] Apache Spark commented on SPARK-2189: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/3039 Method for removing temp tables created by registerAsTable -- Key: SPARK-2189 URL: https://issues.apache.org/jira/browse/SPARK-2189 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4172) Progress API in Python
Davies Liu created SPARK-4172: - Summary: Progress API in Python Key: SPARK-4172 URL: https://issues.apache.org/jira/browse/SPARK-4172 Project: Spark Issue Type: New Feature Components: PySpark Reporter: Davies Liu A poll-based progress API for Python -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4172) Progress API in Python
[ https://issues.apache.org/jira/browse/SPARK-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192106#comment-14192106 ] Apache Spark commented on SPARK-4172: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3027 Progress API in Python -- Key: SPARK-4172 URL: https://issues.apache.org/jira/browse/SPARK-4172 Project: Spark Issue Type: New Feature Components: PySpark Reporter: Davies Liu Assignee: Davies Liu A poll-based progress API for Python -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
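For context, the Scala side of this shape already exists as {{SparkContext.statusTracker}} (new for 1.2), which the proposed Python API mirrors; a hedged polling sketch against that Scala API (assuming a running {{sc}}):
{code}
// Poll-based progress: ask the status tracker what is active right now.
val tracker = sc.statusTracker
while (tracker.getActiveJobIds.nonEmpty) {
  println(s"active jobs: ${tracker.getActiveJobIds.mkString(", ")}; " +
    s"active stages: ${tracker.getActiveStageIds.mkString(", ")}")
  Thread.sleep(1000)
}
{code}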
[jira] [Resolved] (SPARK-4016) Allow user to optionally show additional, advanced metrics in the UI
[ https://issues.apache.org/jira/browse/SPARK-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4016. --- Resolution: Fixed Issue resolved by pull request 2867 [https://github.com/apache/spark/pull/2867] Allow user to optionally show additional, advanced metrics in the UI Key: SPARK-4016 URL: https://issues.apache.org/jira/browse/SPARK-4016 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor Fix For: 1.2.0 Allowing the user to show/hide additional metrics will allow us to both (1) add more advanced metrics without cluttering the UI for the average user and (2) hide, by default, some of the metrics currently shown that are not widely used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4141) Hide Accumulators column on stage page when no accumulators exist
[ https://issues.apache.org/jira/browse/SPARK-4141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4141: -- Assignee: (was: Josh Rosen) Hide Accumulators column on stage page when no accumulators exist - Key: SPARK-4141 URL: https://issues.apache.org/jira/browse/SPARK-4141 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Priority: Minor Labels: starter The task table on the details page for each stage has a column for accumulators. We should only show this column if the stage has accumulators, otherwise it clutters the UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4141) Hide Accumulators column on stage page when no accumulators exist
[ https://issues.apache.org/jira/browse/SPARK-4141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-4141: - Assignee: Josh Rosen Hide Accumulators column on stage page when no accumulators exist - Key: SPARK-4141 URL: https://issues.apache.org/jira/browse/SPARK-4141 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Assignee: Josh Rosen Priority: Minor Labels: starter The task table on the details page for each stage has a column for accumulators. We should only show this column if the stage has accumulators, otherwise it clutters the UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3987) NNLS generates incorrect result
[ https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192165#comment-14192165 ] Shuo Xiang commented on SPARK-3987: --- [~debasish83][~mengxr] The condition number for the latest test case is 74.5 and the test case I put in my PR was 2. NNLS generates incorrect result --- Key: SPARK-3987 URL: https://issues.apache.org/jira/browse/SPARK-3987 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Debasish Das Assignee: Shuo Xiang Fix For: 1.1.1, 1.2.0 Hi, Please see the example gram matrix and linear term: val P2 = new DoubleMatrix(20, 20, 333907.312770, -60814.043975, 207935.829941, -162881.367739, -43730.396770, 17511.428983, -243340.496449, -225245.957922, 104700.445881, 32430.845099, 336378.693135, -373497.970207, -41147.159621, 53928.060360, -293517.883778, 53105.278068, 0.00, -85257.781696, 84913.970469, -10584.080103, -60814.043975, 13826.806664, -38032.612640, 33475.833875, 10791.916809, -1040.950810, 48106.552472, 45390.073380, -16310.282190, -2861.455903, -60790.833191, 73109.516544, 9826.614644, -8283.992464, 56991.742991, -6171.366034, 0.00, 19152.382499, -13218.721710, 2793.734234, 207935.829941, -38032.612640, 129661.677608, -101682.098412, -27401.299347, 10787.713362, -151803.006149, -140563.601672, 65067.935324, 20031.263383, 209521.268600, -232958.054688, -25764.179034, 33507.951918, -183046.845592, 32884.782835, 0.00, -53315.811196, 52770.762546, -6642.187643, -162881.367739, 33475.833875, -101682.098412, 85094.407608, 25422.850782, -5437.646141, 124197.166330, 116206.265909, -47093.484134, -11420.168521, -163429.436848, 189574.783900, 23447.172314, -24087.375367, 148311.355507, -20848.385466, 0.00, 46835.814559, -38180.352878, 6415.873901, -43730.396770, 10791.916809, -27401.299347, 25422.850782, 8882.869799, 15.638084, 35933.473986, 34186.371325, -10745.330690, -974.314375, -43537.709621, 54371.010558, 7894.453004, -5408.929644, 42231.381747, -3192.010574, 0.00, 15058.753110, -8704.757256, 2316.581535, 17511.428983, -1040.950810, 10787.713362, -5437.646141, 15.638084, 2794.949847, -9681.950987, -8258.171646, 7754.358930, 4193.359412, 18052.143842, -15456.096769, -253.356253, 4089.672804, -12524.380088, 5651.579348, 0.00, -1513.302547, 6296.461898, 152.427321, -243340.496449, 48106.552472, -151803.006149, 124197.166330, 35933.473986, -9681.950987, 182931.600236, 170454.352953, -72361.174145, -19270.461728, -244518.179729, 279551.060579, 33340.452802, -37103.267653, 219025.288975, -33687.141423, 0.00, 67347.950443, -58673.009647, 8957.800259, -225245.957922, 45390.073380, -140563.601672, 116206.265909, 34186.371325, -8258.171646, 170454.352953, 159322.942894, -66074.960534, -16839.743193, -226173.967766, 260421.044094, 31624.194003, -33839.612565, 203889.695169, -30034.828909, 0.00, 63525.040745, -53572.741748, 8575.071847, 104700.445881, -16310.282190, 65067.935324, -47093.484134, -10745.330690, 7754.358930, -72361.174145, -66074.960534, 35869.598076, 13378.653317, 106033.647837, -111831.682883, -10455.465743, 18537.392481, -88370.612394, 20344.288488, 0.00, -22935.482766, 29004.543704, -2409.461759, 32430.845099, -2861.455903, 20031.263383, -11420.168521, -974.314375, 4193.359412, -19270.461728, -16839.743193, 13378.653317, 6802.081898, 33256.395091, -30421.985199, -1296.785870, 7026.518692, -24443.378205, 9221.982599, 0.00, -4088.076871, 10861.014242, -25.092938, 336378.693135, -60790.833191, 209521.268600, -163429.436848, -43537.709621, 18052.143842, 
-244518.179729, -226173.967766, 106033.647837, 33256.395091, 339200.268106, -375442.716811, -41027.594509, 54636.778527, -295133.248586, 54177.278365, 0.00, -85237.666701, 85996.957056, -10503.209968, -373497.970207, 73109.516544, -232958.054688, 189574.783900, 54371.010558, -15456.096769, 279551.060579, 260421.044094, -111831.682883, -30421.985199, -375442.716811, 427793.208465, 50528.074431, -57375.986301, 335203.382015, -52676.385869, 0.00, 102368.307670, -90679.792485, 13509.390393, -41147.159621, 9826.614644, -25764.179034, 23447.172314, 7894.453004, -253.356253, 33340.452802, 31624.194003, -10455.465743, -1296.785870, -41027.594509, 50528.074431, 7255.977434, -5281.636812, 39298.355527, -3440.450858, 0.00, 13717.870243, -8471.405582, 2071.812204, 53928.060360, -8283.992464, 33507.951918, -24087.375367, -5408.929644, 4089.672804, -37103.267653, -33839.612565, 18537.392481, 7026.518692, 54636.778527, -57375.986301, -5281.636812, 9735.061160, -45360.674033, 10634.633559, 0.00, -11652.364691, 15039.566630, -1202.539106, -293517.883778, 56991.742991, -183046.845592,
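For anyone reproducing the conditioning numbers quoted in the comment above (74.5 vs. 2), the condition number in question is the ratio of the extreme singular values of the gram matrix; a minimal Breeze sketch (illustrative, not part of the test suite):
{code}
import breeze.linalg.{DenseMatrix, svd}

// cond(P) = sigma_max / sigma_min for a symmetric positive-definite gram matrix.
def cond(p: DenseMatrix[Double]): Double = {
  val s = svd(p).singularValues // sorted in decreasing order
  s(0) / s(s.length - 1)
}
{code}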
[jira] [Commented] (SPARK-4079) Snappy bundled with Spark does not work on older Linux distributions
[ https://issues.apache.org/jira/browse/SPARK-4079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192188#comment-14192188 ] Kostas Sakellis commented on SPARK-4079: yes, I'm taking this over from Marcelo. Snappy bundled with Spark does not work on older Linux distributions Key: SPARK-4079 URL: https://issues.apache.org/jira/browse/SPARK-4079 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Marcelo Vanzin This issue has existed at least since 1.0, but has been made worse by 1.1 since snappy is now the default compression algorithm. When trying to use it on a CentOS 5 machine, for example, you'll get something like this: {noformat} java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:319) at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:226) at org.xerial.snappy.Snappy.clinit(Snappy.java:48) at org.xerial.snappy.SnappyOutputStream.init(SnappyOutputStream.java:79) at org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:125) at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:207) ... Caused by: java.lang.UnsatisfiedLinkError: /tmp/snappy-1.0.5.3-af72bf3c-9dab-43af-a662-f9af657f06b1-libsnappyjava.so: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.9' not found (required by /tmp/snappy-1.0.5.3-af72bf3c-9dab-43af-a662-f9af657f06b1-libsnappyjava.so) at java.lang.ClassLoader$NativeLibrary.load(Native Method) at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1957) at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1882) at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1843) at java.lang.Runtime.load0(Runtime.java:795) at java.lang.System.load(System.java:1061) at org.xerial.snappy.SnappyNativeLoader.load(SnappyNativeLoader.java:39) ... 29 more {noformat} There are two approaches I can see here (well, 3): * Declare CentOS 5 (and similar OSes) not supported, although that would suck for the people who are still on it and already use Spark * Fallback to another compression codec if Snappy cannot be loaded * Ask the Snappy guys to compile the library on an older OS... I think the second would be the best compromise. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3826) Support JDBC/ODBC server with Hive 0.13.1
[ https://issues.apache.org/jira/browse/SPARK-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3826. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2685 [https://github.com/apache/spark/pull/2685] Support JDBC/ODBC server with Hive 0.13.1 - Key: SPARK-3826 URL: https://issues.apache.org/jira/browse/SPARK-3826 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Assignee: wangfei Priority: Blocker Fix For: 1.2.0 Currently hive-thriftserver does not support Hive 0.13; make it support both 0.12 and 0.13. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4077) A broken string timestamp value can cause Spark SQL to return wrong values for valid string timestamp values
[ https://issues.apache.org/jira/browse/SPARK-4077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4077. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3019 [https://github.com/apache/spark/pull/3019] A broken string timestamp value can cause Spark SQL to return wrong values for valid string timestamp values --- Key: SPARK-4077 URL: https://issues.apache.org/jira/browse/SPARK-4077 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Yin Huai Assignee: Venkata Ramana G Fix For: 1.2.0 The following case returns wrong results. The text file is
{code}
2014-12-11 00:00:00,1
2014-12-11astring00:00:00,2
{code}
The DDL statement and the query are shown below...
{code}
sql("""
  create external table date_test(my_date timestamp, id int)
  row format delimited fields terminated by ','
  lines terminated by '\n'
  LOCATION 'dateTest'
""")

sql("select * from date_test").collect.foreach(println)
{code}
The result is
{code}
[1969-12-31 19:00:00.0,1]
[null,2]
{code}
If I change the data to
{code}
2014-12-11 00:00:00,1
2014-12-11 00:00:00,2
{code}
The result is fine. For the data with the broken string timestamp value, I tried runSqlHive. The result is fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
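The {{[1969-12-31 19:00:00.0,1]}} row above is epoch zero rendered in a UTC-5 timezone, which suggests the string-to-timestamp cast misbehaves for valid rows once any row is malformed. For reference, the underlying JDK parse is strict about the format (a hedged illustration, not the Spark code path):
{code}
// Strict JDK parsing of the two rows in the data file above:
java.sql.Timestamp.valueOf("2014-12-11 00:00:00")        // parses fine
java.sql.Timestamp.valueOf("2014-12-11astring00:00:00")  // throws IllegalArgumentException
{code}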
[jira] [Resolved] (SPARK-4154) Query does not work if it has not between in Spark SQL and HQL
[ https://issues.apache.org/jira/browse/SPARK-4154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4154. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3017 [https://github.com/apache/spark/pull/3017] Query does not work if it has not between in Spark SQL and HQL - Key: SPARK-4154 URL: https://issues.apache.org/jira/browse/SPARK-4154 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Ravindra Pesala Assignee: Ravindra Pesala Fix For: 1.2.0 If the query contains {{not between}}, it does not work.
{code}
SELECT * FROM src where key not between 10 and 20
{code}
It gives the following error
{code}
Exception in thread "main" java.lang.RuntimeException: Unsupported language features in query: SELECT * FROM src where key not between 10 and 20
TOK_QUERY
  TOK_FROM
    TOK_TABREF
      TOK_TABNAME
        src
  TOK_INSERT
    TOK_DESTINATION
      TOK_DIR
        TOK_TMP_FILE
    TOK_SELECT
      TOK_SELEXPR
        TOK_ALLCOLREF
    TOK_WHERE
      TOK_FUNCTION
        between
        KW_TRUE
        TOK_TABLE_OR_COL
          key
        10
        20

scala.NotImplementedError: No parse rules for ASTNode type: 256, text: KW_TRUE : KW_TRUE
+ org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1088)
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:251)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
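Until the fix, the same predicate can be written without {{not between}}; a hedged workaround sketch (equivalent because {{between}} is inclusive on both bounds):
{code}
// Same rows as: SELECT * FROM src WHERE key NOT BETWEEN 10 AND 20
sql("SELECT * FROM src WHERE key < 10 OR key > 20")
{code}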
[jira] [Resolved] (SPARK-2220) Fix remaining Hive Commands
[ https://issues.apache.org/jira/browse/SPARK-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2220. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3038 [https://github.com/apache/spark/pull/3038] Fix remaining Hive Commands --- Key: SPARK-2220 URL: https://issues.apache.org/jira/browse/SPARK-2220 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Fix For: 1.2.0 None of the following have an execution plan:
{code}
private[hive] case class ShellCommand(cmd: String) extends Command
private[hive] case class SourceCommand(filePath: String) extends Command
private[hive] case class AddFile(filePath: String) extends Command
{code}
dfs is being fixed in a related PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4173) EdgePartitionBuilder uses wrong value for first clustered index
Ankur Dave created SPARK-4173: - Summary: EdgePartitionBuilder uses wrong value for first clustered index Key: SPARK-4173 URL: https://issues.apache.org/jira/browse/SPARK-4173 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.1.0, 1.0.2, 1.2.0 Reporter: Ankur Dave Assignee: Ankur Dave Lines 48 and 49 in EdgePartitionBuilder reference {{srcIds}} before it has been initialized, causing an incorrect value to be stored for the first cluster. https://github.com/apache/spark/blob/23468e7e96bf047ba53806352558b9d661567b23/graphx/src/main/scala/org/apache/spark/graphx/impl/EdgePartitionBuilder.scala#L48-49 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4141) Hide Accumulators column on stage page when no accumulators exist
[ https://issues.apache.org/jira/browse/SPARK-4141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4141. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3031 [https://github.com/apache/spark/pull/3031] Hide Accumulators column on stage page when no accumulators exist - Key: SPARK-4141 URL: https://issues.apache.org/jira/browse/SPARK-4141 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Priority: Minor Labels: starter Fix For: 1.2.0 The task table on the details page for each stage has a column for accumulators. We should only show this column if the stage has accumulators, otherwise it clutters the UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4174) Optionally provide notifications to Receivers when DStream has been generated
Hari Shreedharan created SPARK-4174: --- Summary: Optionally provide notifications to Receivers when DStream has been generated Key: SPARK-4174 URL: https://issues.apache.org/jira/browse/SPARK-4174 Project: Spark Issue Type: Bug Reporter: Hari Shreedharan Receivers receiving data from message queues, like ActiveMQ, Kafka, etc., can replay messages if required. Using the HDFS WAL mechanism for such systems affects efficiency, as we are incurring an unnecessary HDFS write when we can recover the data from the queue anyway. We can fix this by providing a notification to the receiver when the RDD is generated from the blocks. We need to consider the case where a receiver might fail before the RDD is generated and come back on a different executor when the RDD is generated. Either way, this is likely to cause duplicates and not data loss -- so we may be ok. I am thinking about something of the order of accepting a callback function which gets called when the RDD is generated. We can keep the function local in a map of batch id -> function, which gets called when the RDD gets generated (we can inform the ReceiverSupervisorImpl via Akka when the driver generates the RDD). Of course, just an early thought - I will work on a design doc for this one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4174) Streaming: Optionally provide notifications to Receivers when DStream has been generated
[ https://issues.apache.org/jira/browse/SPARK-4174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Shreedharan updated SPARK-4174: Summary: Streaming: Optionally provide notifications to Receivers when DStream has been generated (was: Optionally provide notifications to Receivers when DStream has been generated) Streaming: Optionally provide notifications to Receivers when DStream has been generated Key: SPARK-4174 URL: https://issues.apache.org/jira/browse/SPARK-4174 Project: Spark Issue Type: Bug Reporter: Hari Shreedharan Assignee: Hari Shreedharan Receivers receiving data from message queues, like ActiveMQ, Kafka, etc., can replay messages if required. Using the HDFS WAL mechanism for such systems affects efficiency, as we are incurring an unnecessary HDFS write when we can recover the data from the queue anyway. We can fix this by providing a notification to the receiver when the RDD is generated from the blocks. We need to consider the case where a receiver might fail before the RDD is generated and come back on a different executor when the RDD is generated. Either way, this is likely to cause duplicates and not data loss -- so we may be ok. I am thinking about something of the order of accepting a callback function which gets called when the RDD is generated. We can keep the function local in a map of batch id -> function, which gets called when the RDD gets generated (we can inform the ReceiverSupervisorImpl via Akka when the driver generates the RDD). Of course, just an early thought - I will work on a design doc for this one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4174) Streaming: Optionally provide notifications to Receivers when DStream has been generated
[ https://issues.apache.org/jira/browse/SPARK-4174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Shreedharan updated SPARK-4174: Issue Type: Improvement (was: Bug) Streaming: Optionally provide notifications to Receivers when DStream has been generated Key: SPARK-4174 URL: https://issues.apache.org/jira/browse/SPARK-4174 Project: Spark Issue Type: Improvement Reporter: Hari Shreedharan Assignee: Hari Shreedharan Receivers receiving data from message queues, like ActiveMQ, Kafka, etc., can replay messages if required. Using the HDFS WAL mechanism for such systems affects efficiency, as we are incurring an unnecessary HDFS write when we can recover the data from the queue anyway. We can fix this by providing a notification to the receiver when the RDD is generated from the blocks. We need to consider the case where a receiver might fail before the RDD is generated and come back on a different executor when the RDD is generated. Either way, this is likely to cause duplicates and not data loss -- so we may be ok. I am thinking about something of the order of accepting a callback function which gets called when the RDD is generated. We can keep the function local in a map of batch id -> function, which gets called when the RDD gets generated (we can inform the ReceiverSupervisorImpl via Akka when the driver generates the RDD). Of course, just an early thought - I will work on a design doc for this one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
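A minimal sketch of the callback registry described in the ticket (all names here are hypothetical; this is the shape of the idea, not a design):
{code}
import scala.collection.mutable

// batch time -> callback, fired once when the driver reports that the RDD
// for that batch has been generated.
class RDDGeneratedNotifier {
  private val callbacks = mutable.Map.empty[Long, () => Unit]

  def register(batchTimeMs: Long)(callback: () => Unit): Unit =
    callbacks.synchronized { callbacks(batchTimeMs) = callback }

  def onRDDGenerated(batchTimeMs: Long): Unit =
    callbacks.synchronized { callbacks.remove(batchTimeMs) }.foreach(f => f())
}
{code}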
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192273#comment-14192273 ] Nicholas Chammas commented on SPARK-3821: - Hey folks, I was hoping to post a design doc here this week and get feedback but I will have to push that back to next week. Been very busy this week and will be away from a computer all weekend. Apologies. Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4150) rdd.setName returns None in PySpark
[ https://issues.apache.org/jira/browse/SPARK-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4150. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3011 [https://github.com/apache/spark/pull/3011] rdd.setName returns None in PySpark --- Key: SPARK-4150 URL: https://issues.apache.org/jira/browse/SPARK-4150 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Trivial Fix For: 1.2.0 We should return self so we can do {code} rdd.setName('abc').cache().count() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1267) Add a pip installer for PySpark
[ https://issues.apache.org/jira/browse/SPARK-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192279#comment-14192279 ] Davies Liu commented on SPARK-1267: --- Because PySpark depends on the Spark packages, a Python user cannot use it after 'pip install pyspark', so there is not much benefit from this. Once we release PySpark separately from Spark, we would have to keep compatibility across versions of PySpark and Spark, which would be a nightmare for us (we could not move fast to improve the implementation of PySpark). So, I think we cannot do this in the near future. [~prabinb], do you mind closing the PR? Add a pip installer for PySpark --- Key: SPARK-1267 URL: https://issues.apache.org/jira/browse/SPARK-1267 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Prabin Banka Priority: Minor Labels: pyspark Please refer to this mail archive, http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3CCAOEPXP7jKiw-3M8eh2giBcs8gEkZ1upHpGb=fqoucvscywj...@mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3870) EOL character enforcement
[ https://issues.apache.org/jira/browse/SPARK-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3870. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2726 [https://github.com/apache/spark/pull/2726] EOL character enforcement - Key: SPARK-3870 URL: https://issues.apache.org/jira/browse/SPARK-3870 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 1.2.0 Reporter: Kousuke Saruta Priority: Minor Fix For: 1.2.0 We have shell scripts and Windows batch files, so we should enforce proper EOL character. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3640) KinesisUtils should accept a credentials object instead of forcing DefaultCredentialsProvider
[ https://issues.apache.org/jira/browse/SPARK-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192411#comment-14192411 ] Chris Fregly commented on SPARK-3640: - Agreed that this was not ideal when I first chose this implementation. And as you mentioned, the NotSerializableException is exactly why I went with the DefaultCredentialsProvider. So I spent some time trying to solve this using AWS IAM roles on separate users under your root AWS account. This appears to work well with the existing DefaultCredentialsProvider. Is this a viable option for you? Basically, every user would get their own ACCESS_KEY_ID and SECRET_KEY. This would be used in place of the root credentials. For thoroughness, I've included links to the instructions as well as an example IAM policy JSON (I'll also add this to the Spark Kinesis Developer Guide: http://spark.apache.org/docs/latest/streaming-kinesis-integration.html):
Creating IAM users
http://docs.aws.amazon.com/IAM/latest/UserGuide/Using_SettingUpUser.html
https://console.aws.amazon.com/iam/home?#security_credential
Setting up Kinesis, DynamoDB, and CloudWatch IAM Policy for the new users
http://docs.aws.amazon.com/kinesis/latest/dev/kinesis-using-iam.html
IAM Policy Generator
http://awspolicygen.s3.amazonaws.com/policygen.html
Attaching the Custom Policy
https://console.aws.amazon.com/iam/home?#users
Select the user
Select Attach Policy
Select Custom Policy
IAM Policy JSON
This is already generated using the Policy Generator above... just fill in the missing pieces specific to your environment.
{code}
{
  "Statement": [
    {
      "Sid": "Stmt1414784467497",
      "Action": "kinesis:*",
      "Effect": "Allow",
      "Resource": "arn:aws:kinesis:region-of-stream:aws-account-id:stream/stream-name"
    },
    {
      "Sid": "Stmt1414784693732",
      "Action": "dynamodb:*",
      "Effect": "Allow",
      "Resource": "arn:aws:dynamodb:us-east-1:aws-account-id:table/dynamodb-tablename"
    },
    {
      "Sid": "Stmt1414785131046",
      "Action": "cloudwatch:*",
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
{code}
Notes:
* The region of the DynamoDB table is intentionally hard-coded to us-east-1 as this is how Kinesis currently works
* The DynamoDB table is the same as the application name of the Kinesis Streaming Application. The sample included with the Spark distribution uses KinesisWordCount for the application/table name.
Is this a sufficient workaround? Using IAM policies is an AWS best practice, but I'm not sure if this aligns with your existing environment. If not, I can continue to investigate exposing that CredentialsProvider. Lemme know, Aniket! KinesisUtils should accept a credentials object instead of forcing DefaultCredentialsProvider - Key: SPARK-3640 URL: https://issues.apache.org/jira/browse/SPARK-3640 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Aniket Bhatnagar Labels: kinesis KinesisUtils should accept AWS Credentials as a parameter and should default to DefaultCredentialsProvider if no credentials are provided. Currently, the implementation forces usage of DefaultCredentialsProvider which can be a pain especially when jobs are run by multiple unix users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4158) Spark throws exception when Mesos resources are missing
[ https://issues.apache.org/jira/browse/SPARK-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192448#comment-14192448 ] RJ Nowling commented on SPARK-4158: --- I verified that the associated patch fixes this issue on our local cluster running Spark 1.1.0 and Mesos 0.21. Spark throws exception when Mesos resources are missing --- Key: SPARK-4158 URL: https://issues.apache.org/jira/browse/SPARK-4158 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.1.0 Reporter: Brenden Matthews Spark throws an exception when trying to check resources which haven't been offered by Mesos. This is an error in Spark, and should be corrected as such. Here's a sample:
{code}
val data
Exception in thread "Thread-41" java.lang.IllegalArgumentException: No resource called cpus in [name: mem type: SCALAR scalar { value: 2067.0 } role: * , name: disk type: SCALAR scalar { value: 900.0 } role: * , name: ports type: RANGES ranges { range { begin: 31000 end: 32000 } } role: * ]
at org.apache.spark.scheduler.cluster.mesos.CoarseMesosSchedulerBackend.org$apache$spark$scheduler$cluster$mesos$CoarseMesosSchedulerBackend$$getResource(CoarseMesosSchedulerBackend.scala:236)
at org.apache.spark.scheduler.cluster.mesos.CoarseMesosSchedulerBackend$$anonfun$resourceOffers$1.apply(CoarseMesosSchedulerBackend.scala:200)
at org.apache.spark.scheduler.cluster.mesos.CoarseMesosSchedulerBackend$$anonfun$resourceOffers$1.apply(CoarseMesosSchedulerBackend.scala:197)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.scheduler.cluster.mesos.CoarseMesosSchedulerBackend.resourceOffers(CoarseMesosSchedulerBackend.scala:197)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192491#comment-14192491 ] Zhijie Shen commented on SPARK-1537: bq. BTW, if you want a list of things I think are important for Spark, here are some quick ones: Thanks for sharing the details, which are far more helpful for clearing things up than big but vague statements. Let me go through the aforementioned JIRAs: * YARN-2521: I'd like to keep it open for some further client improvements, such as local timeline data caching, while YARN-2673 already made the client retry when the server temporarily doesn't respond. Please note that the concern I think it's pretty critical when you can't upload your data because the server is down is *no longer true* after YARN-2673. On the other side, from the point of view of the API, it should stay stable. * YARN-2423: This is proposed to improve the Java libs by adding GET APIs. They are used to query data, NOT to put data. We do this to support the use case where developers write Java code to implement a UI for analyzing the timeline data. Framework integration mainly deals with PUT APIs, and those Java client libs are already there. Taking one step back: apart from the client libs, the RESTful APIs are always there, which are programming-language neutral and useful to non-Java developers. * YARN-2444: It may be a bug or an improper use case. According to the exception, the user doesn't pass authorization for some reason. It was reported against 2.5 and is probably no longer valid after we fixed a bunch of security issues for 2.6. We need to do more validation on this issue before drawing a conclusion. Anyway, it's obviously an internal issue happening in secure mode only, which should not require API changes. bq. I understand it doesn't affect the client API and we can still have the code in It seems that we agree that the current timeline service offering is not blocking the Spark integration work. Add integration with Yarn's Application Timeline Server --- Key: SPARK-1537 URL: https://issues.apache.org/jira/browse/SPARK-1537 Project: Spark Issue Type: New Feature Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin It would be nice to have Spark integrate with Yarn's Application Timeline Server (see YARN-321, YARN-1530). This would allow users running Spark on Yarn to have a single place to go for all their history needs, and avoid having to manage a separate service (Spark's built-in server). At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, although there is still some ongoing work. But the basics are there, and I wouldn't expect them to change (much) at this point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
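For readers following the PUT-vs-GET distinction above, the PUT path goes through Hadoop's TimelineClient. A minimal sketch against the Hadoop 2.4+ Java API (the entity type and id values are made up):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity
import org.apache.hadoop.yarn.client.api.TimelineClient

val client = TimelineClient.createTimelineClient()
client.init(new Configuration())
client.start()

// Build one timeline entity; type and id are illustrative.
val entity = new TimelineEntity()
entity.setEntityType("SPARK_APPLICATION")
entity.setEntityId("application_0001")
entity.setStartTime(System.currentTimeMillis())

client.putEntities(entity)  // the PUT side; the GET client libs are what YARN-2423 would add
client.stop()
{code}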
[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192502#comment-14192502 ] Marcelo Vanzin commented on SPARK-1537: --- bq. This is proposed to improve the Java libs by adding GET APIs. They are used to query data, NOT to put data. Spark needs both to put and read data, otherwise the ATS is useless for Spark. The current goal of Spark is to use the ATS as a store for its history data, since the data itself is not considered public or stable. So there is no point in the integration if you can only write data. (I know you can read data through other means, but I don't want to write a custom REST client just to get ATS support in.) bq. It was reported against 2.5 and is probably no longer valid after we fixed a bunch of security issues for 2.6. I'm not sure why you say it's security-related, since there's nothing security-related in the example code I posted. And if something doesn't work in 2.5 but works in 2.6, it means we (and by that I mean Spark) have to restrict our support to the versions where things work - even if the underlying API is exactly the same. Add integration with Yarn's Application Timeline Server --- Key: SPARK-1537 URL: https://issues.apache.org/jira/browse/SPARK-1537 Project: Spark Issue Type: New Feature Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin It would be nice to have Spark integrate with Yarn's Application Timeline Server (see YARN-321, YARN-1530). This would allow users running Spark on Yarn to have a single place to go for all their history needs, and avoid having to manage a separate service (Spark's built-in server). At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, although there is still some ongoing work. But the basics are there, and I wouldn't expect them to change (much) at this point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192538#comment-14192538 ] Zhijie Shen commented on SPARK-1537: bq. Spark needs both to put and read data It's again a vague statement. Can you share your design details, so that we can evaluate whether it is really necessary? And what is the actual way of visualizing the data? Integration work is not a single bug-fix patch; we can divide the work into a sequence of subtasks, and the first step is to enable a Spark job to put its data into the timeline server. By doing this, not only can Spark's own web frontend visualize job history, but third-party tools can do Spark job analysis too. bq. I'm not sure why you say it's security-related, since there's nothing security-related in the example code I posted. I said that, according to the exception, the user doesn't pass authorization for some reason. If you don't agree, please post your investigation on YARN-2444; the YARN folks will help you with this issue. bq. if something doesn't work in 2.5 but works in 2.6 Regardless of the timeline service integration, Spark on YARN already picks particular Hadoop versions. It doesn't make sense to ask for a feature from an early version that doesn't have it. Add integration with Yarn's Application Timeline Server --- Key: SPARK-1537 URL: https://issues.apache.org/jira/browse/SPARK-1537 Project: Spark Issue Type: New Feature Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin It would be nice to have Spark integrate with Yarn's Application Timeline Server (see YARN-321, YARN-1530). This would allow users running Spark on Yarn to have a single place to go for all their history needs, and avoid having to manage a separate service (Spark's built-in server). At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, although there is still some ongoing work. But the basics are there, and I wouldn't expect them to change (much) at this point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192561#comment-14192561 ] Marcelo Vanzin commented on SPARK-1537: --- bq. It's again a vague statement. I don't know what is vague about wanting to read the data you write. bq. Can you share your design details I already did better than that, much earlier in this bug: I shared the actual code. For this particular question, here it is: https://github.com/vanzin/spark/blob/yarn-timeline/yarn/timeline/src/main/scala/org/apache/spark/deploy/yarn/timeline/YarnTimelineProvider.scala See how it reads data from the ATS? It feeds it into the Spark history server, where the data can be visualized. It's using Yarn internal APIs, which is generally bad practice. bq. If you don't agree, please post your investigation on YARN-2444; the YARN folks will help you with this issue. I posted the error and the code to reproduce it. I don't know what else you expect from me. If you think it's an authorization issue, test it with 2.6 and close the bug if you believe it's fixed. bq. Regardless of the timeline service integration, Spark on YARN already picks particular Hadoop versions. It doesn't make sense to ask for a feature from an early version that doesn't have it. I'm not sure I really understood what you're trying to say here. Yes, we have to pick versions. We need a version that supports the features we need. Even if the API in 2.5 didn't change in 2.6, it seems to have bugs that prevent my current code from working, so there is no point in trying to integrate with 2.5 as far as I'm concerned. And as far as I know, 2.6 hasn't been released yet. (BTW, my code used to work with 2.4.) Add integration with Yarn's Application Timeline Server --- Key: SPARK-1537 URL: https://issues.apache.org/jira/browse/SPARK-1537 Project: Spark Issue Type: New Feature Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin It would be nice to have Spark integrate with Yarn's Application Timeline Server (see YARN-321, YARN-1530). This would allow users running Spark on Yarn to have a single place to go for all their history needs, and avoid having to manage a separate service (Spark's built-in server). At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, although there is still some ongoing work. But the basics are there, and I wouldn't expect them to change (much) at this point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4175) Exception on stage page
Sandy Ryza created SPARK-4175: - Summary: Exception on stage page Key: SPARK-4175 URL: https://issues.apache.org/jira/browse/SPARK-4175 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Critical {code} 14/10/31 14:52:58 WARN servlet.ServletHandler: /stages/stage/ java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:313) at scala.None$.get(Option.scala:311) at org.apache.spark.ui.jobs.StagePage.taskRow(StagePage.scala:331) at org.apache.spark.ui.jobs.StagePage$$anonfun$8.apply(StagePage.scala:173) at org.apache.spark.ui.jobs.StagePage$$anonfun$8.apply(StagePage.scala:173) at org.apache.spark.ui.UIUtils$$anonfun$listingTable$2.apply(UIUtils.scala:282) at org.apache.spark.ui.UIUtils$$anonfun$listingTable$2.apply(UIUtils.scala:282) at scala.collection.immutable.Stream.map(Stream.scala:376) at org.apache.spark.ui.UIUtils$.listingTable(UIUtils.scala:282) at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:171) at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68) at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68) at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:68) at javax.servlet.http.HttpServlet.service(HttpServlet.java:735) at javax.servlet.http.HttpServlet.service(HttpServlet.java:848) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1496) at org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:164) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:499) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:370) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667) at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:722) {code} I'm guessing this was caused by SPARK-4016? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
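The None.get comes from dereferencing an Option unconditionally for tasks that have no metrics yet. A self-contained sketch of the usual defensive pattern (the Metrics stand-in and its field are illustrative, not the actual fix):
{code}
// Stand-in for the real TaskMetrics; only the field used below.
case class Metrics(executorRunTime: Long)

val maybeMetrics: Option[Metrics] = None  // e.g. a task that is still running

// Instead of maybeMetrics.get, which throws NoSuchElementException on None:
val runTime: Long = maybeMetrics match {
  case Some(m) => m.executorRunTime
  case None    => 0L  // no metrics reported yet
}

// Or, more compactly:
val runTime2: Long = maybeMetrics.map(_.executorRunTime).getOrElse(0L)
{code}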
[jira] [Commented] (SPARK-4016) Allow user to optionally show additional, advanced metrics in the UI
[ https://issues.apache.org/jira/browse/SPARK-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192604#comment-14192604 ] Sandy Ryza commented on SPARK-4016: --- It looks like after this change, stage-level summary metrics no longer include in-progress tasks. Is this on purpose? Allow user to optionally show additional, advanced metrics in the UI Key: SPARK-4016 URL: https://issues.apache.org/jira/browse/SPARK-4016 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor Fix For: 1.2.0 Allowing the user to show/hide additional metrics will allow us to both (1) add more advanced metrics without cluttering the UI for the average user and (2) hide, by default, some of the metrics currently shown that are not widely used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4016) Allow user to optionally show additional, advanced metrics in the UI
[ https://issues.apache.org/jira/browse/SPARK-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192609#comment-14192609 ] Sandy Ryza commented on SPARK-4016: --- Also, it looks like this can cause an exception: SPARK-4175 Allow user to optionally show additional, advanced metrics in the UI Key: SPARK-4016 URL: https://issues.apache.org/jira/browse/SPARK-4016 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor Fix For: 1.2.0 Allowing the user to show/hide additional metrics will allow us to both (1) add more advanced metrics without cluttering the UI for the average user and (2) hide, by default, some of the metrics currently shown that are not widely used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4175) Exception on stage page
[ https://issues.apache.org/jira/browse/SPARK-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192608#comment-14192608 ] Apache Spark commented on SPARK-4175: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/3043 Exception on stage page --- Key: SPARK-4175 URL: https://issues.apache.org/jira/browse/SPARK-4175 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Critical {code} 14/10/31 14:52:58 WARN servlet.ServletHandler: /stages/stage/ java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:313) at scala.None$.get(Option.scala:311) at org.apache.spark.ui.jobs.StagePage.taskRow(StagePage.scala:331) at org.apache.spark.ui.jobs.StagePage$$anonfun$8.apply(StagePage.scala:173) at org.apache.spark.ui.jobs.StagePage$$anonfun$8.apply(StagePage.scala:173) at org.apache.spark.ui.UIUtils$$anonfun$listingTable$2.apply(UIUtils.scala:282) at org.apache.spark.ui.UIUtils$$anonfun$listingTable$2.apply(UIUtils.scala:282) at scala.collection.immutable.Stream.map(Stream.scala:376) at org.apache.spark.ui.UIUtils$.listingTable(UIUtils.scala:282) at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:171) at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68) at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68) at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:68) at javax.servlet.http.HttpServlet.service(HttpServlet.java:735) at javax.servlet.http.HttpServlet.service(HttpServlet.java:848) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1496) at org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:164) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:499) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:370) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667) at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:722) {code} I'm guessing this was caused by SPARK-4016? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3561) Allow for pluggable execution contexts in Spark
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3561: - Fix Version/s: (was: 1.2.0) Allow for pluggable execution contexts in Spark --- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource managers such as Apache Hadoop YARN, Mesos, etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to the Hadoop execution environment - as a non-public API (@Experimental) not exposed to end users of Spark. The trait will define 6 operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob * persist * unpersist Each method directly maps to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be accessed by SparkContext via a master URL such as execution-context:foo.bar.MyJobExecutionContext, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to such an implementation. An integrator will now have the option to provide a custom implementation by either implementing it from scratch or extending from DefaultExecutionContext. Please see the attached design doc for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
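A rough Scala rendering of such a trait, reconstructed from the six operations listed in the description; the parameter lists are placeholders, since the real signatures live in the attached design doc:
{code}
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Illustrative only: one method per operation named in the proposal.
trait JobExecutionContext {
  def hadoopFile[K, V](sc: SparkContext, path: String /* ... */): RDD[(K, V)]
  def newAPIHadoopFile[K, V](sc: SparkContext, path: String /* ... */): RDD[(K, V)]
  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]
  def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U]
  def persist[T](rdd: RDD[T]): RDD[T]
  def unpersist[T](rdd: RDD[T]): RDD[T]
}
{code}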
[jira] [Commented] (SPARK-4016) Allow user to optionally show additional, advanced metrics in the UI
[ https://issues.apache.org/jira/browse/SPARK-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192653#comment-14192653 ] Kay Ousterhout commented on SPARK-4016: --- [~sandyr] definitely not intentional to change the behavior of stage-level summary metrics -- can you clarify where you're seeing this? Allow user to optionally show additional, advanced metrics in the UI Key: SPARK-4016 URL: https://issues.apache.org/jira/browse/SPARK-4016 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor Fix For: 1.2.0 Allowing the user to show/hide additional metrics will allow us to both (1) add more advanced metrics without cluttering the UI for the average user and (2) hide, by default, some of the metrics currently shown that are not widely used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4016) Allow user to optionally show additional, advanced metrics in the UI
[ https://issues.apache.org/jira/browse/SPARK-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192655#comment-14192655 ] Kay Ousterhout commented on SPARK-4016: --- (I think the summary table was always for only finished tasks, as controlled by this line: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala#L169) Allow user to optionally show additional, advanced metrics in the UI Key: SPARK-4016 URL: https://issues.apache.org/jira/browse/SPARK-4016 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor Fix For: 1.2.0 Allowing the user to show/hide additional metrics will allow us to both (1) add more advanced metrics without cluttering the UI for the average user and (2) hide, by default, some of the metrics currently shown that are not widely used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4176) Support decimals with precision > 18 in Parquet
Matei Zaharia created SPARK-4176: Summary: Support decimals with precision > 18 in Parquet Key: SPARK-4176 URL: https://issues.apache.org/jira/browse/SPARK-4176 Project: Spark Issue Type: New Feature Components: SQL Reporter: Matei Zaharia After https://issues.apache.org/jira/browse/SPARK-3929, only decimals with precision <= 18 (whose unscaled value can be read into a Long) will be readable from Parquet, so we still need more work to support the larger ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
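The cutoff at precision 18 follows from the range of a signed 64-bit Long; a quick arithmetic check:
{code}
// The largest 18-digit unscaled decimal always fits in a Long
// (Long.MaxValue = 9223372036854775807, a 19-digit number),
// while a 19-digit value can overflow.
val maxUnscaled18 = BigInt(10).pow(18) - 1               // 999999999999999999
assert(maxUnscaled18 <= BigInt(Long.MaxValue))           // fits in a Long
assert(BigInt(10).pow(19) - 1 > BigInt(Long.MaxValue))   // 19 digits may not
{code}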
[jira] [Resolved] (SPARK-3652) upgrade spark sql hive version to 0.13.1
[ https://issues.apache.org/jira/browse/SPARK-3652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangfei resolved SPARK-3652. Resolution: Fixed upgrade spark sql hive version to 0.13.1 Key: SPARK-3652 URL: https://issues.apache.org/jira/browse/SPARK-3652 Project: Spark Issue Type: Dependency upgrade Components: SQL Affects Versions: 1.1.0 Reporter: wangfei The Spark SQL Hive version is currently 0.12.0; compiling with 0.13.1 produces errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3322) ConnectionManager logs an error when the application ends
[ https://issues.apache.org/jira/browse/SPARK-3322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192701#comment-14192701 ] wangfei commented on SPARK-3322: yes, let's close this. ConnectionManager logs an error when the application ends - Key: SPARK-3322 URL: https://issues.apache.org/jira/browse/SPARK-3322 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: wangfei Although it does not influence the result, it always logs an error from ConnectionManager. Sometimes it only logs ConnectionManagerId(vm2,40992) not found, and sometimes it also logs a CancelledKeyException. The log output is as follows: 14/08/29 16:54:53 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(vm2,40992) not found 14/08/29 16:54:53 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@457245f9 java.nio.channels.CancelledKeyException at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386) at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-2460) Optimize SparkContext.hadoopFile api
[ https://issues.apache.org/jira/browse/SPARK-2460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangfei closed SPARK-2460. -- Resolution: Fixed Optimize SparkContext.hadoopFile api - Key: SPARK-2460 URL: https://issues.apache.org/jira/browse/SPARK-2460 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: wangfei Fix For: 1.2.0 1. Use SparkContext.hadoopRDD() instead of instantiating HadoopRDD directly in SparkContext.hadoopFile. 2. Broadcast the JobConf in HadoopRDD, not the Configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
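For context, the public API path that point 1 refers to looks like this (a usage sketch with standard Hadoop input classes, not the patch itself):
{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}
import org.apache.spark.SparkContext

def readViaHadoopRDD(sc: SparkContext, path: String) = {
  // Build a JobConf (the object worth broadcasting, per point 2) ...
  val jobConf = new JobConf(sc.hadoopConfiguration)
  FileInputFormat.setInputPaths(jobConf, path)
  // ... and go through sc.hadoopRDD instead of `new HadoopRDD(...)`.
  sc.hadoopRDD(jobConf, classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text])
}
{code}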
[jira] [Created] (SPARK-4177) update build doc for JDBC/CLI already supporting hive 13
wangfei created SPARK-4177: -- Summary: update build doc for JDBC/CLI already supporting hive 13 Key: SPARK-4177 URL: https://issues.apache.org/jira/browse/SPARK-4177 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: wangfei Fix For: 1.2.0 Fix the build doc, since hive 13 is already supported in the JDBC/CLI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4177) update build doc for already supporting hive 13 in jdbc/cli
[ https://issues.apache.org/jira/browse/SPARK-4177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192707#comment-14192707 ] Apache Spark commented on SPARK-4177: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/3042 update build doc for already supporting hive 13 in jdbc/cli --- Key: SPARK-4177 URL: https://issues.apache.org/jira/browse/SPARK-4177 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: wangfei Fix For: 1.2.0 Fix the build doc, since hive 13 is already supported in the JDBC/CLI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3974) Block matrix abstractions and partitioners
[ https://issues.apache.org/jira/browse/SPARK-3974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192731#comment-14192731 ] Burak Yavuz commented on SPARK-3974: Hi everyone, The design doc for Block Matrix abstractions and the work on matrix multiplication can be found here: goo.gl/zbU1Nz Let me know if you have any comments / suggestions. I will hopefully have the PR for this ready by next Friday. Block matrix abstractions and partitioners -- Key: SPARK-3974 URL: https://issues.apache.org/jira/browse/SPARK-3974 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Assignee: Burak Yavuz We need abstractions for block matrices with fixed block sizes, with each block being dense. Partitioners along both rows and columns are required. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
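Pending that PR, a partitioner over (blockRow, blockCol) keys typically takes the following shape (illustrative only; the design doc's actual API may differ):
{code}
import org.apache.spark.Partitioner

// Maps a (blockRow, blockCol) key to a partition in row-major order.
class GridPartitioner(numRowBlocks: Int, numColBlocks: Int) extends Partitioner {
  override def numPartitions: Int = numRowBlocks * numColBlocks

  override def getPartition(key: Any): Int = key match {
    case (i: Int, j: Int) => i * numColBlocks + j
    case _ => throw new IllegalArgumentException(s"Unexpected key: $key")
  }
}
{code}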
[jira] [Created] (SPARK-4178) Hadoop input metrics ignore bytes read in RecordReader instantiation
Sandy Ryza created SPARK-4178: - Summary: Hadoop input metrics ignore bytes read in RecordReader instantiation Key: SPARK-4178 URL: https://issues.apache.org/jira/browse/SPARK-4178 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
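The likely mechanics of the bug and fix, sketched under the assumption that bytes read are tracked via a filesystem-statistics callback (names and structure here are illustrative, not the actual PR):
{code}
import org.apache.hadoop.mapred.{InputFormat, InputSplit, JobConf, RecordReader, Reporter}

// Snapshot the bytes-read counter BEFORE constructing the RecordReader, so
// bytes consumed during instantiation are attributed to this task as well.
// `readBytes` is a hypothetical stand-in for the FileSystem statistics callback.
def openWithMetrics[K, V](
    fmt: InputFormat[K, V], split: InputSplit, conf: JobConf,
    readBytes: () => Long): (RecordReader[K, V], () => Long) = {
  val baseline = readBytes()
  val reader = fmt.getRecordReader(split, conf, Reporter.NULL)
  (reader, () => readBytes() - baseline)
}
{code}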
[jira] [Commented] (SPARK-4178) Hadoop input metrics ignore bytes read in RecordReader instantiation
[ https://issues.apache.org/jira/browse/SPARK-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192773#comment-14192773 ] Sandy Ryza commented on SPARK-4178: --- Thanks [~kostas] for noticing this. Hadoop input metrics ignore bytes read in RecordReader instantiation Key: SPARK-4178 URL: https://issues.apache.org/jira/browse/SPARK-4178 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4178) Hadoop input metrics ignore bytes read in RecordReader instantiation
[ https://issues.apache.org/jira/browse/SPARK-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192775#comment-14192775 ] Apache Spark commented on SPARK-4178: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/3045 Hadoop input metrics ignore bytes read in RecordReader instantiation Key: SPARK-4178 URL: https://issues.apache.org/jira/browse/SPARK-4178 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4175) Exception on stage page
[ https://issues.apache.org/jira/browse/SPARK-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-4175. --- Resolution: Fixed Fix Version/s: 1.2.0 Exception on stage page --- Key: SPARK-4175 URL: https://issues.apache.org/jira/browse/SPARK-4175 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Critical Fix For: 1.2.0 {code} 14/10/31 14:52:58 WARN servlet.ServletHandler: /stages/stage/ java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:313) at scala.None$.get(Option.scala:311) at org.apache.spark.ui.jobs.StagePage.taskRow(StagePage.scala:331) at org.apache.spark.ui.jobs.StagePage$$anonfun$8.apply(StagePage.scala:173) at org.apache.spark.ui.jobs.StagePage$$anonfun$8.apply(StagePage.scala:173) at org.apache.spark.ui.UIUtils$$anonfun$listingTable$2.apply(UIUtils.scala:282) at org.apache.spark.ui.UIUtils$$anonfun$listingTable$2.apply(UIUtils.scala:282) at scala.collection.immutable.Stream.map(Stream.scala:376) at org.apache.spark.ui.UIUtils$.listingTable(UIUtils.scala:282) at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:171) at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68) at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68) at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:68) at javax.servlet.http.HttpServlet.service(HttpServlet.java:735) at javax.servlet.http.HttpServlet.service(HttpServlet.java:848) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1496) at org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:164) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:499) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:370) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667) at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:722) {code} I'm guessing this was caused by SPARK-4016? 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2329) Add multi-label evaluation metrics
[ https://issues.apache.org/jira/browse/SPARK-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2329. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 1270 [https://github.com/apache/spark/pull/1270] Add multi-label evaluation metrics -- Key: SPARK-2329 URL: https://issues.apache.org/jira/browse/SPARK-2329 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Alexander Ulanov Assignee: Alexander Ulanov Fix For: 1.2.0 Original Estimate: 72h Remaining Estimate: 72h There is no class in Spark MLlib for measuring the performance of multi-label classifiers. Multilabel classification is when the document is labeled with several labels (classes). This task involves adding the class for multilabel evaluation and unit tests. The following measures are to be implemented: Precision, Recall and F1-measure (1) based on documents averaged by the number of documents; (2) per label; (3) based on labels micro and macro averaged; (4) Hamming loss. Reference: Tsoumakas, Grigorios, Ioannis Katakis, and Ioannis Vlahavas. Mining multi-label data. Data mining and knowledge discovery handbook. Springer US, 2010. 667-685. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3838) Python code example for Word2Vec in user guide
[ https://issues.apache.org/jira/browse/SPARK-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-3838. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2952 [https://github.com/apache/spark/pull/2952] Python code example for Word2Vec in user guide -- Key: SPARK-3838 URL: https://issues.apache.org/jira/browse/SPARK-3838 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Reporter: Xiangrui Meng Assignee: Anant Daksh Asthana Priority: Trivial Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1547) Add gradient boosting algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1547. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2607 [https://github.com/apache/spark/pull/2607] Add gradient boosting algorithm to MLlib Key: SPARK-1547 URL: https://issues.apache.org/jira/browse/SPARK-1547 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.0 Reporter: Manish Amde Assignee: Manish Amde Fix For: 1.2.0 This task requires adding the gradient boosting algorithm to Spark MLlib. The implementation needs to adapt the gradient boosting algorithm to the scalable tree implementation. The tasks involves: - Comparing the various tradeoffs and finalizing the algorithm before implementation - Code implementation - Unit tests - Functional tests - Performance tests - Documentation [Ensembles design document (Google doc) | https://docs.google.com/document/d/1J0Q6OP2Ggx0SOtlPgRUkwLASrAkUJw6m6EK12jRDSNg/] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4127) Streaming Linear Regression- Python bindings
[ https://issues.apache.org/jira/browse/SPARK-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anant Daksh Asthana updated SPARK-4127: --- Summary: Streaming Linear Regression- Python bindings (was: Streaming Linear Regression) Streaming Linear Regression- Python bindings Key: SPARK-4127 URL: https://issues.apache.org/jira/browse/SPARK-4127 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Anant Daksh Asthana Priority: Minor Create Python bindings for Streaming Linear Regression (MLlib). The MLlib example file relevant to this issue can be found at: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3787) Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version
[ https://issues.apache.org/jira/browse/SPARK-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192931#comment-14192931 ] Apache Spark commented on SPARK-3787: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/3046 Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version --- Key: SPARK-3787 URL: https://issues.apache.org/jira/browse/SPARK-3787 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.2.0 Reporter: Kousuke Saruta When we build with sbt with a hadoop profile but without the hadoop version property, like: {code} sbt/sbt -Phadoop-2.2 assembly {code} the jar name always uses the default version (1.0.4). When we build with maven under the same conditions, the default version for each profile is used. For instance, if we build like: {code} mvn -Phadoop-2.2 package {code} the jar name uses 2.2.0, the default version for the hadoop-2.2 profile. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192935#comment-14192935 ] Debasish Das commented on SPARK-2426: - Refactored QuadraticMinimizer and NNLS from mllib optimization to breeze.optimize.quadratic: https://github.com/scalanlp/breeze/pull/321 I will update the PR as well, but the latest breeze depends on Scala 2.11 while Spark still uses 2.10. All license and copyright information has also moved to breeze, so no changes to Spark's license/notice files are needed. Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
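For readers less familiar with the setup, each per-user (or per-item) ALS subproblem these solvers target is a small quadratic program; in standard form (my notation, not taken from the talk or the PR):
{code}
\min_x \; \tfrac{1}{2}\, x^\top (A^\top A + \lambda I)\, x \;-\; (A^\top b)^\top x
\quad \text{s.t.} \quad x \in C
{code}
where C encodes the variant: x >= 0 for NNLS, a box l <= x <= u for ALS with bounds, an added L1 penalty mu*||x||_1 for ALS with L1 regularization, or an equality constraint such as sum(x) = 1 combined with bounds.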
[jira] [Resolved] (SPARK-3254) Streaming K-Means
[ https://issues.apache.org/jira/browse/SPARK-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-3254. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2942 [https://github.com/apache/spark/pull/2942] Streaming K-Means - Key: SPARK-3254 URL: https://issues.apache.org/jira/browse/SPARK-3254 Project: Spark Issue Type: New Feature Components: MLlib, Streaming Reporter: Xiangrui Meng Assignee: Jeremy Freeman Fix For: 1.2.0 Streaming K-Means with proper decay settings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1847) Pushdown filters on non-required parquet columns
[ https://issues.apache.org/jira/browse/SPARK-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1847. -- Resolution: Fixed Fix Version/s: 1.2.0 Pushdown filters on non-required parquet columns Key: SPARK-1847 URL: https://issues.apache.org/jira/browse/SPARK-1847 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust Assignee: Yash Datta Fix For: 1.2.0 From Andre: TODO: we currently only filter on non-nullable (Parquet REQUIRED) attributes until https://github.com/Parquet/parquet-mr/issues/371 has been resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3968) Use parquet-mr filter2 api in spark sql
[ https://issues.apache.org/jira/browse/SPARK-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3968: - Assignee: Yash Datta Use parquet-mr filter2 api in spark sql --- Key: SPARK-3968 URL: https://issues.apache.org/jira/browse/SPARK-3968 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Yash Datta Assignee: Yash Datta Priority: Minor Fix For: 1.2.0 The parquet-mr project has introduced a new filter api, along with several fixes (like filtering on optional fields). It can also eliminate entire RowGroups based on statistics like min/max. We can leverage that to further improve the performance of queries with filters. The filter2 api also introduces the ability to create custom filters. We can create a custom filter for the optimized In clause (InSet), so that elimination happens in the ParquetRecordReader itself (will create a separate ticket for that). This fixes the ticket below: https://issues.apache.org/jira/browse/SPARK-1847 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
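For a flavor of the filter2 API this ticket adopts (parquet-mr 1.x package names, i.e. the pre-rename parquet.* namespace; the column name is made up):
{code}
import parquet.filter2.compat.FilterCompat
import parquet.filter2.predicate.FilterApi

// Predicate: id > 100. Statistics-based RowGroup elimination means groups
// whose max(id) <= 100 are skipped without reading any records.
val id     = FilterApi.intColumn("id")
val pred   = FilterApi.gt(id, java.lang.Integer.valueOf(100))
val filter = FilterCompat.get(pred)  // what gets handed to the record reader
{code}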
[jira] [Updated] (SPARK-1847) Pushdown filters on non-required parquet columns
[ https://issues.apache.org/jira/browse/SPARK-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1847: - Assignee: Yash Datta Pushdown filters on non-required parquet columns Key: SPARK-1847 URL: https://issues.apache.org/jira/browse/SPARK-1847 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust Assignee: Yash Datta Fix For: 1.2.0 From Andre: TODO: we currently only filter on non-nullable (Parquet REQUIRED) attributes until https://github.com/Parquet/parquet-mr/issues/371 has been resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org