[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201765#comment-14201765 ] zzc commented on SPARK-2468: @Lianhui Wang, how can I view the logs showing that YARN killed the executor's container because its physical memory exceeded the allocated memory? I can't find them.

Netty-based block server / client module
Key: SPARK-2468
URL: https://issues.apache.org/jira/browse/SPARK-2468
Project: Spark
Issue Type: Improvement
Components: Shuffle, Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
Fix For: 1.2.0

Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user-space buffer, and then back into a kernel send buffer before it reaches the NIC. It makes multiple copies of the data and context-switches between kernel and user space. It also creates unnecessary buffers in the JVM that increase GC pressure. Instead, we should use FileChannel.transferTo, which handles this in kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty-based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default.
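For readers unfamiliar with the zero-copy path referenced in the description, here is a minimal, hedged sketch of FileChannel.transferTo in Scala; the file names are hypothetical, and a real server would transfer into a socket channel rather than another file:
{code}
import java.io.{FileInputStream, FileOutputStream}

// Hypothetical shuffle file and destination channel, for illustration only.
val src  = new FileInputStream("shuffle_0_0_0.data").getChannel
val dest = new FileOutputStream("received.data").getChannel

// transferTo lets the kernel move bytes directly from the page cache to the
// destination channel, avoiding the user-space copy described in the issue.
var position = 0L
val size = src.size()
while (position < size) {
  position += src.transferTo(position, size - position, dest)
}
src.close()
dest.close()
{code}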
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201766#comment-14201766 ] Aaron Davidson commented on SPARK-2468: --- [~zzcclp] Yes, please do. What's the memory of your YARN executors/containers? With preferDirectBufs off, we should allocate little to no off-heap memory, so these results are surprising.
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201772#comment-14201772 ] zzc commented on SPARK-2468: aa...@databricks.com?
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201774#comment-14201774 ] Aaron Davidson commented on SPARK-2468: --- Yup, that would work.
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201778#comment-14201778 ] Lianhui Wang commented on SPARK-2468: - [~zzcclp] In the AM's log you can find a message like: "Exit status: 143. Diagnostics: Container [container-id] is running beyond physical memory limits. Current usage: 8.3 GB of 8 GB physical memory used; 11.0 GB of 16.8 GB virtual memory used. Killing container." I had already set spark.yarn.executor.memoryOverhead=1024 and the executor's memory to 7G, so from the log above I can confirm that the executor uses a large amount of non-heap JVM memory.
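For reference, a hedged sketch of how the numbers in that comment compose on the configuration side (the app name is hypothetical; the config keys are the ones named above, as they existed for Spark 1.x on YARN):
{code}
import org.apache.spark.SparkConf

// Hypothetical configuration mirroring the values described above.
val conf = new SparkConf()
  .setAppName("shuffle-test")                         // hypothetical name
  .set("spark.executor.memory", "7g")                 // executor heap
  .set("spark.yarn.executor.memoryOverhead", "1024")  // extra MB requested for off-heap use

// Requested YARN container size ~= 7 GB heap + 1 GB overhead = 8 GB, which matches
// the "8.3 GB of 8 GB physical memory used" limit in the log quoted above.
{code}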
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201783#comment-14201783 ] Aaron Davidson commented on SPARK-2468: --- Thanks a lot for those diagnostics. Can you confirm that spark.shuffle.io.preferDirectBufs does show up in the UI as being set properly? Does your workload mainly involve a large shuffle? How big is each partition/how many are there? In addition to the netty buffers (which _should_ be disabled by the config), we also memory map shuffle blocks larger than 2MB.
[jira] [Created] (SPARK-4295) [External]Exception throws in SparkSinkSuite although all test cases pass
maji2014 created SPARK-4295: --- Summary: [External]Exception throws in SparkSinkSuite although all test cases pass Key: SPARK-4295 URL: https://issues.apache.org/jira/browse/SPARK-4295 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: maji2014 Priority: Minor After the first test case, all other test cases throw javax.management.InstanceAlreadyExistsException: org.apache.flume.channel:type=null , exception as followings: 14/11/07 00:24:51 ERROR MonitoredCounterGroup: Failed to register monitored counter group for type: CHANNEL, name: null javax.management.InstanceAlreadyExistsException: org.apache.flume.channel:type=null at com.sun.jmx.mbeanserver.Repository.addMBean(Repository.java:437) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerWithRepository(DefaultMBeanServerInterceptor.java:1898) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(DefaultMBeanServerInterceptor.java:966) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(DefaultMBeanServerInterceptor.java:900) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:324) at com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:522) at org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:108) at org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:88) at org.apache.flume.channel.MemoryChannel.start(MemoryChannel.java:345) at org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$2.apply$mcV$sp(SparkSinkSuite.scala:63) at org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$2.apply(SparkSinkSuite.scala:61) at org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$2.apply(SparkSinkSuite.scala:61) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.scalatest.FunSuite.run(FunSuite.scala:1555) at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:55) at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2563) at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2557) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:2557) at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1044) at
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201812#comment-14201812 ] Aaron Davidson commented on SPARK-2468: --- Looking at the netty code a bit more, it seems that Netty might unconditionally allocate direct buffers for IO, whether or not direct is preferred. Additionally, it allocates more memory based on the number of cores in your system. The default settings would be roughly 16MB per core, and this might be multiplied by 2 in our current setup since we have independent client and server pools in the same JVM. I'm not certain how executors running in YARN report availableProcessors, but is it possible your machines have 32 or more cores? That could cause an extra allocation of around 1GB of direct buffers.
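A back-of-the-envelope version of that estimate; the 16MB-per-core figure and the factor of two for separate client/server pools are taken from the comment above and should be treated as rough assumptions rather than exact Netty accounting:
{code}
// Rough estimate of Netty direct-buffer usage per executor JVM.
val perCoreBufferMiB = 16   // approximate per-core allocation cited above
val pools = 2               // independent client and server pools in one JVM

def estimatedDirectMiB(cores: Int): Int = cores * perCoreBufferMiB * pools

println(estimatedDirectMiB(32))  // 1024 MiB, i.e. roughly the ~1GB mentioned above
{code}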
[jira] [Commented] (SPARK-4289) Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.
[ https://issues.apache.org/jira/browse/SPARK-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201827#comment-14201827 ] Sean Owen commented on SPARK-4289: -- This is a Hadoop issue, right? I don't know if Spark can address this directly. I suppose you could work around it with :silent in the shell.

Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.
Key: SPARK-4289
URL: https://issues.apache.org/jira/browse/SPARK-4289
Project: Spark
Issue Type: Bug
Reporter: Corey J. Nolet

This one is easy to reproduce. {code}val job = new Job(sc.hadoopConfiguration){code} I'm not sure what the solution would be off hand as it's happening when the shell is calling toString() on the instance of Job. The problem is, because of the failure, the instance is never actually assigned to the job val. java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283) at org.apache.hadoop.mapreduce.Job.toString(Job.java:452) at scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324) at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329) at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337) at .init(console:10) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
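A hedged sketch of the :silent workaround mentioned above; :silent toggles the REPL's result printing, so toString() is never invoked on the returned Job (toggle it back on afterwards):
{code}
scala> :silent
scala> val job = new org.apache.hadoop.mapreduce.Job(sc.hadoopConfiguration)
scala> :silent
{code}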
[jira] [Updated] (SPARK-4288) Add Sparse Autoencoder algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-4288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4288: - Description: Are you proposing an implementation? Is it related to the neural network JIRA? Target Version/s: (was: 1.3.0) Issue Type: Wish (was: Bug) Add Sparse Autoencoder algorithm to MLlib -- Key: SPARK-4288 URL: https://issues.apache.org/jira/browse/SPARK-4288 Project: Spark Issue Type: Wish Components: MLlib Reporter: Guoqiang Li Labels: features Are you proposing an implementation? Is it related to the neural network JIRA? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201832#comment-14201832 ] Aaron Davidson commented on SPARK-2468: --- [~lianhuiwang] I have created [#3155|https://github.com/apache/spark/pull/3155/files], which I will clean up and try to get in tomorrow, and which makes the preferDirectBufs config forcefully disable direct byte buffers in both the server and client pools. Additionally, I have added the conf spark.shuffle.io.maxUsableCores, which should allow you to inform the executor how many cores you're actually using, so it will avoid allocating enough memory for all the machine's cores. I hope that simply specifying maxUsableCores is sufficient to actually fix this issue for you, but the combination should give a higher chance of success.
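As a concrete illustration, a hedged sketch of the two settings discussed here; spark.shuffle.io.preferDirectBufs is an existing config, while spark.shuffle.io.maxUsableCores is the new one proposed in that PR, so treat it as tentative until the PR is merged:
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.io.preferDirectBufs", "false")  // avoid direct buffers in the Netty pools
  .set("spark.shuffle.io.maxUsableCores", "5")        // proposed: tell the transfer service how many cores the executor really uses
{code}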
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201844#comment-14201844 ] zzc commented on SPARK-2468: By the way, my test code:
{code}
val mapR = textFile.map(line => {
  ..
  ((value(1) + "_" + date.toString(), url), (flow, 1))
}).reduceByKey((pair1, pair2) => {
  (pair1._1 + pair2._1, pair1._2 + pair2._2)
}, 100)
mapR.persist(StorageLevel.MEMORY_AND_DISK_SER)

val mapR1 = mapR.groupBy(_._1._1)
  .mapValues(pairs => { pairs.toList.sortBy(_._2._1).reverse })
  .flatMap(values => { values._2 })
  .map(values => { values._1._1 + "\t" + values._1._2 + "\t" + values._2._1.toString() + "\t" + values._2._2.toString() })
  .saveAsTextFile(outputPath + "_1/")

val mapR2 = mapR.groupBy(_._1._1)
  .mapValues(pairs => { pairs.toList.sortBy(_._2._2).reverse })
  .flatMap(values => { values._2 })
  .map(values => { values._1._1 + "\t" + values._1._2 + "\t" + values._2._1.toString() + "\t" + values._2._2.toString() })
  .saveAsTextFile(outputPath + "_2/")
{code}
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201849#comment-14201849 ] Aaron Davidson commented on SPARK-2468: --- [~zzcclp] Thank you for the writeup. Is it really the case that each of your executors is only using 1 core for its 20GB of RAM? It seems like 5 would be in line with the portion of memory you're using. Also, the sum of your storage and shuffle memory fractions exceeds 1, so if you're caching any data and then performing a reduction/groupBy, you could actually see an OOM even without this other issue. I would recommend keeping the shuffle fraction relatively low unless you have a good reason not to, as raising it can lead to increased instability. The numbers are relatively close to my expectations, which would estimate netty allocating around 750MB of direct buffer space, thinking that it has 24 cores. With #3155 and maxUsableCores set to 1 (or 5), I hope this issue may be resolved.
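For context, a hedged sketch of the two memory fractions being discussed; the keys are the Spark 1.x config names mentioned later in this thread, and the 0.6/0.5 values are illustrative of a sum that exceeds 1, not the reporter's exact settings:
{code}
import org.apache.spark.SparkConf

// Illustrative only: 0.6 + 0.5 > 1.0, so cached blocks plus shuffle aggregation
// buffers can together claim more than the whole heap, which is the instability
// warned about above.
val conf = new SparkConf()
  .set("spark.storage.memoryFraction", "0.6")
  .set("spark.shuffle.memoryFraction", "0.5")
{code}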
[jira] [Commented] (SPARK-4275) ./sbt/sbt assembly command fails if path has space in the name
[ https://issues.apache.org/jira/browse/SPARK-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201851#comment-14201851 ] jiezhou commented on SPARK-4275: I ran ./sbt/sbt assembly on my Mac and the error message is as follows:
usage: dirname path
./sbt/sbt: line 31: /sbt-launch-lib.bash: No such file or directory
./sbt/sbt: line 111: run: command not found
Obviously the space in the path breaks the dirname invocation.

./sbt/sbt assembly command fails if path has space in the name
Key: SPARK-4275
URL: https://issues.apache.org/jira/browse/SPARK-4275
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.1.0
Reporter: Ravi Kiran
Priority: Trivial

I have downloaded branch-1.1 for building Spark from scratch on my Mac. The path had a space in it, like /Users/rkgurram/VirtualBox VMs/SPARK/spark-branch-1.1. 1) I cd to the above directory 2) Ran ./sbt/sbt assembly The command fails with weird messages.
[jira] [Commented] (SPARK-4283) Spark source code does not correctly import into eclipse
[ https://issues.apache.org/jira/browse/SPARK-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201853#comment-14201853 ] Sean Owen commented on SPARK-4283: -- This is really an Eclipse problem. I don't personally think it's worth the extra weight in the build for this. (Use pull requests, not patches on JIRAs, in Spark.)

Spark source code does not correctly import into Eclipse
Key: SPARK-4283
URL: https://issues.apache.org/jira/browse/SPARK-4283
Project: Spark
Issue Type: Bug
Components: Build
Reporter: Yang Yang
Priority: Minor
Attachments: spark_eclipse.diff

When I import the Spark source into Eclipse, either by running mvn eclipse:eclipse and then importing existing general projects, or by importing existing Maven projects, it does not recognize the project as a Scala project. I am adding a new plugin so that the import works.
[jira] [Commented] (SPARK-4275) ./sbt/sbt assembly command fails if path has space in the name
[ https://issues.apache.org/jira/browse/SPARK-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201855#comment-14201855 ] Apache Spark commented on SPARK-4275: - User 'shuhuai007' has created a pull request for this issue: https://github.com/apache/spark/pull/3156
[jira] [Resolved] (SPARK-4275) ./sbt/sbt assembly command fails if path has space in the name
[ https://issues.apache.org/jira/browse/SPARK-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4275. -- Resolution: Duplicate You should report issues against head in general, rather than an older branch. This was already fixed in https://issues.apache.org/jira/browse/SPARK-3337
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201865#comment-14201865 ] zzc commented on SPARK-2468: Hi Aaron Davidson, what do you mean by "Is it really the case that each of your executors is only using 1 core for its 20GB of RAM? It seems like 5 would be in line with the portion of memory you're using"? I tried setting spark.storage.memoryFraction and spark.shuffle.memoryFraction from 0.2 to 0.5 before, and the OOM still occurred.
[jira] [Commented] (SPARK-4275) ./sbt/sbt assembly command fails if path has space in the name
[ https://issues.apache.org/jira/browse/SPARK-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201876#comment-14201876 ] Ravi Kiran commented on SPARK-4275: --- Scott, Thank you, will follow the advise, I am new to the Spark ecosystem and just getting my feet wet. Regards -Ravi
[jira] [Comment Edited] (SPARK-4275) ./sbt/sbt assembly command fails if path has space in the name
[ https://issues.apache.org/jira/browse/SPARK-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201876#comment-14201876 ] Ravi Kiran edited comment on SPARK-4275 at 11/7/14 10:13 AM: - Sean, Thank you, will follow the advise, I am new to the Spark ecosystem and just getting my feet wet. Regards -Ravi was (Author: rkgurram): Scott, Thank you, will follow the advise, I am new to the Spark ecosystem and just getting my feet wet. Regards -Ravi
[jira] [Created] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause
Shixiong Zhu created SPARK-4296:
---
Summary: Throw Expression not in GROUP BY when using same expression in group by clause and select clause
Key: SPARK-4296
URL: https://issues.apache.org/jira/browse/SPARK-4296
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.1.0
Reporter: Shixiong Zhu

When the input data has a complex structure, using the same expression in the group by clause and the select clause will throw "Expression not in GROUP BY".
{code:java}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

case class Birthday(date: String)
case class Person(name: String, birthday: Birthday)

val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), Person("Jim", Birthday("1980-02-28"))))
people.registerTempTable("people")

val year = sqlContext.sql("select count(*), upper(birthday.date) from people group by upper(birthday.date)")
year.collect
{code}
Here is the plan of year:
{code:java}
SchemaRDD[3] at RDD at SchemaRDD.scala:105
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date AS date#9) AS c1#3]
 Subquery people
  LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:36
{code}
The bug is the equality test for `Upper(birthday#1.date)` and `Upper(birthday#1.date AS date#9)`. Maybe Spark SQL needs a mechanism to compare an Alias expression and a non-Alias expression.
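One possible way to sidestep the alias-equality issue while it is open might be to compute the expression once in a subquery and group by the resulting plain attribute; this is an untested sketch against the same reproduction above, not a confirmed fix:
{code:java}
// Untested workaround sketch: move upper(birthday.date) into a subquery so the
// outer GROUP BY references a plain attribute rather than an aliased expression.
val year = sqlContext.sql(
  "select count(*), d from (select upper(birthday.date) as d from people) t group by d")
year.collect
{code}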
[jira] [Comment Edited] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201934#comment-14201934 ] Shixiong Zhu edited comment on SPARK-4296 at 11/7/14 11:21 AM: --- Stack trace: {code:java} org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: Upper(birthday#11.date AS date#17) AS c1#13, tree: Aggregate [Upper(birthday#11.date)], [COUNT(1) AS c0#12L,Upper(birthday#11.date AS date#17) AS c1#13] Subquery people LogicalRDD [name#10,birthday#11], MapPartitionsRDD[5] at mapPartitions at ExistingRDD.scala:36 at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:133) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:130) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:130) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:115) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:115) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:113) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) {code} was (Author: zsxwing): Stack trace: {code:java} Aggregate [Upper(birthday#11.date)], [COUNT(1) AS c0#12L,Upper(birthday#11.date AS date#17) AS c1#13] Subquery people LogicalRDD [name#10,birthday#11], MapPartitionsRDD[5] at mapPartitions at ExistingRDD.scala:36 at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:133) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:130) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:130) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:115) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:115) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:113) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) {code} Throw Expression not in GROUP BY when using same expression in group by clause and select clause --- Key: SPARK-4296 URL: https://issues.apache.org/jira/browse/SPARK-4296 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Shixiong Zhu
[jira] [Commented] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201934#comment-14201934 ] Shixiong Zhu commented on SPARK-4296: - Stack trace:
{code:java}
Aggregate [Upper(birthday#11.date)], [COUNT(1) AS c0#12L,Upper(birthday#11.date AS date#17) AS c1#13]
 Subquery people
  LogicalRDD [name#10,birthday#11], MapPartitionsRDD[5] at mapPartitions at ExistingRDD.scala:36
        at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:133)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:130)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:130)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:115)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:115)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:113)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
        at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
        at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
        at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
        at scala.collection.immutable.List.foreach(List.scala:318)
{code}
[jira] [Commented] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201937#comment-14201937 ] Shixiong Zhu commented on SPARK-4296: - Originally reported by Tridib Samanta at http://apache-spark-user-list.1001560.n3.nabble.com/sql-group-by-on-UDF-not-working-td18339.html
[jira] [Commented] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
[ https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201959#comment-14201959 ] Tsuyoshi OZAWA commented on SPARK-4267: --- [~sandyr] [~pwendell] do you have any workarounds to deal with this problem?

Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
Key: SPARK-4267
URL: https://issues.apache.org/jira/browse/SPARK-4267
Project: Spark
Issue Type: Bug
Reporter: Tsuyoshi OZAWA

Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 uses protobuf 2.5.0, so I compiled with protobuf 2.5.0 like this:
{code}
./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0
{code}
Then Spark on YARN fails to launch jobs with an NPE.
{code}
$ bin/spark-shell --master yarn-client
scala> sc.textFile("hdfs:///user/ozawa/wordcountInput20G").flatMap(line => line.split(" ")).map(word => (word, 1)).persist().reduceByKey((a, b) => a + b, 16).saveAsTextFile("hdfs:///user/ozawa/sparkWordcountOutNew2");
java.lang.NullPointerException
        at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284)
        at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291)
        at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480)
        at $iwC$$iwC$$iwC$$iwC.init(console:13)
        at $iwC$$iwC$$iwC.init(console:18)
        at $iwC$$iwC.init(console:20)
        at $iwC.init(console:22)
        at init(console:24)
        at .init(console:28)
        at .clinit(console)
        at .init(console:7)
        at .clinit(console)
        at $print(console)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
{code}
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201990#comment-14201990 ] Yu Ishikawa commented on SPARK-2429: Hi [~rnowling], I have a suggestion about a new function. I think it is difficult for this algorithm to have an advantage in computational complexity, so I implemented a function that cuts the resulting cluster tree by height. This function restructures a cluster tree without changing the original tree. We can control the number of clusters in a cluster tree by height, without recomputation. This is an advantage over KMeans and other clustering algorithms. You can see test code at the URL below. [https://github.com/yu-iskw/spark/blob/8355f959f02ca67454c9cb070912480db0a44671/mllib/src/test/scala/org/apache/spark/mllib/clustering/HierarchicalClusteringModelSuite.scala#L116]

Hierarchical Implementation of KMeans
Key: SPARK-2429
URL: https://issues.apache.org/jira/browse/SPARK-2429
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: RJ Nowling
Assignee: Yu Ishikawa
Priority: Minor
Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The Result of Benchmarking a Hierarchical Clustering.pdf, benchmark-result.2014-10-29.html, benchmark2.html

Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment. Discussion on the dev list suggested the following possible approaches:
* Top down, recursive application of KMeans
* Reuse DecisionTree implementation with different objective function
* Hierarchical SVD
It was also suggested that support for distance metrics other than Euclidean, such as negative dot or cosine, is necessary.
[jira] [Created] (SPARK-4297) Build warning fixes omnibus
Sean Owen created SPARK-4297: Summary: Build warning fixes omnibus Key: SPARK-4297 URL: https://issues.apache.org/jira/browse/SPARK-4297 Project: Spark Issue Type: Improvement Components: Build, Java API Affects Versions: 1.1.0 Reporter: Sean Owen Priority: Minor There are a number of warnings generated in a normal, successful build right now. They're mostly Java unchecked cast warnings, which can be suppressed. But there's a grab bag of other Scala language warnings and so on that can all be easily fixed. The forthcoming PR fixes about 90% of the build warnings I see now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4297) Build warning fixes omnibus
[ https://issues.apache.org/jira/browse/SPARK-4297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202030#comment-14202030 ] Apache Spark commented on SPARK-4297: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/3157
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202056#comment-14202056 ] Lianhui Wang commented on SPARK-2468: - [~adav] Yes, with https://github.com/apache/spark/pull/3155/ it does not happen in my test, but I found that Netty's performance is not as good as NioBlockTransferService, so I need to find out why Netty performs worse than NioBlockTransferService in my test. Can you give me some suggestions? Thanks. And how about your test? [~zzcclp]
[jira] [Comment Edited] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202056#comment-14202056 ] Lianhui Wang edited comment on SPARK-2468 at 11/7/14 2:01 PM: -- [~adav] yes,with https://github.com/apache/spark/pull/3155/ in my test beyond physical memory limits does not happened.but i discover that Netty's performance is not good than NioBlockTransferService. so I need to find why Netty's performance is bad than NioBlockTransferService in my test.Can you give me some suggestions? thanks.and how about your test? [~zzcclp] was (Author: lianhuiwang): [~adav] yes,with https://github.com/apache/spark/pull/3155/ in my test it does not happened.but i discover that Netty's performance is not good than NioBlockTransferService. so I need to find why Netty's performance is bad than NioBlockTransferService in my test.Can you give me some suggestions? thanks.and how about your test? [~zzcclp] Netty-based block server / client module Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0 Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user space buffer, and then back to a kernel send buffer before it reaches the NIC. It does multiple copies of the data and context switching between kernel/user. It also creates unnecessary buffer in the JVM that increases GC Instead, we should use FileChannel.transferTo, which handles this in the kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4298) The spark-submit cannot read Main-Class from Manifest.
Milan Straka created SPARK-4298: --- Summary: The spark-submit cannot read Main-Class from Manifest. Key: SPARK-4298 URL: https://issues.apache.org/jira/browse/SPARK-4298 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: Linux spark-1.1.0-bin-hadoop2.4.tgz java version 1.7.0_72 Java(TM) SE Runtime Environment (build 1.7.0_72-b14) Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode) Reporter: Milan Straka Consider trivial {{test.scala}}: {code:title=test.scala|borderStyle=solid} import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object Main { def main(args: Array[String]) { val sc = new SparkContext() sc.stop() } } {code} When built with {{sbt}} and executed using {{spark-submit target/scala-2.10/test_2.10-1.0.jar}}, I get the following error: {code} Spark assembly has been built with Hive, including Datanucleus jars on classpath Error: Cannot load main class from JAR: file:/ha/home/straka/s/target/scala-2.10/test_2.10-1.0.jar Run with --help for usage help or --verbose for debug output {code} When executed using {{spark-submit --class Main target/scala-2.10/test_2.10-1.0.jar}}, it works. The jar file has correct MANIFEST.MF: {code:title=MANIFEST.MF|borderStyle=solid} Manifest-Version: 1.0 Implementation-Vendor: test Implementation-Title: test Implementation-Version: 1.0 Implementation-Vendor-Id: test Specification-Vendor: test Specification-Title: test Specification-Version: 1.0 Main-Class: Main {code} The problem is that in {{org.apache.spark.deploy.SparkSubmitArguments}}, line 127: {code} val jar = new JarFile(primaryResource) {code} the primaryResource has String value {{file:/ha/home/straka/s/target/scala-2.10/test_2.10-1.0.jar}}, which is URI, but JarFile can use only Path. One way to fix this would be using {code} val uri = new URI(primaryResource) val jar = new JarFile(uri.getPath) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
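A minimal, self-contained sketch of the fix proposed above (the object name and the jar path are made up for illustration): java.util.jar.JarFile expects a filesystem path, so a "file:/..." string has to be converted through java.net.URI first.
{code}
import java.net.URI
import java.util.jar.JarFile

object MainClassLookup {
  def mainClassOf(primaryResource: String): Option[String] = {
    // JarFile("file:/path/to.jar") fails; strip the URI scheme first.
    val path =
      if (primaryResource.startsWith("file:")) new URI(primaryResource).getPath
      else primaryResource
    val jar = new JarFile(path)
    try {
      Option(jar.getManifest).flatMap(m => Option(m.getMainAttributes.getValue("Main-Class")))
    } finally {
      jar.close()
    }
  }

  def main(args: Array[String]): Unit =
    // Assumed example jar; with the manifest shown above this prints Some(Main).
    println(mainClassOf("file:/tmp/test_2.10-1.0.jar"))
}
{code}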
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202136#comment-14202136 ] zzc commented on SPARK-2468: The performance of Netty is worse than NIO in my test. Why? @Aaron Davidson. I want to improve the performance of shuffle; with 500G of shuffle data, it is much worse than Hadoop. Netty-based block server / client module Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0 Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user space buffer, and then back to a kernel send buffer before it reaches the NIC. It does multiple copies of the data and context switching between kernel/user. It also creates unnecessary buffer in the JVM that increases GC Instead, we should use FileChannel.transferTo, which handles this in the kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4205) Timestamp and Date objects with comparison operators
[ https://issues.apache.org/jira/browse/SPARK-4205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202251#comment-14202251 ] Apache Spark commented on SPARK-4205: - User 'culler' has created a pull request for this issue: https://github.com/apache/spark/pull/3158 Timestamp and Date objects with comparison operators Key: SPARK-4205 URL: https://issues.apache.org/jira/browse/SPARK-4205 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Marc Culler Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4300) Race condition during SparkWorker shutdown
Alex Liu created SPARK-4300: --- Summary: Race condition during SparkWorker shutdown Key: SPARK-4300 URL: https://issues.apache.org/jira/browse/SPARK-4300 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.1.0 Reporter: Alex Liu Priority: Minor When a shark job is done. there are some error message as following show in the log {code} INFO 22:10:41,635 SparkMaster: akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got disassociated, removing it. INFO 22:10:41,640 SparkMaster: Removing app app-20141106221014- INFO 22:10:41,687 SparkMaster: Removing application Shark::ip-172-31-11-204.us-west-1.compute.internal INFO 22:10:41,710 SparkWorker: Asked to kill executor app-20141106221014-/0 INFO 22:10:41,712 SparkWorker: Runner thread for executor app-20141106221014-/0 interrupted INFO 22:10:41,714 SparkWorker: Killing process! ERROR 22:10:41,738 SparkWorker: Error writing stream to file /var/lib/spark/work/app-20141106221014-/0/stdout ERROR 22:10:41,739 SparkWorker: java.io.IOException: Stream closed ERROR 22:10:41,739 SparkWorker: at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162) ERROR 22:10:41,740 SparkWorker: at java.io.BufferedInputStream.read1(BufferedInputStream.java:272) ERROR 22:10:41,740 SparkWorker: at java.io.BufferedInputStream.read(BufferedInputStream.java:334) ERROR 22:10:41,740 SparkWorker: at java.io.FilterInputStream.read(FilterInputStream.java:107) ERROR 22:10:41,741 SparkWorker: at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) ERROR 22:10:41,741 SparkWorker: at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) ERROR 22:10:41,741 SparkWorker: at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) ERROR 22:10:41,742 SparkWorker: at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) ERROR 22:10:41,742 SparkWorker: at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) ERROR 22:10:41,742 SparkWorker: at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) INFO 22:10:41,838 SparkMaster: Connected to Cassandra cluster: 4299 INFO 22:10:41,839 SparkMaster: Adding host 172.31.11.204 (Analytics) INFO 22:10:41,840 SparkMaster: New Cassandra host /172.31.11.204:9042 added INFO 22:10:41,841 SparkMaster: Adding host 172.31.11.204 (Analytics) INFO 22:10:41,842 SparkMaster: Adding host 172.31.11.204 (Analytics) INFO 22:10:41,852 SparkMaster: akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got disassociated, removing it. INFO 22:10:41,853 SparkMaster: akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got disassociated, removing it. INFO 22:10:41,853 SparkMaster: akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got disassociated, removing it. INFO 22:10:41,857 SparkMaster: akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got disassociated, removing it. INFO 22:10:41,862 SparkMaster: Adding host 172.31.11.204 (Analytics) WARN 22:10:42,200 SparkMaster: Got status update for unknown executor app-20141106221014-/0 INFO 22:10:42,211 SparkWorker: Executor app-20141106221014-/0 finished with state KILLED exitStatus 143 {code} /var/lib/spark/work/app-20141106221014-/0/stdout is on the disk. It is trying to write to a close IO stream. 
The Spark worker shuts the executor down with:
{code}
private def killProcess(message: Option[String]) {
  var exitCode: Option[Int] = None
  logInfo("Killing process!")
  process.destroy()
  process.waitFor()
  if (stdoutAppender != null) {
    stdoutAppender.stop()
  }
  if (stderrAppender != null) {
    stderrAppender.stop()
  }
  if (process != null) {
    exitCode = Some(process.waitFor())
  }
  worker ! ExecutorStateChanged(appId, execId, state, message, exitCode)
}
{code}
But stdoutAppender is still writing to the output log file concurrently, which creates the race condition. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
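One way to avoid the race, sketched below (illustrative only, not the patch that was actually merged): let the child process exit before stopping the appenders, so the FileAppender threads are not still draining stdout/stderr when their streams are closed.
{code}
private def killProcess(message: Option[String]) {
  var exitCode: Option[Int] = None
  if (process != null) {
    logInfo("Killing process!")
    process.destroy()
    exitCode = Some(process.waitFor()) // wait for the process to die first
  }
  // Stop the appenders only after the process has exited and its
  // stdout/stderr streams have been fully drained.
  if (stdoutAppender != null) { stdoutAppender.stop() }
  if (stderrAppender != null) { stderrAppender.stop() }
  worker ! ExecutorStateChanged(appId, execId, state, message, exitCode)
}
{code}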
[jira] [Commented] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
[ https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202319#comment-14202319 ] Sandy Ryza commented on SPARK-4267: --- Strange. Checked in the code and it seems like this must mean the taskScheduler is null. Did you see any errors farther up in the shell before this happened? Does it work in local mode? Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later -- Key: SPARK-4267 URL: https://issues.apache.org/jira/browse/SPARK-4267 Project: Spark Issue Type: Bug Reporter: Tsuyoshi OZAWA Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 uses protobuf 2.5.0 so I compiled with protobuf 2.5.1 like this: {code} ./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0 {code} Then Spark on YARN fails to launch jobs with NPE. {code} $ bin/spark-shell --master yarn-client scala sc.textFile(hdfs:///user/ozawa/wordcountInput20G).flatMap(line = line.split( )).map(word = (word, 1)).persist().reduceByKey((a, b) = a + b, 16).saveAsTextFile(hdfs:///user/ozawa/sparkWordcountOutNew2); java.lang.NullPointerException at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284) at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291) at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480) at $iwC$$iwC$$iwC$$iwC.init(console:13) at $iwC$$iwC$$iwC.init(console:18) at $iwC$$iwC.init(console:20) at $iwC.init(console:22) at init(console:24) at .init(console:28) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
[jira] [Commented] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202326#comment-14202326 ] Tridib Samanta commented on SPARK-4296: --- I wish we could use the alias of a calculated column in the group by clause, which would avoid having to repeat long calculated expressions. Throw Expression not in GROUP BY when using same expression in group by clause and select clause --- Key: SPARK-4296 URL: https://issues.apache.org/jira/browse/SPARK-4296 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Shixiong Zhu When the input data has a complex structure, using the same expression in the group by clause and the select clause will throw Expression not in GROUP BY.
{code:java}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Birthday(date: String)
case class Person(name: String, birthday: Birthday)
val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), Person("Jim", Birthday("1980-02-28"))))
people.registerTempTable("people")
val year = sqlContext.sql("select count(*), upper(birthday.date) from people group by upper(birthday.date)")
year.collect
{code}
Here is the plan of year:
{code:java}
SchemaRDD[3] at RDD at SchemaRDD.scala:105
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date AS date#9) AS c1#3]
 Subquery people
  LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:36
{code}
The bug is the equality test for `Upper(birthday#1.date)` and `Upper(birthday#1.date AS date#9)`. Maybe Spark SQL needs a mechanism to compare Alias expressions and non-Alias expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
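To make that last point concrete, here is a schematic, self-contained sketch (toy expression classes, not Catalyst's real ones) of the alias-insensitive comparison the description asks for: strip Alias wrappers from both expressions before testing equality, so the grouping expression and its aliased copy in the select list match.
{code}
sealed trait Expr
case class Attr(name: String) extends Expr
case class Alias(child: Expr, name: String) extends Expr
case class Upper(child: Expr) extends Expr

object AliasInsensitive {
  def stripAliases(e: Expr): Expr = e match {
    case Alias(child, _) => stripAliases(child)
    case Upper(child)    => Upper(stripAliases(child))
    case other           => other
  }

  def semanticallyEqual(a: Expr, b: Expr): Boolean =
    stripAliases(a) == stripAliases(b)

  def main(args: Array[String]): Unit = {
    val groupExpr  = Upper(Attr("birthday.date"))
    val selectExpr = Upper(Alias(Attr("birthday.date"), "date#9"))
    println(semanticallyEqual(groupExpr, selectExpr)) // true
  }
}
{code}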
[jira] [Commented] (SPARK-4280) In dynamic allocation, add option to never kill executors with cached blocks
[ https://issues.apache.org/jira/browse/SPARK-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202382#comment-14202382 ] Sandy Ryza commented on SPARK-4280: --- So it looks like the block IDs of broadcast variables on each node are the same broadcast IDs used on the driver. Which means it wouldn't be too hard to do this filtering. Even without it, this would still be useful. What do you think? In dynamic allocation, add option to never kill executors with cached blocks Key: SPARK-4280 URL: https://issues.apache.org/jira/browse/SPARK-4280 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza Even with the external shuffle service, this is useful in situations like Hive on Spark where a query might require caching some data. We want to be able to give back executors after the job ends, but not during the job if it would delete intermediate results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4216) Eliminate duplicate Jenkins GitHub posts from AMPLab
[ https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202420#comment-14202420 ] Josh Rosen commented on SPARK-4216: --- A large part of the problem is that the Jenkins GHPRB plugin has a lot of settings that are global rather than per-project. In this case, I think the duplicate postings are being generated by the fall back on posting comments in case the GitHub commit status API call fails. We can't use the status API in Spark, but I guess the other AMP Lab projects used to use it and didn't require this fallback. At some point, I think we switched the comment fallback on because some other project needed it, leading to these duplicate updates. As I've commented elsewhere, one solution would be to simply not use the GHPRB plugin for Spark and instead use a parameterized build that's triggered remotely (e.g. through spark-prs.appspot.com). I think that we could easily build this layer on top of spark-prs; it's just a matter of finding the time to do it (and to add the necessary features, like automatic detection of when new commits have been pushed, listening to commands addressed to Jenkins, etc.) I already have the triggering working manually (this runs NewSparkPullRequestBuilder), so the only remaining piece is the automatic triggering / ACLs. Eliminate duplicate Jenkins GitHub posts from AMPLab Key: SPARK-4216 URL: https://issues.apache.org/jira/browse/SPARK-4216 Project: Spark Issue Type: Bug Components: Build, Project Infra Reporter: Nicholas Chammas Priority: Minor * [Real Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873361] * [Imposter Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873366] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4216) Eliminate duplicate Jenkins GitHub posts from AMPLab
[ https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202420#comment-14202420 ] Josh Rosen edited comment on SPARK-4216 at 11/7/14 6:38 PM: A large part of the problem is that the Jenkins GHPRB plugin has a lot of settings that are global rather than per-project. In this case, I think the duplicate postings are being generated by the fall back on posting comments in case the GitHub commit status API call fails setting. We can't use the status API in Spark, but I guess the other AMP Lab projects used to use it and didn't require this fallback. At some point, I think we switched the comment fallback on because some other project needed it, leading to these duplicate updates. As I've commented elsewhere, one solution would be to simply not use the GHPRB plugin for Spark and instead use a parameterized build that's triggered remotely (e.g. through spark-prs.appspot.com). I think that we could easily build this layer on top of spark-prs; it's just a matter of finding the time to do it (and to add the necessary features, like automatic detection of when new commits have been pushed, listening to commands addressed to Jenkins, etc.) I already have the triggering working manually (this runs NewSparkPullRequestBuilder), so the only remaining piece is the automatic triggering / ACLs. was (Author: joshrosen): A large part of the problem is that the Jenkins GHPRB plugin has a lot of settings that are global rather than per-project. In this case, I think the duplicate postings are being generated by the fall back on posting comments in case the GitHub commit status API call fails. We can't use the status API in Spark, but I guess the other AMP Lab projects used to use it and didn't require this fallback. At some point, I think we switched the comment fallback on because some other project needed it, leading to these duplicate updates. As I've commented elsewhere, one solution would be to simply not use the GHPRB plugin for Spark and instead use a parameterized build that's triggered remotely (e.g. through spark-prs.appspot.com). I think that we could easily build this layer on top of spark-prs; it's just a matter of finding the time to do it (and to add the necessary features, like automatic detection of when new commits have been pushed, listening to commands addressed to Jenkins, etc.) I already have the triggering working manually (this runs NewSparkPullRequestBuilder), so the only remaining piece is the automatic triggering / ACLs. Eliminate duplicate Jenkins GitHub posts from AMPLab Key: SPARK-4216 URL: https://issues.apache.org/jira/browse/SPARK-4216 Project: Spark Issue Type: Bug Components: Build, Project Infra Reporter: Nicholas Chammas Priority: Minor * [Real Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873361] * [Imposter Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873366] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4301) StreamingContext should not allow start() to be called after calling stop()
Josh Rosen created SPARK-4301: - Summary: StreamingContext should not allow start() to be called after calling stop() Key: SPARK-4301 URL: https://issues.apache.org/jira/browse/SPARK-4301 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0, 1.0.2, 1.0.0, 1.2.0 Reporter: Josh Rosen Assignee: Josh Rosen In Spark 1.0.0+, calling {{stop()}} on a StreamingContext that has not been started is a no-op which has no side-effects. This allows users to call {{stop()}} on a fresh StreamingContext followed by {{start()}}. I believe that this almost always indicates an error and is not behavior that we should support. Since we don't allow {{start() stop() start()}} then I don't think it makes sense to allow {{stop() start()}}. The current behavior can lead to resource leaks when StreamingContext constructs its own SparkContext: if I call {{stop(stopSparkContext=True)}}, then I expect StreamingContext's underlying SparkContext to be stopped irrespective of whether the StreamingContext has been started. This is useful when writing unit test fixtures. Prior discussions: - https://github.com/apache/spark/pull/3053#discussion-diff-19710333R490 - https://github.com/apache/spark/pull/3121#issuecomment-61927353 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
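A minimal sketch of the guard being proposed (a simplified lifecycle, not the actual StreamingContext code): track whether stop() has been called and make a subsequent start() fail fast instead of silently proceeding.
{code}
object ContextState extends Enumeration {
  val Initialized, Started, Stopped = Value
}

class GuardedContext {
  private var state = ContextState.Initialized

  def start(): Unit = synchronized {
    state match {
      case ContextState.Initialized => state = ContextState.Started
      case ContextState.Started     => throw new IllegalStateException("Context already started")
      case ContextState.Stopped     => throw new IllegalStateException("Context cannot be started after stop()")
    }
  }

  def stop(): Unit = synchronized {
    // Stopping an un-started context stays allowed so that underlying
    // resources (e.g. a SparkContext it constructed) can still be released.
    state = ContextState.Stopped
  }
}
{code}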
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202490#comment-14202490 ] Aaron Davidson commented on SPARK-2468: --- [~lianhuiwang] Can you try again with preferDirectBufs set to true, and just setting maxUsableCores down to the number of cores each container actually has? It's possible the performance discrepancy you're seeing is simply due to heap byte buffers not being as fast as direct ones. You might also decrease the Java heap size a bit while keeping the container size the same, if _any_ direct memory allocation is causing the container to be killed. [~zzcclp] Same suggestion for you about setting preferDirectBufs to true and setting maxUsableCores down, but I will also perform another round of benchmarking -- it's possible we accidentally introduced a performance regression in the last few patches. Comparing Hadoop vs Spark performance is a different matter. A few suggestions on your setup: You should set executor-cores to 5, so that each executor is actually using 5 cores instead of just 1. You're losing significant parallelism because of this setting, as Spark will only launch 1 task per core on an executor at any given time. Second, groupBy() is inefficient (it's doc was changed recently to reflect this), and should be avoided. I would recommend changing your job to sort the whole RDD using something similar to {code}mapR.map { x = ((x._1._1, x._2._1), x) }.sortByKey(){code}, which would not require that all values for a single group fit in memory. This would still effectively group by x._1._1, but would sort within each group by x._2._1, and would utilize Spark's efficient sorting machinery. Netty-based block server / client module Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0 Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user space buffer, and then back to a kernel send buffer before it reaches the NIC. It does multiple copies of the data and context switching between kernel/user. It also creates unnecessary buffer in the JVM that increases GC Instead, we should use FileChannel.transferTo, which handles this in the kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
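Spelled out, the sort-based grouping suggested above looks roughly like the following (the ">" in "x =>" was lost to JIRA formatting; the sample data is made up, and the real mapR comes from the user's job). Sorting on the composite (group, value) key keeps each group's records together and ordered, without requiring the whole group to fit in memory the way groupBy() does.
{code}
// In the spark-shell; the pair-RDD implicits are needed for sortByKey on 1.x.
import org.apache.spark.SparkContext._

val mapR = sc.parallelize(Seq(
  (("a", 1), (3.0, "x")),
  (("b", 2), (1.0, "y")),
  (("a", 3), (2.0, "z"))
))

// Composite key: group by x._1._1, order within the group by x._2._1.
val sorted = mapR.map { x => ((x._1._1, x._2._1), x) }.sortByKey()
sorted.collect().foreach(println)
{code}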
[jira] [Commented] (SPARK-4301) StreamingContext should not allow start() to be called after calling stop()
[ https://issues.apache.org/jira/browse/SPARK-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202503#comment-14202503 ] Apache Spark commented on SPARK-4301: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/3160 StreamingContext should not allow start() to be called after calling stop() --- Key: SPARK-4301 URL: https://issues.apache.org/jira/browse/SPARK-4301 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0, 1.0.2, 1.1.0, 1.2.0 Reporter: Josh Rosen Assignee: Josh Rosen In Spark 1.0.0+, calling {{stop()}} on a StreamingContext that has not been started is a no-op which has no side-effects. This allows users to call {{stop()}} on a fresh StreamingContext followed by {{start()}}. I believe that this almost always indicates an error and is not behavior that we should support. Since we don't allow {{start() stop() start()}} then I don't think it makes sense to allow {{stop() start()}}. The current behavior can lead to resource leaks when StreamingContext constructs its own SparkContext: if I call {{stop(stopSparkContext=True)}}, then I expect StreamingContext's underlying SparkContext to be stopped irrespective of whether the StreamingContext has been started. This is useful when writing unit test fixtures. Prior discussions: - https://github.com/apache/spark/pull/3053#discussion-diff-19710333R490 - https://github.com/apache/spark/pull/3121#issuecomment-61927353 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3337) Paranoid quoting in shell to allow install dirs with spaces within.
[ https://issues.apache.org/jira/browse/SPARK-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3337: -- Fix Version/s: (was: 1.1.1) 1.2.0 Looks like the Fix versions are wrong here, since this patch only made it into master / 1.2.0, so I'm removing 1.1.1 as a Fix version and adding 1.2.0. Paranoid quoting in shell to allow install dirs with spaces within. --- Key: SPARK-3337 URL: https://issues.apache.org/jira/browse/SPARK-3337 Project: Spark Issue Type: Improvement Components: Build, Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Prashant Sharma Assignee: Prashant Sharma Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4216) Eliminate duplicate Jenkins GitHub posts from AMPLab
[ https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202518#comment-14202518 ] shane knapp commented on SPARK-4216: yep, not running ghprb for spark is a totally legit option as well (which i'd forgotten about -- this was something we'd spoken about josh). just be aware that you're adding a new layer of tooling, which is fine, but it will need to be documented, reviewed and support. i can help support amplab-based stuff (ie: things on our end), but once we're adding in things like remote triggers from appspot, i'll need to draw a support line. :) @nicholas -- those example you showed me are from when the amplab jenkins bot was broken, and not posting. btw, i turned down the number of amplab jenkins bot posts a while back to a minimum, so as not to spam spark builds. so, we: 1) we carry on w/the duplicate postings (annoying, but not dangerous) 2) spark starts using it's own bot/trigger system (needs a lot of work) (1) for now, (2) when you guys can find some time to make it happen? Eliminate duplicate Jenkins GitHub posts from AMPLab Key: SPARK-4216 URL: https://issues.apache.org/jira/browse/SPARK-4216 Project: Spark Issue Type: Bug Components: Build, Project Infra Reporter: Nicholas Chammas Priority: Minor * [Real Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873361] * [Imposter Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873366] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4225) jdbc/odbc error when using maven build spark
[ https://issues.apache.org/jira/browse/SPARK-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4225. - Resolution: Fixed Issue resolved by pull request 3105 [https://github.com/apache/spark/pull/3105] jdbc/odbc error when using maven build spark Key: SPARK-4225 URL: https://issues.apache.org/jira/browse/SPARK-4225 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 1.1.0 Reporter: wangfei Assignee: Cheng Lian Priority: Blocker Fix For: 1.2.0 use command as follows to build spark mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.1 -Phive -DskipTests clean package then use beeline to connect to thrift server ,get this error: 14/11/04 11:30:31 INFO ObjectStore: Initialized ObjectStore 14/11/04 11:30:31 INFO AbstractService: Service:ThriftBinaryCLIService is started. 14/11/04 11:30:31 INFO AbstractService: Service:HiveServer2 is started. 14/11/04 11:30:31 INFO HiveThriftServer2: HiveThriftServer2 started 14/11/04 11:30:31 INFO ThriftCLIService: ThriftBinaryCLIService listening on 0.0.0.0/0.0.0.0:1 14/11/04 11:33:26 INFO ThriftCLIService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V6 14/11/04 11:33:26 INFO HiveMetaStore: No user is added in admin role, since config is empty 14/11/04 11:33:26 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr. 14/11/04 11:33:26 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr. 14/11/04 11:33:26 ERROR TThreadPoolServer: Thrift error occurred during processing of message. org.apache.thrift.protocol.TProtocolException: Cannot write a TUnion with no set value! at org.apache.thrift.TUnion$TUnionStandardScheme.write(TUnion.java:240) at org.apache.thrift.TUnion$TUnionStandardScheme.write(TUnion.java:213) at org.apache.thrift.TUnion.write(TUnion.java:152) at org.apache.hive.service.cli.thrift.TGetInfoResp$TGetInfoRespStandardScheme.write(TGetInfoResp.java:456) at org.apache.hive.service.cli.thrift.TGetInfoResp$TGetInfoRespStandardScheme.write(TGetInfoResp.java:406) at org.apache.hive.service.cli.thrift.TGetInfoResp.write(TGetInfoResp.java:341) at org.apache.hive.service.cli.thrift.TCLIService$GetInfo_result$GetInfo_resultStandardScheme.write(TCLIService.java:3754) at org.apache.hive.service.cli.thrift.TCLIService$GetInfo_result$GetInfo_resultStandardScheme.write(TCLIService.java:3718) at org.apache.hive.service.cli.thrift.TCLIService$GetInfo_result.write(TCLIService.java:3669) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:53) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202548#comment-14202548 ] Norman He commented on SPARK-2447: -- Hi Ted, I have already made some changes in scala for facading and added some tests. Let us discuss early next week. How should I send you the code reviews? -Norman Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But first thoughts is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto flush off for higher through put. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support python? (python may be a different Jira it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (May be in a separate Jira need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4216) Eliminate duplicate Jenkins GitHub posts from AMPLab
[ https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202553#comment-14202553 ] Nicholas Chammas commented on SPARK-4216: - {quote} 1) we carry on w/the duplicate postings (annoying, but not dangerous) 2) spark starts using it's own bot/trigger system (needs a lot of work) (1) for now, (2) when you guys can find some time to make it happen? {quote} Seems sensible to me. Long term, (2) seems like the right thing to do for Spark if we're gonna stay on the AMPLab cluster. Eliminate duplicate Jenkins GitHub posts from AMPLab Key: SPARK-4216 URL: https://issues.apache.org/jira/browse/SPARK-4216 Project: Spark Issue Type: Bug Components: Build, Project Infra Reporter: Nicholas Chammas Priority: Minor * [Real Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873361] * [Imposter Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873366] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4213) SparkSQL - ParquetFilters - No support for LT, LTE, GT, GTE operators
[ https://issues.apache.org/jira/browse/SPARK-4213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4213. - Resolution: Fixed Issue resolved by pull request 3083 [https://github.com/apache/spark/pull/3083] SparkSQL - ParquetFilters - No support for LT, LTE, GT, GTE operators - Key: SPARK-4213 URL: https://issues.apache.org/jira/browse/SPARK-4213 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Environment: CDH5.2, Hive 0.13.1, Spark 1.2 snapshot (commit hash 76386e1a23c) Reporter: Terry Siu Priority: Blocker Fix For: 1.2.0 When I issue a hql query against a HiveContext where my predicate uses a column of string type with one of LT, LTE, GT, or GTE operator, I get the following error: scala.MatchError: StringType (of class org.apache.spark.sql.catalyst.types.StringType$) Looking at the code in org.apache.spark.sql.parquet.ParquetFilters, StringType is absent from the corresponding functions for creating these filters. To reproduce, in a Hive 0.13.1 shell, I created the following table (at a specified DB): create table sparkbug ( id int, event string ) stored as parquet; Insert some sample data: insert into table sparkbug select 1, '2011-06-18' from some table limit 1; insert into table sparkbug select 2, '2012-01-01' from some table limit 1; Launch a spark shell and create a HiveContext to the metastore where the table above is located. import org.apache.spark.sql._ import org.apache.spark.sql.SQLContext import org.apache.spark.sql.hive.HiveContext val hc = new HiveContext(sc) hc.setConf(spark.sql.shuffle.partitions, 10) hc.setConf(spark.sql.hive.convertMetastoreParquet, true) hc.setConf(spark.sql.parquet.compression.codec, snappy) import hc._ hc.hql(select * from db.sparkbug where event = '2011-12-01') A scala.MatchError will appear in the output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
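Schematically (toy types below, not the real ParquetFilters code), the failure comes from a data-type pattern match that simply had no arm for string columns, so building a comparison filter for a StringType predicate fell through to scala.MatchError.
{code}
sealed trait ToyDataType
case object ToyIntegerType extends ToyDataType
case object ToyStringType  extends ToyDataType

object FilterSketch {
  // Before the fix, the equivalent of the ToyStringType case was missing,
  // so a predicate such as event >= '2011-12-01' on a string column ended
  // in a MatchError instead of a pushed-down filter.
  def greaterThanOrEqual(dataType: ToyDataType, column: String, value: Any): String =
    dataType match {
      case ToyIntegerType => s"int($column) >= ${value.asInstanceOf[Int]}"
      case ToyStringType  => s"binary($column) >= '${value.asInstanceOf[String]}'"
    }

  def main(args: Array[String]): Unit =
    println(greaterThanOrEqual(ToyStringType, "event", "2011-12-01"))
}
{code}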
[jira] [Resolved] (SPARK-4272) Add more unwrap functions for primitive type in TableReader
[ https://issues.apache.org/jira/browse/SPARK-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4272. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3136 [https://github.com/apache/spark/pull/3136] Add more unwrap functions for primitive type in TableReader --- Key: SPARK-4272 URL: https://issues.apache.org/jira/browse/SPARK-4272 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor Fix For: 1.2.0 Currently, the data unwrap only support couple of primitive types, not all, it will not cause exception, but may get some performance in table scanning for the type like binary, date, timestamp, decimal etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4203) Partition directories in random order when inserting into hive table
[ https://issues.apache.org/jira/browse/SPARK-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4203. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3076 [https://github.com/apache/spark/pull/3076] Partition directories in random order when inserting into hive table Key: SPARK-4203 URL: https://issues.apache.org/jira/browse/SPARK-4203 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.2.0 Reporter: Matthew Taylor Fix For: 1.2.0 When doing an insert into a hive table with partitions, the folders written to the file system are in a random order instead of the order defined in table creation. It seems that the loadPartition method in Hive.java takes a Map<String,String> parameter but expects to be called with a map that has a defined ordering, such as LinkedHashMap. I have a patch which I will do a PR for. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
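The ordering point is easy to see in isolation (a standalone illustration, unrelated to the Hive codepath itself): a plain java.util.HashMap forgets insertion order, while LinkedHashMap preserves it, which is what the partition-spec map needs for the directories to come out in declaration order.
{code}
import java.util.{HashMap, LinkedHashMap}

object PartitionOrderDemo {
  def main(args: Array[String]): Unit = {
    val plain  = new HashMap[String, String]()
    val linked = new LinkedHashMap[String, String]()
    for (k <- Seq("year", "month", "day", "hour")) {
      plain.put(k, "1")
      linked.put(k, "1")
    }
    println("HashMap keys:       " + plain.keySet())  // arbitrary iteration order
    println("LinkedHashMap keys: " + linked.keySet()) // year, month, day, hour
  }
}
{code}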
[jira] [Resolved] (SPARK-4292) incorrect result set in JDBC/ODBC
[ https://issues.apache.org/jira/browse/SPARK-4292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4292. - Resolution: Fixed Issue resolved by pull request 3149 [https://github.com/apache/spark/pull/3149] incorrect result set in JDBC/ODBC - Key: SPARK-4292 URL: https://issues.apache.org/jira/browse/SPARK-4292 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Fix For: 1.2.0 select * from src, get result as follows: | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4302) Make jsonRDD/jsonFile support more field data types
Yin Huai created SPARK-4302: --- Summary: Make jsonRDD/jsonFile support more field data types Key: SPARK-4302 URL: https://issues.apache.org/jira/browse/SPARK-4302 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Yin Huai Since we allow users to specify schemas, jsonRDD/jsonFile should support all Spark SQL data types in the provided schema. A related post in mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/jsonRdd-and-MapType-td18376.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
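For reference, this is the usage pattern the issue is about, written against the 1.1-era API (the sample JSON and table name are made up): the caller supplies a schema, here including a MapType field, and jsonRDD should honor every Spark SQL data type that schema can express.
{code}
import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)
val json = sc.parallelize(Seq("""{"name":"a","props":{"k1":"v1","k2":"v2"}}"""))
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("props", MapType(StringType, StringType), nullable = true)))
val docs = sqlContext.jsonRDD(json, schema)
docs.registerTempTable("docs")
docs.printSchema()
{code}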
[jira] [Created] (SPARK-4303) [MLLIB] Use Long IDs instead of Int in ALS.Rating class
Jia Xu created SPARK-4303: - Summary: [MLLIB] Use Long IDs instead of Int in ALS.Rating class Key: SPARK-4303 URL: https://issues.apache.org/jira/browse/SPARK-4303 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jia Xu In many big data recommendation applications, the IDs used are usually Long type instead of Integer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4303) [MLLIB] Use Long IDs instead of Int in ALS.Rating class
[ https://issues.apache.org/jira/browse/SPARK-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jia Xu updated SPARK-4303: -- Description: In many big data recommendation applications, the IDs used for users and products are usually Long type instead of Integer. (was: In many big data recommendation applications, the IDs used are usually Long type instead of Integer. ) [MLLIB] Use Long IDs instead of Int in ALS.Rating class --- Key: SPARK-4303 URL: https://issues.apache.org/jira/browse/SPARK-4303 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jia Xu In many big data recommendation applications, the IDs used for users and products are usually Long type instead of Integer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4303) [MLLIB] Use Long IDs instead of Int in ALS.Rating class
[ https://issues.apache.org/jira/browse/SPARK-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jia Xu updated SPARK-4303: -- Description: In many big data recommendation applications, the IDs used for users and products are usually Long type instead of Integer. So a Rating class based on Long IDs should be more useful for these applications. i.e. case class Rating(val user: Long, val product: Long, val rating: Double) was: In many big data recommendation applications, the IDs used for users and products are usually Long type instead of Integer. So a Rating class based on Long IDs should be more useful for these applications. case class Rating(val user: Long, val product: Long, val rating: Double) [MLLIB] Use Long IDs instead of Int in ALS.Rating class --- Key: SPARK-4303 URL: https://issues.apache.org/jira/browse/SPARK-4303 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jia Xu In many big data recommendation applications, the IDs used for users and products are usually Long type instead of Integer. So a Rating class based on Long IDs should be more useful for these applications. i.e. case class Rating(val user: Long, val product: Long, val rating: Double) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4303) [MLLIB] Use Long IDs instead of Int in ALS.Rating class
[ https://issues.apache.org/jira/browse/SPARK-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jia Xu updated SPARK-4303: -- Description: In many big data recommendation applications, the IDs used for users and products are usually Long type instead of Integer. So a Rating class based on Long IDs should be more useful for these applications. case class Rating(val user: Long, val product: Long, val rating: Double) was:In many big data recommendation applications, the IDs used for users and products are usually Long type instead of Integer. [MLLIB] Use Long IDs instead of Int in ALS.Rating class --- Key: SPARK-4303 URL: https://issues.apache.org/jira/browse/SPARK-4303 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jia Xu In many big data recommendation applications, the IDs used for users and products are usually Long type instead of Integer. So a Rating class based on Long IDs should be more useful for these applications. case class Rating(val user: Long, val product: Long, val rating: Double) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2381) streaming receiver crashed,but seems nothing happened
[ https://issues.apache.org/jira/browse/SPARK-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202798#comment-14202798 ] Apache Spark commented on SPARK-2381: - User 'joyyoj' has created a pull request for this issue: https://github.com/apache/spark/pull/1693 streaming receiver crashed,but seems nothing happened - Key: SPARK-2381 URL: https://issues.apache.org/jira/browse/SPARK-2381 Project: Spark Issue Type: Bug Components: Streaming Reporter: sunsc When we submit a streaming job and the receivers don't start normally, the application should stop itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2360) CSV import to SchemaRDDs
[ https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202805#comment-14202805 ] Hossein Falaki commented on SPARK-2360: --- Sure. CSV import to SchemaRDDs Key: SPARK-2360 URL: https://issues.apache.org/jira/browse/SPARK-2360 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Hossein Falaki I think the first step it to design the interface that we want to present to users. Mostly this is defining options when importing. Off the top of my head: - What is the separator? - Provide column names or infer them from the first row. - how to handle multiple files with possibly different schemas - do we have a method to let users specify the datatypes of the columns or are they just strings? - what types of quoting / escaping do we want to support? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-2360) CSV import to SchemaRDDs
[ https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki closed SPARK-2360. - This will be a package using Data Source API CSV import to SchemaRDDs Key: SPARK-2360 URL: https://issues.apache.org/jira/browse/SPARK-2360 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Hossein Falaki I think the first step it to design the interface that we want to present to users. Mostly this is defining options when importing. Off the top of my head: - What is the separator? - Provide column names or infer them from the first row. - how to handle multiple files with possibly different schemas - do we have a method to let users specify the datatypes of the columns or are they just strings? - what types of quoting / escaping do we want to support? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2447: - Target Version/s: (was: 1.2.0) Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But first thoughts is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto flush off for higher through put. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support python? (python may be a different Jira it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (May be in a separate Jira need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3754) Spark Streaming fileSystem API is not callable from Java
[ https://issues.apache.org/jira/browse/SPARK-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-3754: - Target Version/s: 1.2.0 Spark Streaming fileSystem API is not callable from Java Key: SPARK-3754 URL: https://issues.apache.org/jira/browse/SPARK-3754 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0, 1.1.0 Reporter: holdenk Assignee: Holden Karau The Spark Streaming Java API for fileSystem is not callable from Java. We should do something like with how it is handled in the Java Spark Context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3754) Spark Streaming fileSystem API is not callable from Java
[ https://issues.apache.org/jira/browse/SPARK-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-3754: - Priority: Critical (was: Major) Spark Streaming fileSystem API is not callable from Java Key: SPARK-3754 URL: https://issues.apache.org/jira/browse/SPARK-3754 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0, 1.1.0 Reporter: holdenk Assignee: Holden Karau Priority: Critical The Spark Streaming Java API for fileSystem is not callable from Java. We should do something like with how it is handled in the Java Spark Context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3754) Spark Streaming fileSystem API is not callable from Java
[ https://issues.apache.org/jira/browse/SPARK-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-3754: - Affects Version/s: 1.0.0 1.1.0 Spark Streaming fileSystem API is not callable from Java Key: SPARK-3754 URL: https://issues.apache.org/jira/browse/SPARK-3754 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0, 1.1.0 Reporter: holdenk Assignee: Holden Karau Priority: Critical The Spark Streaming Java API for fileSystem is not callable from Java. We should do something like with how it is handled in the Java Spark Context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4303) [MLLIB] Use Long IDs instead of Int in ALS.Rating class
[ https://issues.apache.org/jira/browse/SPARK-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4303. -- Resolution: Duplicate I think this is an exact duplicate of https://issues.apache.org/jira/browse/SPARK-2465. See the discussion about why this won't be committed in the foreseeable future. I agree there are some arguments for long but fair enough there are downsides too. [MLLIB] Use Long IDs instead of Int in ALS.Rating class --- Key: SPARK-4303 URL: https://issues.apache.org/jira/browse/SPARK-4303 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jia Xu In many big data recommendation applications, the IDs used for users and products are usually Long type instead of Integer. So a Rating class based on Long IDs should be more useful for these applications. i.e. case class Rating(val user: Long, val product: Long, val rating: Double) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
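As a stopgap (an illustration, not something proposed in this thread), Long IDs can be used with the existing Int-based API by assigning each distinct ID a dense Int index and keeping the reverse mapping; this assumes the distinct ID sets are small enough to collect to the driver.
{code}
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val raw = sc.parallelize(Seq(
  (10000000001L, 20000000007L, 4.0),
  (10000000002L, 20000000007L, 3.0),
  (10000000001L, 20000000009L, 5.0)))

// Dense Int indices for the Long IDs (collected to the driver, so this only
// works when the number of distinct users/products is modest).
val userIndex    = raw.map(_._1).distinct().zipWithIndex().mapValues(_.toInt).collectAsMap()
val productIndex = raw.map(_._2).distinct().zipWithIndex().mapValues(_.toInt).collectAsMap()

val ratings = raw.map { case (u, p, r) => Rating(userIndex(u), productIndex(p), r) }
val model = ALS.train(ratings, 10, 10, 0.01)
{code}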
[jira] [Created] (SPARK-4304) sortByKey() will fail on empty RDD
Davies Liu created SPARK-4304: - Summary: sortByKey() will fail on empty RDD Key: SPARK-4304 URL: https://issues.apache.org/jira/browse/SPARK-4304 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0, 1.0.2, 1.2.0 Reporter: Davies Liu Priority: Blocker {code} sc.parallelize(zip(range(4), range(0)), 5).sortByKey().count() Traceback (most recent call last): File stdin, line 1, in module File /Users/davies/work/spark/python/pyspark/rdd.py, line 532, in sortByKey for i in range(0, numPartitions - 1)] IndexError: list index out of range {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4304) sortByKey() will fail on empty RDD
[ https://issues.apache.org/jira/browse/SPARK-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203092#comment-14203092 ] Apache Spark commented on SPARK-4304: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3162 sortByKey() will fail on empty RDD -- Key: SPARK-4304 URL: https://issues.apache.org/jira/browse/SPARK-4304 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: Davies Liu Priority: Blocker {code} sc.parallelize(zip(range(4), range(0)), 5).sortByKey().count() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/davies/work/spark/python/pyspark/rdd.py", line 532, in sortByKey for i in range(0, numPartitions - 1)] IndexError: list index out of range {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4304) sortByKey() will fail on empty RDD
[ https://issues.apache.org/jira/browse/SPARK-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203102#comment-14203102 ] Apache Spark commented on SPARK-4304: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3163 sortByKey() will fail on empty RDD -- Key: SPARK-4304 URL: https://issues.apache.org/jira/browse/SPARK-4304 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: Davies Liu Priority: Blocker {code} sc.parallelize(zip(range(4), range(0)), 5).sortByKey().count() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/davies/work/spark/python/pyspark/rdd.py", line 532, in sortByKey for i in range(0, numPartitions - 1)] IndexError: list index out of range {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4303) [MLLIB] Use Long IDs instead of Int in ALS.Rating class
[ https://issues.apache.org/jira/browse/SPARK-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203147#comment-14203147 ] Matei Zaharia commented on SPARK-4303: -- Yup, this will actually become easier with the new pipeline API, but it's probably not going to happen in 1.2. [MLLIB] Use Long IDs instead of Int in ALS.Rating class --- Key: SPARK-4303 URL: https://issues.apache.org/jira/browse/SPARK-4303 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jia Xu In many big data recommendation applications, the IDs used for users and products are usually Long type instead of Integer. So a Rating class based on Long IDs should be more useful for these applications. i.e. case class Rating(val user: Long, val product: Long, val rating: Double) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4289) Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.
[ https://issues.apache.org/jira/browse/SPARK-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203183#comment-14203183 ] Corey J. Nolet commented on SPARK-4289: --- I suppose we could look at it as a Hadoop issue, though newing up a Job works fine without the Scala shell doing the toString(). I'd have to dive in deeper to find out why the states seem to be different between the constructor and the toString(), and even more importantly, why it cares... I think :silent will work for the short term. Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance. -- Key: SPARK-4289 URL: https://issues.apache.org/jira/browse/SPARK-4289 Project: Spark Issue Type: Bug Reporter: Corey J. Nolet This one is easy to reproduce. {code}val job = new Job(sc.hadoopConfiguration){code} I'm not sure what the solution would be offhand as it's happening when the shell is calling toString() on the instance of Job. The problem is, because of the failure, the instance is never actually assigned to the job val. java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283) at org.apache.hadoop.mapreduce.Job.toString(Job.java:452) at scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324) at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329) at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337) at .<init>(<console>:10) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at 
org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
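Editor's note: the failure occurs because the shell echoes each result, which calls Hadoop's Job.toString(), and that method throws unless the job is in RUNNING state. The :silent workaround mentioned in the comment above simply stops the REPL from printing results, so toString() is never invoked and the val is assigned. Below is an illustrative spark-shell session, not captured output.
{code}
// spark-shell sketch (illustrative, not captured output).
// :silent toggles the REPL's automatic printing of results, so the shell never
// calls toString() on the new Job and the assignment succeeds.
scala> :silent
scala> import org.apache.hadoop.mapreduce.Job
scala> val job = new Job(sc.hadoopConfiguration)
scala> :silent
{code}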
[jira] [Resolved] (SPARK-4304) sortByKey() will fail on empty RDD
[ https://issues.apache.org/jira/browse/SPARK-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4304. --- Resolution: Fixed Fix Version/s: 1.0.3 1.1.1 1.2.0 Issue resolved by pull request 3163 [https://github.com/apache/spark/pull/3163] sortByKey() will fail on empty RDD -- Key: SPARK-4304 URL: https://issues.apache.org/jira/browse/SPARK-4304 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: Davies Liu Priority: Blocker Fix For: 1.2.0, 1.1.1, 1.0.3 {code} sc.parallelize(zip(range(4), range(0)), 5).sortByKey().count() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/davies/work/spark/python/pyspark/rdd.py", line 532, in sortByKey for i in range(0, numPartitions - 1)] IndexError: list index out of range {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4304) sortByKey() will fail on empty RDD
[ https://issues.apache.org/jira/browse/SPARK-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4304: -- Assignee: Davies Liu sortByKey() will fail on empty RDD -- Key: SPARK-4304 URL: https://issues.apache.org/jira/browse/SPARK-4304 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker Fix For: 1.1.1, 1.2.0, 1.0.3 {code} sc.parallelize(zip(range(4), range(0)), 5).sortByKey().count() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/davies/work/spark/python/pyspark/rdd.py", line 532, in sortByKey for i in range(0, numPartitions - 1)] IndexError: list index out of range {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4221) Allow access to nonnegative ALS from python
[ https://issues.apache.org/jira/browse/SPARK-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4221. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3095 [https://github.com/apache/spark/pull/3095] Allow access to nonnegative ALS from python --- Key: SPARK-4221 URL: https://issues.apache.org/jira/browse/SPARK-4221 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Michelangelo D'Agostino Assignee: Michelangelo D'Agostino Fix For: 1.2.0 SPARK-1553 added alternating nonnegative least squares to MLlib; however, it's not possible to access it via the Python API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4221) Allow access to nonnegative ALS from python
[ https://issues.apache.org/jira/browse/SPARK-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4221: - Assignee: Michelangelo D'Agostino Allow access to nonnegative ALS from python --- Key: SPARK-4221 URL: https://issues.apache.org/jira/browse/SPARK-4221 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Michelangelo D'Agostino Assignee: Michelangelo D'Agostino Fix For: 1.2.0 SPARK-1553 added alternating nonnegative least squares to MLlib; however, it's not possible to access it via the Python API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
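Editor's note: the Python hook added here wraps MLlib's existing Scala implementation from SPARK-1553, which exposes the constraint through a builder-style setter. The snippet below is a hedged sketch of that Scala-side usage for reference; the object name and parameter values are illustrative, and the exact Python keyword introduced by the pull request is not reproduced here.
{code}
// Hedged sketch of the Scala-side nonnegative ALS usage (SPARK-1553);
// names and parameter values are illustrative.
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object NonnegativeAlsExample {
  def train(ratings: RDD[Rating]) =
    new ALS()
      .setRank(10)
      .setIterations(10)
      .setNonnegative(true)   // constrain user and product factors to be nonnegative
      .run(ratings)
}
{code}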
[jira] [Updated] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-3821: Attachment: packer-proposal.html Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203280#comment-14203280 ] Nicholas Chammas commented on SPARK-3821: - After much dilly-dallying, I am happy to present: * A brief proposal / design doc ([fixed JIRA attachment | https://issues.apache.org/jira/secure/attachment/12680371/packer-proposal.html], [md file on GitHub | https://github.com/nchammas/spark-ec2/blob/packer/packer/proposal.md]) * [Initial implementation | https://github.com/nchammas/spark-ec2/tree/packer/packer] and [README | https://github.com/nchammas/spark-ec2/blob/packer/packer/README.md] * New AMIs generated by this implementation: [Base AMIs | https://github.com/nchammas/spark-ec2/tree/packer/ami-list/base], [Spark 1.1.0 Pre-Installed | https://github.com/nchammas/spark-ec2/tree/packer/ami-list/1.1.0] To try out the new AMIs with {{spark-ec2}}, you'll need to update [these | https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L47] [two | https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L593] lines (well, really, just the first one) to point to [my {{spark-ec2}} repo on the {{packer}} branch | https://github.com/nchammas/spark-ec2/tree/packer/packer]. Your candid feedback and/or improvements are most welcome! Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4291) Drop Code from network module names
[ https://issues.apache.org/jira/browse/SPARK-4291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4291. Resolution: Fixed Fix Version/s: 1.2.0 Drop Code from network module names - Key: SPARK-4291 URL: https://issues.apache.org/jira/browse/SPARK-4291 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Fix For: 1.2.0 In Maven, the network modules have the suffix "Code", which is inconsistent with the other modules. {code} [INFO] Reactor Build Order: [INFO] [INFO] Spark Project Parent POM [INFO] Spark Project Common Network Code [INFO] Spark Project Shuffle Streaming Service Code [INFO] Spark Project Core [INFO] Spark Project Bagel [INFO] Spark Project GraphX [INFO] Spark Project Streaming [INFO] Spark Project Catalyst [INFO] Spark Project SQL [INFO] Spark Project ML Library [INFO] Spark Project Tools [INFO] Spark Project Hive [INFO] Spark Project REPL [INFO] Spark Project YARN Parent POM [INFO] Spark Project YARN Stable API [INFO] Spark Project Assembly [INFO] Spark Project External Twitter [INFO] Spark Project External Kafka [INFO] Spark Project External Flume Sink [INFO] Spark Project External Flume [INFO] Spark Project External ZeroMQ [INFO] Spark Project External MQTT [INFO] Spark Project Examples [INFO] Spark Project Yarn Shuffle Service Code {code} My proposal is to drop the suffix, especially before these module names make it into an official release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3648) Provide a script for fetching remote PR's for review
[ https://issues.apache.org/jira/browse/SPARK-3648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203287#comment-14203287 ] Apache Spark commented on SPARK-3648: - User 'pwendell' has created a pull request for this issue: https://github.com/apache/spark/pull/3165 Provide a script for fetching remote PR's for review Key: SPARK-3648 URL: https://issues.apache.org/jira/browse/SPARK-3648 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Patrick Wendell Assignee: Patrick Wendell I've found it's useful to have a small utility script for fetching specific pull requests locally when doing reviews. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org