[jira] [Updated] (SPARK-6667) hang while collecting in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-6667: -- Affects Version/s: 1.4.0, 1.3.1 hang while collecting in PySpark - Key: SPARK-6667 URL: https://issues.apache.org/jira/browse/SPARK-6667 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1, 1.4.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Critical PySpark tests hang while collecting:
[jira] [Commented] (SPARK-6578) Outbound channel in network library is not thread-safe, can lead to fetch failures
[ https://issues.apache.org/jira/browse/SPARK-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391731#comment-14391731 ] Apache Spark commented on SPARK-6578: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/5319 Outbound channel in network library is not thread-safe, can lead to fetch failures -- Key: SPARK-6578 URL: https://issues.apache.org/jira/browse/SPARK-6578 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Blocker Fix For: 1.3.1, 1.4.0 There is a very narrow race in the outbound channel of the network library. While netty guarantees that the inbound channel is thread-safe, the same is not true for the outbound channel: multiple threads can be writing and running the pipeline at the same time. This leads to an issue with MessageEncoder and the optimization it performs for zero-copy of file data: since a single RPC can be broken into multiple buffers (for example, when replying to a chunk request), if you have multiple threads writing these RPCs then they can be mixed up in the final socket. That breaks framing and will cause the receiving side to not understand the messages. Patch coming up shortly.
[jira] [Commented] (SPARK-6578) Outbound channel in network library is not thread-safe, can lead to fetch failures
[ https://issues.apache.org/jira/browse/SPARK-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391737#comment-14391737 ] Reynold Xin commented on SPARK-6578: We should patch 1.2.x too. [~vanzin] mind submitting a patch for that branch?
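The hazard described above, two buffers belonging to one RPC being enqueued non-atomically, maps to a standard Netty remedy: hop onto the channel's event loop before writing. The sketch below is my own illustration of that remedy against Netty 4's public Channel API, not the code from the PR linked in the comment.

{code}
import io.netty.channel.Channel

// Sketch only: serialize multi-buffer writes through the event loop, so the
// header and body of one RPC are always enqueued back-to-back even when many
// threads send messages on the same channel.
def writeMessage(channel: Channel, header: AnyRef, body: AnyRef): Unit = {
  val doWrite = new Runnable {
    override def run(): Unit = {
      channel.write(header)        // part 1 of the frame
      channel.writeAndFlush(body)  // part 2; nothing can interleave in between
    }
  }
  if (channel.eventLoop.inEventLoop) doWrite.run()
  else channel.eventLoop.execute(doWrite) // funnel onto the single writer thread
}
{code}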
[jira] [Resolved] (SPARK-6580) Optimize LogisticRegressionModel.predictPoint
[ https://issues.apache.org/jira/browse/SPARK-6580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6580. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5249 [https://github.com/apache/spark/pull/5249] Optimize LogisticRegressionModel.predictPoint - Key: SPARK-6580 URL: https://issues.apache.org/jira/browse/SPARK-6580 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Yanbo Liang Priority: Minor Fix For: 1.4.0 LogisticRegressionModel.predictPoint could be optimized somewhat. There are several checks which could be moved outside loops, or even outside predictPoint into initialization of the model. Some include: {code} require(numFeatures == weightMatrix.size) val dataWithBiasSize = weightMatrix.size / (numClasses - 1) val weightsArray = weightMatrix match { ... if (dataMatrix.size + 1 == dataWithBiasSize) {... {code} Also, for multiclass, the two loops (over numClasses and margins) could be combined into one.
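A minimal sketch of the hoisting idea, not the actual MLlib patch: do the size checks and layout computation once at construction, leaving predictPoint with a single tight loop per class. The class and layout (one weight block per non-pivot class, with an optional trailing bias term) are assumptions for illustration.

{code}
// Sketch under an assumed layout: (numClasses - 1) weight blocks; each block
// may carry an appended bias term. All invariants are checked once, up front.
class FastLRModel(weights: Array[Double], numFeatures: Int, numClasses: Int) {
  private val stride = weights.length / (numClasses - 1)
  require(stride == numFeatures || stride == numFeatures + 1)
  private val withBias = stride == numFeatures + 1

  def predictPoint(x: Array[Double]): Double = {
    require(x.length == numFeatures)
    var bestClass = 0
    var maxMargin = 0.0 // class 0 is the pivot; it wins unless some margin is positive
    var i = 0
    while (i < numClasses - 1) { // one combined loop: margins computed and compared inline
      var margin = 0.0
      var j = 0
      while (j < numFeatures) { margin += x(j) * weights(i * stride + j); j += 1 }
      if (withBias) margin += weights(i * stride + numFeatures)
      if (margin > maxMargin) { maxMargin = margin; bestClass = i + 1 }
      i += 1
    }
    bestClass.toDouble
  }
}
{code}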
[jira] [Updated] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated SPARK-5682: Attachment: (was: Design Document of Encrypted Spark Shuffle_20150401.docx) Add encrypted shuffle in spark -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150402.docx Encrypted shuffle is enabled in Hadoop 2.6, which makes the process of shuffling data safer. This feature is necessary in Spark. AES is a specification for the encryption of electronic data; it has five common modes of operation, and CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; both are also used in Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption algorithms OpenSSL provides. Because UGI credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the Spark-on-YARN framework.
[jira] [Updated] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated SPARK-5682: Attachment: Design Document of Encrypted Spark Shuffle_20150402.docx
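For readers unfamiliar with the CTR mode the two codecs wrap, here is a minimal JCE round trip using only the plain JDK provider. It is independent of whatever codec API the design documents propose; the key and IV values are demo placeholders.

{code}
import javax.crypto.Cipher
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

// AES/CTR with the JDK provider, the primitive JceAesCtrCryptoCodec builds on.
val key = new SecretKeySpec(Array.fill[Byte](16)(0x01), "AES") // demo key only, never hard-code one
val iv  = new IvParameterSpec(Array.fill[Byte](16)(0x02))      // demo counter block

val enc = Cipher.getInstance("AES/CTR/NoPadding")
enc.init(Cipher.ENCRYPT_MODE, key, iv)
val cipherText = enc.doFinal("shuffle block bytes".getBytes("UTF-8"))

// CTR is symmetric: decrypting with the same key and counter block restores the plaintext.
val dec = Cipher.getInstance("AES/CTR/NoPadding")
dec.init(Cipher.DECRYPT_MODE, key, iv)
assert(new String(dec.doFinal(cipherText), "UTF-8") == "shuffle block bytes")
{code}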
[jira] [Resolved] (SPARK-6660) MLLibPythonAPI.pythonToJava doesn't recognize object arrays
[ https://issues.apache.org/jira/browse/SPARK-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6660. -- Resolution: Fixed Fix Version/s: 1.4.0, 1.3.1 Issue resolved by pull request 5318 [https://github.com/apache/spark/pull/5318] MLLibPythonAPI.pythonToJava doesn't recognize object arrays --- Key: SPARK-6660 URL: https://issues.apache.org/jira/browse/SPARK-6660 Project: Spark Issue Type: Bug Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.3.1, 1.4.0 {code} points = MLUtils.loadLabeledPoints(sc, ...) _to_java_object_rdd(points).count() {code} throws exception {code} --- Py4JJavaError Traceback (most recent call last) <ipython-input-22-5b481e99a111> in <module>() ----> 1 jrdd.count() /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args) 536 answer = self.gateway_client.send_command(command) 537 return_value = get_return_value(answer, self.gateway_client, --> 538 self.target_id, self.name) 539 540 for temp_arg in temp_args: /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 298 raise Py4JJavaError( 299 'An error occurred while calling {0}{1}{2}.\n'. --> 300 format(target_id, '.', name), value) 301 else: 302 raise Py4JError( Py4JJavaError: An error occurred while calling o510.count. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 18 in stage 114.0 failed 4 times, most recent failure: Lost task 18.3 in stage 114.0 (TID 1133, ip-10-0-130-35.us-west-2.compute.internal): java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1090) at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1087) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1472) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1006) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1006) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1497) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1497) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) {code}
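The ClassCastException comes from unconditionally casting each unpickled batch to java.util.ArrayList. A hedged sketch of a more tolerant deserialization follows; the method and variable names are mine for illustration, not the ones in pull request 5318.

{code}
import scala.collection.JavaConverters._

// Accept both shapes the pickler can hand back: a java.util.List for most
// batches, or a raw Object[] for object arrays, the case the old cast missed.
def batchToIterator(obj: Any): Iterator[Any] = obj match {
  case list: java.util.List[_] => list.asScala.iterator
  case arr: Array[_]           => arr.iterator
  case single                  => Iterator.single(single)
}
{code}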
[jira] [Commented] (SPARK-5989) Model import/export for LDAModel
[ https://issues.apache.org/jira/browse/SPARK-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392047#comment-14392047 ] Joseph K. Bradley commented on SPARK-5989: -- If there are other tasks on your plate, I would prioritize those ahead of this. Model import/export for LDAModel Key: SPARK-5989 URL: https://issues.apache.org/jira/browse/SPARK-5989 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Manoj Kumar Add save/load for LDAModel and its local and distributed variants.
[jira] [Commented] (SPARK-5989) Model import/export for LDAModel
[ https://issues.apache.org/jira/browse/SPARK-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392046#comment-14392046 ] Joseph K. Bradley commented on SPARK-5989: -- Yes, but this may be affected by this PR, which I aim to review very soon: [https://github.com/apache/spark/pull/4807]
[jira] [Reopened] (SPARK-6575) Add configuration to disable schema merging while converting metastore Parquet tables
[ https://issues.apache.org/jira/browse/SPARK-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reopened SPARK-6575: - Add configuration to disable schema merging while converting metastore Parquet tables - Key: SPARK-6575 URL: https://issues.apache.org/jira/browse/SPARK-6575 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker Fix For: 1.3.1, 1.4.0 Consider a metastore Parquet table that # doesn't have a schema evolution issue # has lots of data files and/or partitions In this case, driver-side schema merging can be both slow and unnecessary. It would be good to have a configuration to let the user disable schema merging when converting such a metastore Parquet table.
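If the patch lands as described, usage would look something like the sketch below. The exact property key is an assumption based on this ticket, so verify it against the released 1.3.1/1.4.0 documentation; the table name is hypothetical.

{code}
// Assumed knob from this ticket: skip driver-side Parquet schema merging when
// converting a metastore Parquet table without schema evolution.
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet.mergeSchema", "false")

// Subsequent lookups take the schema from the metastore instead of reading
// every data file's footer on the driver.
val orders = sqlContext.table("warehouse_orders") // hypothetical table name
orders.printSchema()
{code}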
[jira] [Updated] (SPARK-6667) hang while collecting in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-6667: -- Priority: Critical (was: Major)
[jira] [Commented] (SPARK-6660) MLLibPythonAPI.pythonToJava doesn't recognize object arrays
[ https://issues.apache.org/jira/browse/SPARK-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391720#comment-14391720 ] Apache Spark commented on SPARK-6660: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/5318
[jira] [Assigned] (SPARK-6660) MLLibPythonAPI.pythonToJava doesn't recognize object arrays
[ https://issues.apache.org/jira/browse/SPARK-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6660: --- Assignee: Xiangrui Meng (was: Apache Spark)
[jira] [Resolved] (SPARK-6576) DenseMatrix in PySpark should support indexing
[ https://issues.apache.org/jira/browse/SPARK-6576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6576. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5232 [https://github.com/apache/spark/pull/5232] DenseMatrix in PySpark should support indexing -- Key: SPARK-6576 URL: https://issues.apache.org/jira/browse/SPARK-6576 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Manoj Kumar Priority: Minor Fix For: 1.4.0
[jira] [Updated] (SPARK-6576) DenseMatrix in PySpark should support indexing
[ https://issues.apache.org/jira/browse/SPARK-6576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6576: - Assignee: Manoj Kumar
[jira] [Assigned] (SPARK-6663) Use Literal.create instead of constructor
[ https://issues.apache.org/jira/browse/SPARK-6663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6663: --- Assignee: Davies Liu (was: Apache Spark) Use Literal.create instead of constructor - Key: SPARK-6663 URL: https://issues.apache.org/jira/browse/SPARK-6663 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu In order to do type checking and conversion, we should use Literal.create() instead of the constructor to create a Literal with a DataType.
[jira] [Commented] (SPARK-6663) Use Literal.create instead of constructor
[ https://issues.apache.org/jira/browse/SPARK-6663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391861#comment-14391861 ] Apache Spark commented on SPARK-6663: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/5320
[jira] [Assigned] (SPARK-6663) Use Literal.create instead of constructor
[ https://issues.apache.org/jira/browse/SPARK-6663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6663: --- Assignee: Apache Spark (was: Davies Liu)
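In catalyst terms, the difference the ticket describes looks roughly like this. This is a sketch against the internal 1.x expression API, which is not a stable public surface; Literal.create(value, dataType) is the factory the ticket asks for.

{code}
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.IntegerType

// Literal.create checks the value against the given DataType and converts it
// to catalyst's internal representation; the bare constructor only infers a
// type from the runtime class, with no checking or conversion.
val checked  = Literal.create(1, IntegerType)
val inferred = Literal(1)
{code}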
[jira] [Assigned] (SPARK-6639) Create a new script to start multiple masters
[ https://issues.apache.org/jira/browse/SPARK-6639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6639: --- Assignee: Apache Spark Create a new script to start multiple masters - Key: SPARK-6639 URL: https://issues.apache.org/jira/browse/SPARK-6639 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 1.3.0 Environment: all Reporter: Tao Wang Assignee: Apache Spark Priority: Minor Labels: patch Original Estimate: 336h Remaining Estimate: 336h The start-slaves.sh script is able to read from the slaves file and start slave nodes on multiple boxes. However, in standalone mode, if I want to use multiple masters, I'll have to start masters on each individual box, and also need to provide the list of masters' hostname+port to each worker (start-slaves.sh only takes one master ip+port for now). I wonder, should we create a new script called start-masters.sh to read a conf/masters file? The start-slaves.sh script may also need to change a little bit so that the master list can be passed to worker nodes.
[jira] [Commented] (SPARK-6639) Create a new script to start multiple masters
[ https://issues.apache.org/jira/browse/SPARK-6639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392037#comment-14392037 ] Apache Spark commented on SPARK-6639: - User 'wangzhonnew' has created a pull request for this issue: https://github.com/apache/spark/pull/5323
[jira] [Assigned] (SPARK-6639) Create a new script to start multiple masters
[ https://issues.apache.org/jira/browse/SPARK-6639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6639: --- Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-6661) Python type errors should print type, not object
Joseph K. Bradley created SPARK-6661: Summary: Python type errors should print type, not object Key: SPARK-6661 URL: https://issues.apache.org/jira/browse/SPARK-6661 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor In MLlib PySpark, we sometimes test the type of an object and print an error if the object is of the wrong type. E.g.: [https://github.com/apache/spark/blob/f084c5de14eb10a6aba82a39e03e7877926ebb9e/python/pyspark/mllib/regression.py#L173] These checks should print the type, not the actual object. E.g., if the object cannot be converted to a string, then the check linked above will give a warning like this: {code} TypeError: not all arguments converted during string formatting {code} ...which is weird for the user. There may be other places in the codebase where this is an issue, so we need to check through and verify.
[jira] [Commented] (SPARK-6659) Spark SQL 1.3 cannot read a JSON file with only one record
[ https://issues.apache.org/jira/browse/SPARK-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391898#comment-14391898 ] Adnan Khan commented on SPARK-6659: --- {quote} Spark SQL 1.3 cannot read a JSON file with only one record. Here is my JSON file's content: \{"name":"milo","age",24\} {quote} that's invalid json. there's a colon missing between "age" and 24. i just tried it with valid json from a single record and it works. instead of {{df: org.apache.spark.sql.DataFrame = \[_corrupt_record: string\]}} you should see {{df: org.apache.spark.sql.DataFrame = \[age: bigint, name: string\]}} Spark SQL 1.3 cannot read a JSON file with only one record. Key: SPARK-6659 URL: https://issues.apache.org/jira/browse/SPARK-6659 Project: Spark Issue Type: Bug Reporter: luochenghui Dear friends: Spark SQL 1.3 cannot read a JSON file with only one record. Here is my JSON file's content: {"name":"milo","age",24} When I run Spark SQL in local mode, it throws an exception: org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns _corrupt_record; What I had done: 1. ./spark-shell 2. scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc) sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@5f3be6c8 scala> val df = sqlContext.jsonFile("/home/milo/person.json") 15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=0, maxMem=280248975 15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 159.9 KB, free 267.1 MB) 15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(22692) called with curMem=163705, maxMem=280248975 15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 267.1 MB) 15/03/19 22:11:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:35842 (size: 22.2 KB, free: 267.2 MB) 15/03/19 22:11:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0 15/03/19 22:11:45 INFO SparkContext: Created broadcast 0 from textFile at JSONRelation.scala:98 15/03/19 22:11:47 INFO FileInputFormat: Total input paths to process : 1 15/03/19 22:11:47 INFO SparkContext: Starting job: reduce at JsonRDD.scala:51 15/03/19 22:11:47 INFO DAGScheduler: Got job 0 (reduce at JsonRDD.scala:51) with 1 output partitions (allowLocal=false) 15/03/19 22:11:47 INFO DAGScheduler: Final stage: Stage 0(reduce at JsonRDD.scala:51) 15/03/19 22:11:47 INFO DAGScheduler: Parents of final stage: List() 15/03/19 22:11:47 INFO DAGScheduler: Missing parents: List() 15/03/19 22:11:47 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[3] at map at JsonRDD.scala:51), which has no missing parents 15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(3184) called with curMem=186397, maxMem=280248975 15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 267.1 MB) 15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(2251) called with curMem=189581, maxMem=280248975 15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.2 KB, free 267.1 MB) 15/03/19 22:11:47 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:35842 (size: 2.2 KB, free: 267.2 MB) 15/03/19 22:11:47 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0 15/03/19 22:11:47 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839 15/03/19 22:11:48 INFO DAGScheduler: Submitting 1 missing tasks
from Stage 0 (MapPartitionsRDD[3] at map at JsonRDD.scala:51) 15/03/19 22:11:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks 15/03/19 22:11:48 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1291 bytes) 15/03/19 22:11:48 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 15/03/19 22:11:48 INFO HadoopRDD: Input split: file:/home/milo/person.json:0+26 15/03/19 22:11:48 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 15/03/19 22:11:48 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 15/03/19 22:11:48 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 15/03/19 22:11:48 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 15/03/19 22:11:48 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 15/03/19 22:11:49 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2023 bytes result sent to driver 15/03/19 22:11:49 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1209 ms on localhost (1/1) 15/03/19 22:11:49
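To make Adnan's point above concrete, a one-record file parses fine once the record is valid JSON. This sketch assumes the same spark-shell session and file path as the report; the expected DataFrame signature is taken from Adnan's comment.

{code}
// /home/milo/person.json containing exactly one valid record:
//   {"name":"milo","age":24}
val df = sqlContext.jsonFile("/home/milo/person.json")
// df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
df.registerTempTable("person")
sqlContext.sql("SELECT name, age FROM person").show()
{code}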
[jira] [Created] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
Florian Verhein created SPARK-6664: -- Summary: Split Ordered RDD into multiple RDDs by keys (boundaries or intervals) Key: SPARK-6664 URL: https://issues.apache.org/jira/browse/SPARK-6664 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Florian Verhein I can't find this functionality (if I missed something, apologies!), but it would be very useful for evaluating ML models. Use case example: suppose you have pre-processed web logs for a few months, and now want to split them into a training set (where you train a model to predict some aspect of site accesses, perhaps per user) and an out-of-time test set (where you evaluate how well your model performs in the future). This example has just a single split, but in general you could want more for cross validation. You may also want to have multiple overlapping intervals. Specification: 1. Given an ordered RDD and an ordered sequence of n boundaries (i.e. keys), return n+1 RDDs such that the keys in the i-th RDD fall between the (i-1)-th and i-th boundaries. 2. More complex alternative (but similar under the hood): provide a sequence of possibly overlapping intervals, and return the RDDs containing values within those intervals. Implementation ideas / notes for 1: - The ordered RDDs are likely RangePartitioned (or there should be a simple way to find ranges from partitions in an ordered RDD) - Find the partitions containing the boundaries, and split them in two. - Construct the new RDDs from the original partitions (and any split ones) I suspect this could be done by launching only a few jobs to split the partitions containing the boundaries. Alternatively, it might be possible to decorate these partitions and use them in more than one RDD. I.e. let one of these partitions (for boundary i) be p. Apply two decorators p' and p'', where p' masks out values above the i-th boundary, and p'' masks out values below the i-th boundary. Any operations on these partitions apply only to values not masked out. Then assign p' to the i-th output RDD and p'' to the (i+1)-th output RDD. If I understand Spark correctly, this should not require any jobs. Not sure whether it's worth trying this optimisation. Implementation ideas / notes for 2: This is very similar, except that we have to handle entire partitions (or parts of them) belonging to more than one output RDD, since they are no longer mutually exclusive. But since RDDs are immutable(?), the decorator idea should still work? Thoughts?
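As a baseline for specification (1), here is a naive filter-based sketch; it is my own illustration, not a proposed implementation. It scans the parent once per output RDD, which is exactly the cost the range-partitioner and decorator ideas above try to avoid.

{code}
import org.apache.spark.rdd.RDD

// n sorted boundaries -> n + 1 RDDs; the i-th output holds keys k with
// boundaries(i-1) <= k < boundaries(i), with open ends at both extremes.
def splitByBoundaries[K: Ordering, V](rdd: RDD[(K, V)], boundaries: Seq[K]): Seq[RDD[(K, V)]] = {
  val ord = implicitly[Ordering[K]]
  val los = None +: boundaries.map(Option(_)) // lower bound of each interval
  val his = boundaries.map(Option(_)) :+ None // upper bound of each interval
  los.zip(his).map { case (lo, hi) =>
    rdd.filter { case (k, _) =>
      lo.forall(ord.lteq(_, k)) && hi.forall(ord.gt(_, k))
    }
  }
}
{code}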
[jira] [Created] (SPARK-6667) hang while collecting in PySpark
Davies Liu created SPARK-6667: - Summary: hang while collecting in PySpark Key: SPARK-6667 URL: https://issues.apache.org/jira/browse/SPARK-6667 Project: Spark Issue Type: Bug Components: PySpark Reporter: Davies Liu Assignee: Davies Liu PySpark tests hang while collecting:
[jira] [Reopened] (SPARK-6618) HiveMetastoreCatalog.lookupRelation should use fine-grained lock
[ https://issues.apache.org/jira/browse/SPARK-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reopened SPARK-6618: - HiveMetastoreCatalog.lookupRelation should use fine-grained lock Key: SPARK-6618 URL: https://issues.apache.org/jira/browse/SPARK-6618 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.3.1, 1.4.0 Right now, the entire HiveMetastoreCatalog.lookupRelation method holds a single lock (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L173), and the scope of the lock covers resolving data source tables (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L93). So lookupRelation can be extremely expensive when we are doing expensive operations like Parquet schema discovery. We should use a fine-grained lock for lookupRelation.
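A hedged sketch of the general pattern, with illustrative names rather than the actual HiveMetastoreCatalog code: hold the lock only around cache reads and writes, and run the expensive resolution outside it.

{code}
import scala.collection.mutable

// Double-checked cache lookup: the lock guards only the map, so one table's
// slow Parquet schema discovery no longer blocks every other lookupRelation.
class CatalogSketch[R](resolve: String => R) {
  private val lock  = new Object
  private val cache = mutable.Map.empty[String, R]

  def lookupRelation(table: String): R = {
    lock.synchronized(cache.get(table)) match {
      case Some(rel) => rel
      case None =>
        val rel = resolve(table) // expensive work happens without the lock;
        // two threads may race to resolve the same table, and the cache then
        // keeps whichever result lands first.
        lock.synchronized(cache.getOrElseUpdate(table, rel))
    }
  }
}
{code}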
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390950#comment-14390950 ] Vinay Shukla commented on SPARK-6646: - This use case can benefit from running Spark inside a Mobile App Server. An app server that takes care of horizontal issues such as security, networking, etc. will allow Spark to focus on the real hard problem of data processing in a lightning-fast manner. There is another idea of having Spark leverage [parallel quantum computing | http://people.csail.mit.edu/nhm/pqc.pdf], but I suppose that calls for another JIRA. Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark's project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark's mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html
[jira] [Created] (SPARK-6651) Delegate dense vector arithmetics to the underlying numpy array
Xiangrui Meng created SPARK-6651: Summary: Delegate dense vector arithmetics to the underlying numpy array Key: SPARK-6651 URL: https://issues.apache.org/jira/browse/SPARK-6651 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng It is convenient to delegate dense linear algebra operations to numpy.
[jira] [Commented] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message
[ https://issues.apache.org/jira/browse/SPARK-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391035#comment-14391035 ] Spiro Michaylov commented on SPARK-6587: I appreciate the comment, and clearly I was confused about a couple of things, but I wonder if there's still an interesting RFE here. My example was intended to internalize into case classes some really powerful Spark SQL behavior that I've observed when inferring schema for JSON: {code} val textConflict = sc.parallelize(Seq( """{"key":42}""", """{"key":"hello"}""", """{"key":false}""" ), 4) val jsonConflict = sqlContext.jsonRDD(textConflict) jsonConflict.printSchema() jsonConflict.registerTempTable("conflict") sqlContext.sql("SELECT * FROM conflict").show() {code} Which produces: {noformat} root |-- key: string (nullable = true) key 42 hello false {noformat} This behavior is IMO a *really* nice compromise: a type is inferred, it is approximate, so there are certain things you can't do in the query, but type information is still preserved when returning results from the query. I was trying to help the poster on StackOverflow to achieve similar behavior from case classes, and I thought a hierarchy was necessary. While I was clearly barking up the wrong tree, I wonder: a) Is it intended that these kinds of type conflicts be handled as elegantly when one is using case classes rather than the JSON parser? b) Is there already a way to do it that I failed to find? (Suspicion: no, but I've been wrong before ...) c) If respectively YES and NO, how should the RFE be phrased? Inferring schema for case class hierarchy fails with mysterious message --- Key: SPARK-6587 URL: https://issues.apache.org/jira/browse/SPARK-6587 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: At least Windows 8, Scala 2.11.2. Reporter: Spiro Michaylov (Don't know if this is a functionality bug, error reporting bug or an RFE ...)
I define the following hierarchy: {code} private abstract class MyHolder private case class StringHolder(s: String) extends MyHolder private case class IntHolder(i: Int) extends MyHolder private case class BooleanHolder(b: Boolean) extends MyHolder {code} and a top level case class: {code} private case class Thing(key: Integer, foo: MyHolder) {code} When I try to convert it: {code} val things = Seq( Thing(1, IntHolder(42)), Thing(2, StringHolder("hello")), Thing(3, BooleanHolder(false)) ) val thingsDF = sc.parallelize(things, 4).toDF() thingsDF.registerTempTable("things") val all = sqlContext.sql("SELECT * from things") {code} I get the following stack trace: {noformat} Exception in thread "main" scala.MatchError: sql.CaseClassSchemaProblem.MyHolder (of class scala.reflect.internal.Types$ClassNoArgsTypeRef) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157) at scala.collection.immutable.List.map(List.scala:276) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312) at org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250) at sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35) at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) {noformat} I wrote this to answer [a question on StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql] which uses a much simpler approach and suffers the same problem. Looking at what seems to me to be the
[jira] [Created] (SPARK-6652) SQLContext and HiveContext do not handle tricky names well
Max Seiden created SPARK-6652: - Summary: SQLContext and HiveContext do not handle tricky names well Key: SPARK-6652 URL: https://issues.apache.org/jira/browse/SPARK-6652 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Max Seiden h3. Summary There are cases where both the SQLContext and HiveContext fail to handle tricky names (containing UTF-8, tabs, newlines, etc.) well. For example, the following string: {noformat} val tricky = "Tricky-\u4E2D[x.][\,/\\n * ? é\n$(x)\t(':;#!^-Name" {noformat} causes the following exceptions during parsing and resolution (respectively). h5. SQLContext parse failure {noformat} // pseudocode val data = 0 until 100 val rdd = sc.parallelize(data) val schema = StructType(StructField(tricky, IntegerType, false) :: Nil) val schemaRDD = sqlContext.applySchema(rdd.map(i => Row(i)), schema) schemaRDD.registerAsTable(tricky) sqlContext.sql(s"select `$tricky` from `$tricky`") java.lang.RuntimeException: [1.33] failure: ``UNION'' expected but ErrorToken(``' expected but found) found select `Tricky-中[x.][,/\n * ? é ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174) at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31) at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83) at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:83) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:303) {noformat} h5.
HiveContext resolution failure {noformat} // pseudocode val data = 0 until 100 val rdd = sc.parallelize(data) val schema = StructType(StructField(tricky, IntegerType, false) :: Nil) val schemaRDD = sqlContext.applySchema(rdd.map(i => Row(i)), schema) schemaRDD.registerAsTable(tricky) sqlContext.sql(s"select `$tricky` from `$tricky`").collect() // the parse is ok in this case... 15/04/01 10:41:48 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead 15/04/01 10:41:48 INFO ParseDriver: Parsing command: select `Tricky-中[x.][,/\n * ? é $(x) (':;#!^-Name` from `Tricky-中[x.][,/\n * ? é $(x) (':;#!^-Name` 15/04/01 10:41:48 INFO ParseDriver: Parse Completed // but resolution fails org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: 'Tricky-中[x.][,/\n * ? é $(x) (':;#!^-Name, tree: 'Project ['Tricky-中[x.][,/\n * ? é $(x) (':;#!^-Name] Subquery tricky-中[x.][,/\n * ? é $(x) (':;#!^-name LogicalRDD [Tricky-中[x.][,/\n * ? é $(x) (':;#!^-Name#2], MappedRDD[16] at map at <console>:30 at
[jira] [Assigned] (SPARK-6650) ExecutorAllocationManager never stops
[ https://issues.apache.org/jira/browse/SPARK-6650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6650: --- Assignee: Apache Spark ExecutorAllocationManager never stops - Key: SPARK-6650 URL: https://issues.apache.org/jira/browse/SPARK-6650 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Marcelo Vanzin Assignee: Apache Spark {{ExecutorAllocationManager}} doesn't even have a stop() method. That means that when the owning SparkContext goes away, the internal thread it uses to schedule its activities remains alive. That means it constantly spams the logs and does who knows what else that could affect any future contexts that are allocated. It's particularly evil during unit tests, since it slows down everything else after the suite is run, leaving multiple threads behind.
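A sketch of the obvious remedy; the class name mirrors the report, but the body is my assumption, not the eventual patch: give the manager a thread it owns and expose a stop() for SparkContext to call.

{code}
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}

// Own the scheduling thread explicitly, with a lifecycle matching SparkContext's.
class ExecutorAllocationManagerSketch {
  private val scheduler: ScheduledExecutorService =
    Executors.newSingleThreadScheduledExecutor()

  def start(): Unit = {
    val task = new Runnable {
      override def run(): Unit = () // would run one scale-up/scale-down pass
    }
    scheduler.scheduleWithFixedDelay(task, 0, 100, TimeUnit.MILLISECONDS)
  }

  // Called from SparkContext.stop(); without this, the thread outlives the
  // context and keeps logging, which is exactly what the ticket observes.
  def stop(): Unit = scheduler.shutdownNow()
}
{code}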
[jira] [Created] (SPARK-6653) New configuration property to specify port for sparkYarnAM actor system
Manoj Samel created SPARK-6653: -- Summary: New configuration property to specify port for sparkYarnAM actor system Key: SPARK-6653 URL: https://issues.apache.org/jira/browse/SPARK-6653 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.0 Environment: Spark on YARN Reporter: Manoj Samel In the 1.3.0 code line, the sparkYarnAM actor system is started on a random port. See org.apache.spark.deploy.yarn ApplicationMaster.scala:282 actorSystem = AkkaUtils.createActorSystem("sparkYarnAM", Utils.localHostName, 0, conf = sparkConf, securityManager = securityMgr)._1 This may be an issue when ports between the Spark client and the YARN cluster are limited by a firewall and not all ports are open between the client and the YARN cluster. The proposal is to introduce a new property spark.am.actor.port and change the code to val port = sparkConf.getInt("spark.am.actor.port", 0) actorSystem = AkkaUtils.createActorSystem("sparkYarnAM", Utils.localHostName, port, conf = sparkConf, securityManager = securityMgr)._1
[jira] [Assigned] (SPARK-6655) We need to read the schema of a data source table stored in spark.sql.sources.schema property
[ https://issues.apache.org/jira/browse/SPARK-6655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6655: --- Assignee: Apache Spark (was: Yin Huai) We need to read the schema of a data source table stored in spark.sql.sources.schema property - Key: SPARK-6655 URL: https://issues.apache.org/jira/browse/SPARK-6655 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Yin Huai Assignee: Apache Spark Priority: Blocker Fix For: 1.3.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6655) We need to read the schema of a data source table stored in spark.sql.sources.schema property
[ https://issues.apache.org/jira/browse/SPARK-6655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6655: --- Assignee: Yin Huai (was: Apache Spark) We need to read the schema of a data source table stored in spark.sql.sources.schema property - Key: SPARK-6655 URL: https://issues.apache.org/jira/browse/SPARK-6655 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.3.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6655) We need to read the schema of a data source table stored in spark.sql.sources.schema property
[ https://issues.apache.org/jira/browse/SPARK-6655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391282#comment-14391282 ] Apache Spark commented on SPARK-6655: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/5313 We need to read the schema of a data source table stored in spark.sql.sources.schema property - Key: SPARK-6655 URL: https://issues.apache.org/jira/browse/SPARK-6655 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.3.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()
[ https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-5960: Target Version/s: 1.4.0 (was: 1.3.1) Allow AWS credentials to be passed to KinesisUtils.createStream() - Key: SPARK-5960 URL: https://issues.apache.org/jira/browse/SPARK-5960 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Chris Fregly Assignee: Chris Fregly While IAM roles are preferable, we're seeing a lot of cases where we need to pass AWS credentials when creating the KinesisReceiver. Notes: * Make sure we don't log the credentials anywhere * Maintain compatibility with existing KinesisReceiver-based code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
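As a concrete illustration of the request, a hedged sketch of the call shape (the two trailing credential parameters are the proposed addition and an assumption of this sketch, not a committed API; the other arguments follow the 1.x Kinesis ASL createStream):
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

val ssc = new StreamingContext(new SparkConf().setAppName("kinesis-demo"), Seconds(2))

// Read credentials from the environment so they never appear in code or logs.
val awsAccessKeyId = sys.env("AWS_ACCESS_KEY_ID")
val awsSecretKey = sys.env("AWS_SECRET_KEY")

val stream = KinesisUtils.createStream(
  ssc, "myKinesisStream", "https://kinesis.us-east-1.amazonaws.com",
  Seconds(2), InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2,
  awsAccessKeyId, awsSecretKey) // hypothetical parameters; IAM roles remain the default
{code}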
[jira] [Created] (SPARK-6654) Update Kinesis Streaming impls (both KCL-based and Direct) to use latest aws-java-sdk and kinesis-client-library
Chris Fregly created SPARK-6654: --- Summary: Update Kinesis Streaming impls (both KCL-based and Direct) to use latest aws-java-sdk and kinesis-client-library Key: SPARK-6654 URL: https://issues.apache.org/jira/browse/SPARK-6654 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Chris Fregly -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6656) Allow the application name to be passed in versus pulling from SparkContext.getAppName()
Chris Fregly created SPARK-6656: --- Summary: Allow the application name to be passed in versus pulling from SparkContext.getAppName() Key: SPARK-6656 URL: https://issues.apache.org/jira/browse/SPARK-6656 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Chris Fregly This is useful for the scenario where Kinesis Spark Streaming is being invoked from the Spark Shell. In this case, the application name in the SparkContext is pre-set to Spark Shell. This isn't a common or recommended use case, but it's best to make this configurable outside of SparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
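A one-line sketch of the requested behavior, with hypothetical names (not a committed API): an explicitly passed name should win, with the SparkContext name as the fallback.
{code}
import org.apache.spark.SparkContext

// Hypothetical helper: an explicit name wins; otherwise keep today's behavior
// of using the context's application name (e.g. "Spark Shell" in the shell).
def resolveKinesisAppName(explicitName: Option[String], sc: SparkContext): String =
  explicitName.getOrElse(sc.appName)
{code}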
[jira] [Updated] (SPARK-4184) Improve Spark Streaming documentation to address commonly-asked questions
[ https://issues.apache.org/jira/browse/SPARK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-4184: Target Version/s: 1.4.0 (was: 1.3.1) Improve Spark Streaming documentation to address commonly-asked questions -- Key: SPARK-4184 URL: https://issues.apache.org/jira/browse/SPARK-4184 Project: Spark Issue Type: Documentation Components: Streaming Reporter: Chris Fregly Labels: documentation, streaming Improve Streaming documentation including API descriptions, concurrency/thread safety, fault tolerance, replication, checkpointing, scalability, resource allocation and utilization, back pressure, and monitoring. Also, add a section to the Kinesis streaming guide describing how to use IAM roles with the Spark Kinesis Receiver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6655) We need to read the schema of a data source table stored in spark.sql.sources.schema property
Yin Huai created SPARK-6655: --- Summary: We need to read the schema of a data source table stored in spark.sql.sources.schema property Key: SPARK-6655 URL: https://issues.apache.org/jira/browse/SPARK-6655 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.3.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6651) Delegate dense vector arithmetics to the underlying numpy array
[ https://issues.apache.org/jira/browse/SPARK-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6651: --- Assignee: Apache Spark (was: Xiangrui Meng) Delegate dense vector arithmetics to the underlying numpy array Key: SPARK-6651 URL: https://issues.apache.org/jira/browse/SPARK-6651 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Apache Spark It is convenient to delegate dense linear algebra operations to numpy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6651) Delegate dense vector arithmetics to the underlying numpy array
[ https://issues.apache.org/jira/browse/SPARK-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391020#comment-14391020 ] Apache Spark commented on SPARK-6651: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/5312 Delegate dense vector arithmetics to the underlying numpy array Key: SPARK-6651 URL: https://issues.apache.org/jira/browse/SPARK-6651 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng It is convenient to delegate dense linear algebra operations to numpy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6651) Delegate dense vector arithmetics to the underlying numpy array
[ https://issues.apache.org/jira/browse/SPARK-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6651: --- Assignee: Xiangrui Meng (was: Apache Spark) Delegate dense vector arithmetics to the underlying numpy array Key: SPARK-6651 URL: https://issues.apache.org/jira/browse/SPARK-6651 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng It is convenient to delegate dense linear algebra operations to numpy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5972) Cache residuals for GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391521#comment-14391521 ] Manoj Kumar commented on SPARK-5972: [~josephkb] This should be done independently of evaluateEachIteration, right? (That is, evaluateEachIteration should not be used in the GradientBoostedTrees code that caches the error and residuals, since the model has not been trained yet.) Cache residuals for GradientBoostedTrees during training Key: SPARK-5972 URL: https://issues.apache.org/jira/browse/SPARK-5972 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor In gradient boosting, the current model's prediction is re-computed for each training instance on every iteration. The current residual (cumulative prediction of previously trained trees in the ensemble) should be cached. That could reduce both computation (only computing the prediction of the most recently trained tree) and communication (only sending the most recently trained tree to the workers). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6657) Fix Python doc build warnings
Joseph K. Bradley created SPARK-6657: Summary: Fix Python doc build warnings Key: SPARK-6657 URL: https://issues.apache.org/jira/browse/SPARK-6657 Project: Spark Issue Type: Documentation Components: Documentation, MLlib, PySpark, SQL, Streaming Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Reported by [~rxin] {code} /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:15: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:16: WARNING: Block quote ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:18: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:22: WARNING: Definition list ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:28: WARNING: Definition list ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:13: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:14: WARNING: Block quote ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:16: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:18: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.collect:1: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.orderBy:3: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.sort:3: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.take:1: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title underline too short. pyspark.streaming.kafka module /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title underline too short. pyspark.streaming.kafka module {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5972) Cache residuals for GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391536#comment-14391536 ] Joseph K. Bradley commented on SPARK-5972: -- They should be at least partly separate, in that evaluateEachIteration itself will not be used for this. But this JIRA and evaluateEachIteration might be able to share some code to avoid code duplication. Cache residuals for GradientBoostedTrees during training Key: SPARK-5972 URL: https://issues.apache.org/jira/browse/SPARK-5972 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor In gradient boosting, the current model's prediction is re-computed for each training instance on every iteration. The current residual (cumulative prediction of previously trained trees in the ensemble) should be cached. That could reduce both computation (only computing the prediction of the most recently trained tree) and communication (only sending the most recently trained tree to the workers). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
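To make the proposed optimization concrete, a hedged sketch in plain Scala (illustrative only, not Spark's GradientBoostedTrees code): cache the cumulative prediction per instance and update it with only the newest tree, so each iteration scores one tree instead of re-scoring the whole ensemble.
{code}
case class Instance(label: Double, features: Array[Double])
trait RegressionTree { def predict(features: Array[Double]): Double }

// fitTree stands in for training a regression tree on (features, target) pairs.
def boost(data: Seq[Instance], numIterations: Int, learningRate: Double,
          fitTree: Seq[(Array[Double], Double)] => RegressionTree): Seq[RegressionTree] = {
  // Cached cumulative prediction for every training instance.
  var cumPrediction = Array.fill(data.length)(0.0)
  val trees = Seq.newBuilder[RegressionTree]
  for (_ <- 0 until numIterations) {
    // Residuals come from the cache, not from re-scoring all previous trees.
    val targets = data.zip(cumPrediction).map { case (inst, p) => (inst.features, inst.label - p) }
    val tree = fitTree(targets)
    trees += tree
    // Update the cache with only the newest tree's contribution.
    cumPrediction = data.zip(cumPrediction).map { case (inst, p) =>
      p + learningRate * tree.predict(inst.features)
    }.toArray
  }
  trees.result()
}
{code}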
[jira] [Updated] (SPARK-6658) Incorrect DataFrame Documentation Type References
[ https://issues.apache.org/jira/browse/SPARK-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chet Mancini updated SPARK-6658: Priority: Trivial (was: Major) Incorrect DataFrame Documentation Type References - Key: SPARK-6658 URL: https://issues.apache.org/jira/browse/SPARK-6658 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Chet Mancini Priority: Trivial Labels: docuentation Original Estimate: 5m Remaining Estimate: 5m A few methods under DataFrame incorrectly refer to the receiver as an RDD in their documentation. * createJDBCTable * insertIntoJDBC * registerTempTable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6658) Incorrect DataFrame Documentation Type References
Chet Mancini created SPARK-6658: --- Summary: Incorrect DataFrame Documentation Type References Key: SPARK-6658 URL: https://issues.apache.org/jira/browse/SPARK-6658 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Chet Mancini A few methods under DataFrame incorrectly refer to the receiver as an RDD in their documentation. * createJDBCTable * insertIntoJDBC * registerTempTable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6658) Incorrect DataFrame Documentation Type References
[ https://issues.apache.org/jira/browse/SPARK-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chet Mancini updated SPARK-6658: Labels: documentation (was: docuentation) Incorrect DataFrame Documentation Type References - Key: SPARK-6658 URL: https://issues.apache.org/jira/browse/SPARK-6658 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Chet Mancini Priority: Trivial Labels: documentation Original Estimate: 5m Remaining Estimate: 5m A few methods under DataFrame incorrectly refer to the receiver as an RDD in their documentation. * createJDBCTable * insertIntoJDBC * registerTempTable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6659) Spark SQL 1.3 cannot read a json file that contains only a single record.
luochenghui created SPARK-6659: -- Summary: Spark SQL 1.3 cannot read a json file that contains only a single record Key: SPARK-6659 URL: https://issues.apache.org/jira/browse/SPARK-6659 Project: Spark Issue Type: Bug Reporter: luochenghui
Dear friends:
Spark SQL 1.3 cannot read a json file that contains only a single record. Here is my json file's content:
{noformat}
{name:milo,age,24}
{noformat}
When I run Spark SQL in local mode, it throws an exception:
org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns _corrupt_record;
What I had done:
1. ./spark-shell
2.
{noformat}
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@5f3be6c8
scala> val df = sqlContext.jsonFile("/home/milo/person.json")
15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=0, maxMem=280248975
15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 159.9 KB, free 267.1 MB)
15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(22692) called with curMem=163705, maxMem=280248975
15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 267.1 MB)
15/03/19 22:11:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:35842 (size: 22.2 KB, free: 267.2 MB)
15/03/19 22:11:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/03/19 22:11:45 INFO SparkContext: Created broadcast 0 from textFile at JSONRelation.scala:98
15/03/19 22:11:47 INFO FileInputFormat: Total input paths to process : 1
15/03/19 22:11:47 INFO SparkContext: Starting job: reduce at JsonRDD.scala:51
15/03/19 22:11:47 INFO DAGScheduler: Got job 0 (reduce at JsonRDD.scala:51) with 1 output partitions (allowLocal=false)
15/03/19 22:11:47 INFO DAGScheduler: Final stage: Stage 0(reduce at JsonRDD.scala:51)
15/03/19 22:11:47 INFO DAGScheduler: Parents of final stage: List()
15/03/19 22:11:47 INFO DAGScheduler: Missing parents: List()
15/03/19 22:11:47 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[3] at map at JsonRDD.scala:51), which has no missing parents
15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(3184) called with curMem=186397, maxMem=280248975
15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 267.1 MB)
15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(2251) called with curMem=189581, maxMem=280248975
15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.2 KB, free 267.1 MB)
15/03/19 22:11:47 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:35842 (size: 2.2 KB, free: 267.2 MB)
15/03/19 22:11:47 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/03/19 22:11:47 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839
15/03/19 22:11:48 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MapPartitionsRDD[3] at map at JsonRDD.scala:51)
15/03/19 22:11:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/03/19 22:11:48 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1291 bytes)
15/03/19 22:11:48 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/03/19 22:11:48 INFO HadoopRDD: Input split: file:/home/milo/person.json:0+26
15/03/19 22:11:48 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/03/19 22:11:48 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/03/19 22:11:48 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/03/19 22:11:48 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/03/19 22:11:48 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/03/19 22:11:49 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2023 bytes result sent to driver
15/03/19 22:11:49 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1209 ms on localhost (1/1)
15/03/19 22:11:49 INFO DAGScheduler: Stage 0 (reduce at JsonRDD.scala:51) finished in 1.308 s
15/03/19 22:11:49 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/03/19 22:11:49 INFO DAGScheduler: Job 0 finished: reduce at JsonRDD.scala:51, took 2.002429 s
df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
{noformat}
3.
{noformat}
scala> df.select("name").show()
15/03/19 22:12:44 INFO BlockManager: Removing broadcast 1
15/03/19 22:12:44 INFO BlockManager: Removing block broadcast_1_piece0
15/03/19 22:12:44 INFO MemoryStore: Block broadcast_1_piece0 of size 2251 dropped from memory (free 280059394)
15/03/19 22:12:44 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:35842 in memory (size: 2.2 KB,
{noformat}
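For context on the failure above: Spark's json reader expects one complete, valid JSON object per line and routes anything that fails to parse into the _corrupt_record column. The record shown in the report appears to be invalid JSON (note the `age,24` where `age:24` would be required; quoting may also have been stripped by the mail digest). A minimal sketch of a session that would succeed, assuming a corrected single-record file:
{code}
// person.json must contain one valid JSON object per line, e.g.:
// {"name":"milo","age":24}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.jsonFile("/home/milo/person.json")
df.printSchema()          // age: long, name: string -- no _corrupt_record column
df.select("name").show()  // resolves once the schema is inferred from valid JSON
{code}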
[jira] [Assigned] (SPARK-6658) Incorrect DataFrame Documentation Type References
[ https://issues.apache.org/jira/browse/SPARK-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6658: --- Assignee: Apache Spark Incorrect DataFrame Documentation Type References - Key: SPARK-6658 URL: https://issues.apache.org/jira/browse/SPARK-6658 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Chet Mancini Assignee: Apache Spark Priority: Trivial Labels: documentation Original Estimate: 5m Remaining Estimate: 5m A few methods under DataFrame incorrectly refer to the receiver as an RDD in their documentation. * createJDBCTable * insertIntoJDBC * registerTempTable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6658) Incorrect DataFrame Documentation Type References
[ https://issues.apache.org/jira/browse/SPARK-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391563#comment-14391563 ] Apache Spark commented on SPARK-6658: - User 'chetmancini' has created a pull request for this issue: https://github.com/apache/spark/pull/5316 Incorrect DataFrame Documentation Type References - Key: SPARK-6658 URL: https://issues.apache.org/jira/browse/SPARK-6658 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Chet Mancini Priority: Trivial Labels: documentation Original Estimate: 5m Remaining Estimate: 5m A few methods under DataFrame incorrectly refer to the receiver as an RDD in their documentation. * createJDBCTable * insertIntoJDBC * registerTempTable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6658) Incorrect DataFrame Documentation Type References
[ https://issues.apache.org/jira/browse/SPARK-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6658: --- Assignee: (was: Apache Spark) Incorrect DataFrame Documentation Type References - Key: SPARK-6658 URL: https://issues.apache.org/jira/browse/SPARK-6658 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Chet Mancini Priority: Trivial Labels: documentation Original Estimate: 5m Remaining Estimate: 5m A few methods under DataFrame incorrectly refer to the receiver as an RDD in their documentation. * createJDBCTable * insertIntoJDBC * registerTempTable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5989) Model import/export for LDAModel
[ https://issues.apache.org/jira/browse/SPARK-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390209#comment-14390209 ] Manoj Kumar edited comment on SPARK-5989 at 4/1/15 10:04 PM: - [~josephkb] Can this be assigned to me? Thanks! was (Author: mechcoder): Can this be assigned to me? Thanks! Model import/export for LDAModel Key: SPARK-5989 URL: https://issues.apache.org/jira/browse/SPARK-5989 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Add save/load for LDAModel and its local and distributed variants. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6657) Fix Python doc build warnings
[ https://issues.apache.org/jira/browse/SPARK-6657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6657: --- Assignee: Apache Spark (was: Joseph K. Bradley) Fix Python doc build warnings - Key: SPARK-6657 URL: https://issues.apache.org/jira/browse/SPARK-6657 Project: Spark Issue Type: Documentation Components: Documentation, MLlib, PySpark, SQL, Streaming Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Apache Spark Priority: Trivial Reported by [~rxin] {code} /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:15: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:16: WARNING: Block quote ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:18: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:22: WARNING: Definition list ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:28: WARNING: Definition list ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:13: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:14: WARNING: Block quote ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:16: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:18: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.collect:1: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.orderBy:3: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.sort:3: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.take:1: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title underline too short. pyspark.streaming.kafka module /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title underline too short. pyspark.streaming.kafka module {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6657) Fix Python doc build warnings
[ https://issues.apache.org/jira/browse/SPARK-6657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391593#comment-14391593 ] Apache Spark commented on SPARK-6657: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/5317 Fix Python doc build warnings - Key: SPARK-6657 URL: https://issues.apache.org/jira/browse/SPARK-6657 Project: Spark Issue Type: Documentation Components: Documentation, MLlib, PySpark, SQL, Streaming Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Reported by [~rxin] {code} /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:15: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:16: WARNING: Block quote ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:18: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:22: WARNING: Definition list ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:28: WARNING: Definition list ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:13: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:14: WARNING: Block quote ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:16: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:18: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.collect:1: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.orderBy:3: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.sort:3: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.take:1: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title underline too short. pyspark.streaming.kafka module /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title underline too short. pyspark.streaming.kafka module {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6657) Fix Python doc build warnings
[ https://issues.apache.org/jira/browse/SPARK-6657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6657: --- Assignee: Joseph K. Bradley (was: Apache Spark) Fix Python doc build warnings - Key: SPARK-6657 URL: https://issues.apache.org/jira/browse/SPARK-6657 Project: Spark Issue Type: Documentation Components: Documentation, MLlib, PySpark, SQL, Streaming Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Reported by [~rxin] {code} /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:15: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:16: WARNING: Block quote ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:18: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:22: WARNING: Definition list ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:28: WARNING: Definition list ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:13: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:14: WARNING: Block quote ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:16: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:18: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.collect:1: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.orderBy:3: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.sort:3: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.take:1: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title underline too short. pyspark.streaming.kafka module /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title underline too short. pyspark.streaming.kafka module {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6650) ExecutorAllocationManager never stops
Marcelo Vanzin created SPARK-6650: - Summary: ExecutorAllocationManager never stops Key: SPARK-6650 URL: https://issues.apache.org/jira/browse/SPARK-6650 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Marcelo Vanzin {{ExecutorAllocationManager}} doesn't even have a stop() method. That means that when the owning SparkContext goes away, the internal thread it uses to schedule its activities remains alive. That means it constantly spams the logs and does who knows what else that could affect any future contexts that are allocated. It's particularly evil during unit tests, since it slows down everything else after the suite is run, leaving multiple threads behind. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391335#comment-14391335 ] Deenar Toraskar commented on SPARK-6646: maybe Spark 2.0 should be branded i-Spark Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6642) Change the lambda weight to number of explicit ratings in implicit ALS
[ https://issues.apache.org/jira/browse/SPARK-6642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391412#comment-14391412 ] Apache Spark commented on SPARK-6642: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/5314 Change the lambda weight to number of explicit ratings in implicit ALS -- Key: SPARK-6642 URL: https://issues.apache.org/jira/browse/SPARK-6642 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Until SPARK-6637 is resolved, we should switch back to the 1.2 lambda weighting strategy to be consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6642) Change the lambda weight to number of explicit ratings in implicit ALS
[ https://issues.apache.org/jira/browse/SPARK-6642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6642: --- Assignee: Xiangrui Meng (was: Apache Spark) Change the lambda weight to number of explicit ratings in implicit ALS -- Key: SPARK-6642 URL: https://issues.apache.org/jira/browse/SPARK-6642 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Until SPARK-6637 is resolved, we should switch back to the 1.2 lambda weighting strategy to be consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6642) Change the lambda weight to number of explicit ratings in implicit ALS
[ https://issues.apache.org/jira/browse/SPARK-6642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6642: --- Assignee: Apache Spark (was: Xiangrui Meng) Change the lambda weight to number of explicit ratings in implicit ALS -- Key: SPARK-6642 URL: https://issues.apache.org/jira/browse/SPARK-6642 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Apache Spark Until SPARK-6637 is resolved, we should switch back to the 1.2 lambda weighting strategy to be consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
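For readers unfamiliar with the term, a rough illustration of what "lambda weighting" refers to (a hedged sketch of ALS-WR-style regularization, not Spark's solver): each user's or item's penalty is scaled by a per-factor count, and this ticket changes that count back to the number of explicit ratings, as in 1.2.
{code}
// Illustrative only: a weighted-lambda regularization term.
// factorNorms pairs each user/item factor's squared norm with the count used
// as its weight; per this ticket, that count is the number of explicit ratings.
def regularizationTerm(lambda: Double, factorNorms: Seq[(Double, Int)]): Double =
  lambda * factorNorms.map { case (normSq, numRatings) => numRatings * normSq }.sum
{code}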
[jira] [Commented] (SPARK-6373) Add SSL/TLS for the Netty based BlockTransferService
[ https://issues.apache.org/jira/browse/SPARK-6373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391452#comment-14391452 ] Jeffrey Turpin commented on SPARK-6373: --- Hey Aaron, Sorry for the delay... I have cleaned things up a bit and refactored the implementation to be more inline with our earlier conversation... Have a look at https://github.com/turp1twin/spark/commit/d976a7ab9b57e26fc180d649fd084a6acb9d027e and let me know your thoughts... Jeff Add SSL/TLS for the Netty based BlockTransferService - Key: SPARK-6373 URL: https://issues.apache.org/jira/browse/SPARK-6373 Project: Spark Issue Type: New Feature Components: Block Manager, Shuffle Affects Versions: 1.2.1 Reporter: Jeffrey Turpin Add the ability to allow for secure communications (SSL/TLS) for the Netty based BlockTransferService and the ExternalShuffleClient. This ticket will hopefully start the conversation around potential designs... Below is a reference to a WIP prototype which implements this functionality (prototype)... I have attempted to disrupt as little code as possible and tried to follow the current code structure (for the most part) in the areas I modified. I also studied how Hadoop achieves encrypted shuffle (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html) https://github.com/turp1twin/spark/commit/024b559f27945eb63068d1badf7f82e4e7c3621c -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
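As background on the general technique being discussed (a sketch under assumptions, not the linked prototype): in Netty, TLS is enabled by installing an SslHandler at the head of a channel's pipeline so that every inbound and outbound byte is encrypted. The snippet assumes a recent Netty 4.1 with SslContextBuilder.
{code}
import java.io.File
import io.netty.channel.socket.SocketChannel
import io.netty.handler.ssl.{SslContext, SslContextBuilder}

// Generic Netty pattern: the SslHandler must sit first in the pipeline so the
// encoder/decoder and framing handlers only ever see decrypted bytes.
def installServerTls(ch: SocketChannel, certChain: File, privateKey: File): Unit = {
  val sslCtx: SslContext = SslContextBuilder.forServer(certChain, privateKey).build()
  ch.pipeline().addFirst("ssl", sslCtx.newHandler(ch.alloc()))
}
{code}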
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391456#comment-14391456 ] Matei Zaharia commented on SPARK-6646: -- Not to rain on the parade here, but I worry that focusing on mobile phones is short-sighted. Does this design present a path forward for the Internet of Things as well? You'd want something that runs on Arduino, Raspberry Pi, etc. We already have MQTT input in Spark Streaming so we could consider using MQTT to replace Netty for shuffle as well. Has anybody benchmarked that? Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6651) Delegate dense vector arithmetics to the underlying numpy array
[ https://issues.apache.org/jira/browse/SPARK-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6651. -- Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Issue resolved by pull request 5312 [https://github.com/apache/spark/pull/5312] Delegate dense vector arithmetics to the underlying numpy array Key: SPARK-6651 URL: https://issues.apache.org/jira/browse/SPARK-6651 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.3.1, 1.4.0 It is convenient to delegate dense linear algebra operations to numpy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6658) Incorrect DataFrame Documentation Type References
[ https://issues.apache.org/jira/browse/SPARK-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chet Mancini resolved SPARK-6658. - Resolution: Implemented Incorrect DataFrame Documentation Type References - Key: SPARK-6658 URL: https://issues.apache.org/jira/browse/SPARK-6658 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Chet Mancini Priority: Trivial Labels: documentation Original Estimate: 5m Remaining Estimate: 5m A few methods under DataFrame incorrectly refer to the receiver as an RDD in their documentation. * createJDBCTable * insertIntoJDBC * registerTempTable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391623#comment-14391623 ] Neelesh Srinivas Salian commented on SPARK-2243: I hit this error. Simply closed the previous context. Any other workaround? Support multiple SparkContexts in the same JVM -- Key: SPARK-2243 URL: https://issues.apache.org/jira/browse/SPARK-2243 Project: Spark Issue Type: New Feature Components: Block Manager, Spark Core Affects Versions: 0.7.0, 1.0.0, 1.1.0 Reporter: Miguel Angel Fernandez Diaz We're developing a platform where we create several Spark contexts for carrying out different calculations. Is there any restriction when using several Spark contexts? We have two contexts, one for Spark calculations and another one for Spark Streaming jobs. The next error arises when we first execute a Spark calculation and, once the execution is finished, a Spark Streaming job is launched: {code} 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) 14/06/23 
16:40:08 WARN scheduler.TaskSetManager: Loss was due to java.io.FileNotFoundException java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391629#comment-14391629 ] Sean Owen commented on SPARK-2243: -- Sorry to be flippant but really the answer is to not make multiple SparkContexts. Simply run in separate JVMs, or share access to one SparkContext in the JVM. Support multiple SparkContexts in the same JVM -- Key: SPARK-2243 URL: https://issues.apache.org/jira/browse/SPARK-2243 Project: Spark Issue Type: New Feature Components: Block Manager, Spark Core Affects Versions: 0.7.0, 1.0.0, 1.1.0 Reporter: Miguel Angel Fernandez Diaz We're developing a platform where we create several Spark contexts for carrying out different calculations. Is there any restriction when using several Spark contexts? We have two contexts, one for Spark calculations and another one for Spark Streaming jobs. The next error arises when we first execute a Spark calculation and, once the execution is finished, a Spark Streaming job is launched: {code} 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/06/23 
16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to java.io.FileNotFoundException java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at
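A small sketch of the pattern recommended above (hypothetical helper, not a Spark 1.x API): keep a single SparkContext per JVM and share it, rather than constructing a second one.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// One context per JVM; batch jobs and a StreamingContext can both build on it.
object SharedSparkContext {
  @volatile private var instance: SparkContext = _

  def getOrCreate(conf: SparkConf): SparkContext = synchronized {
    if (instance == null) instance = new SparkContext(conf)
    instance
  }
}
{code}
A StreamingContext can then wrap the shared context (new StreamingContext(sc, Seconds(1))) instead of creating its own SparkContext, which avoids the broadcast failure shown above.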
[jira] [Assigned] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5556: --- Assignee: Pedro Rodriguez (was: Apache Spark) Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5556: --- Assignee: Apache Spark (was: Pedro Rodriguez) Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391636#comment-14391636 ] Tathagata Das commented on SPARK-6646: -- I vehemently disagree. I don't think we should choose names that subtly indicate Spark runs only on the iPhone. That is frankly not true. We want to embrace all platforms without any bias. Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391683#comment-14391683 ] Venkat Krishnamurthy commented on SPARK-6646: - I'm looking forward to the release that targets smart watches. It could have the pleasant side effect of making time stand still while executors crunch away in the background, obviating any need for performance tuning. Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6660) MLLibPythonAPI.pythonToJava doesn't recognize object arrays
Xiangrui Meng created SPARK-6660: Summary: MLLibPythonAPI.pythonToJava doesn't recognize object arrays Key: SPARK-6660 URL: https://issues.apache.org/jira/browse/SPARK-6660 Project: Spark Issue Type: Bug Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical
{code}
points = MLUtils.loadLabeledPoints(sc, ...)
_to_java_object_rdd(points).count()
{code}
throws the following exception:
{code}
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-22-5b481e99a111> in <module>()
----> 1 jrdd.count()

/home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539
    540         for temp_arg in temp_args:

/home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling o510.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 18 in stage 114.0 failed 4 times, most recent failure: Lost task 18.3 in stage 114.0 (TID 1133, ip-10-0-130-35.us-west-2.compute.internal): java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList
    at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1090)
    at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1087)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1472)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1006)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1006)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1497)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1497)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
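The failing cast is visible in the trace: SerDe.pythonToJava assumes every unpickled batch is a java.util.ArrayList, but these batches unpickle as Java object arrays ([Ljava.lang.Object;). A plain-Python analogy of the bug and of the defensive fix; this is an illustration only, not Spark's actual SerDe code:
{code}
def count_batch_fragile(batch):
    # Mirrors the hard ArrayList cast: only one concrete container works.
    if not isinstance(batch, list):
        raise TypeError("expected list, got %s" % type(batch).__name__)
    return len(batch)

def count_batch_robust(batch):
    # Accept any sequence (list, tuple, array) and normalize it first,
    # the same shape of fix as teaching SerDe to recognize object arrays.
    return len(list(batch))

print(count_batch_fragile([1, 2, 3]))   # fine: a list
print(count_batch_robust((1, 2, 3)))    # fine: tuple normalized
try:
    count_batch_fragile((1, 2, 3))      # fails, like the ClassCastException
except TypeError as e:
    print("fragile path failed:", e)
{code}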
[jira] [Resolved] (SPARK-6578) Outbound channel in network library is not thread-safe, can lead to fetch failures
[ https://issues.apache.org/jira/browse/SPARK-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-6578. Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Outbound channel in network library is not thread-safe, can lead to fetch failures -- Key: SPARK-6578 URL: https://issues.apache.org/jira/browse/SPARK-6578 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Blocker Fix For: 1.3.1, 1.4.0 There is a very narrow race in the outbound channel of the network library. While netty guarantees that the inbound channel is thread-safe, the same is not true for the outbound channel: multiple threads can be writing and running the pipeline at the same time. This leads to an issue with MessageEncoder and the optimization it performs for zero-copy of file data: since a single RPC can be broken into multiple buffers (for example, when replying to a chunk request), if you have multiple threads writing these RPCs then they can be mixed up in the final socket. That breaks framing and will cause the receiving side to not understand the messages. Patch coming up shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
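The race described here is easy to demonstrate outside netty. A toy Python sketch of the framing hazard, not the network library's code: two writer threads each emit one logical message as several buffers; without serialized writes the buffers can interleave on the shared stream and framing breaks, which is why the fix funnels all outbound writes through a single writer.
{code}
import io
import threading

stream = io.BytesIO()      # stands in for the shared outbound socket
lock = threading.Lock()

def send_rpc(tag, serialize):
    # One logical RPC written as several buffers (header, body, trailer),
    # mirroring how MessageEncoder splits a message for zero-copy transfer.
    parts = [b"<" + tag, b"payload", tag + b">"]
    if serialize:
        with lock:                 # one writer at a time: frames stay intact
            for p in parts:
                stream.write(p)
    else:
        for p in parts:            # unsynchronized: buffers from different
            stream.write(p)        # RPCs may interleave, corrupting framing

def run(serialize, writers=20):
    stream.seek(0)
    stream.truncate()
    threads = [threading.Thread(target=send_rpc, args=(tag, serialize))
               for tag in (b"A", b"B") for _ in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stream.getvalue()

print(run(serialize=True)[:48])    # always clean <A...A> / <B...B> frames
print(run(serialize=False)[:48])   # may show interleaved, unparseable frames
{code}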