[jira] [Updated] (SPARK-6667) hang while collecting in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-6667: -- Affects Version/s: 1.4.0, 1.3.1 hang while collecting in PySpark - Key: SPARK-6667 URL: https://issues.apache.org/jira/browse/SPARK-6667 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1, 1.4.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Critical PySpark tests hang while collecting:
[jira] [Commented] (SPARK-6578) Outbound channel in network library is not thread-safe, can lead to fetch failures
[ https://issues.apache.org/jira/browse/SPARK-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391731#comment-14391731 ] Apache Spark commented on SPARK-6578: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/5319 Outbound channel in network library is not thread-safe, can lead to fetch failures -- Key: SPARK-6578 URL: https://issues.apache.org/jira/browse/SPARK-6578 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Blocker Fix For: 1.3.1, 1.4.0 There is a very narrow race in the outbound channel of the network library. While netty guarantees that the inbound channel is thread-safe, the same is not true for the outbound channel: multiple threads can be writing and running the pipeline at the same time. This leads to an issue with MessageEncoder and the optimization it performs for zero-copy of file data: since a single RPC can be broken into multiple buffers (for example, when replying to a chunk request), if you have multiple threads writing these RPCs then they can be mixed up in the final socket. That breaks framing and will cause the receiving side to not understand the messages. Patch coming up shortly.
[jira] [Commented] (SPARK-6578) Outbound channel in network library is not thread-safe, can lead to fetch failures
[ https://issues.apache.org/jira/browse/SPARK-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391737#comment-14391737 ] Reynold Xin commented on SPARK-6578: We should patch 1.2.x too. [~vanzin] mind submitting a patch for that branch?
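The hazard described above, two buffers belonging to one RPC being enqueued non-atomically, maps to a standard Netty remedy: hop onto the channel's event loop before writing. The sketch below is my own illustration of that remedy against Netty 4's public Channel API, not the code from the PR linked in the comment.

{code}
import io.netty.channel.Channel

// Sketch only: serialize multi-buffer writes through the event loop, so the
// header and body of one RPC are always enqueued back-to-back even when many
// threads send messages on the same channel.
def writeMessage(channel: Channel, header: AnyRef, body: AnyRef): Unit = {
  val doWrite = new Runnable {
    override def run(): Unit = {
      channel.write(header)        // part 1 of the frame
      channel.writeAndFlush(body)  // part 2; nothing can interleave in between
    }
  }
  if (channel.eventLoop.inEventLoop) doWrite.run()
  else channel.eventLoop.execute(doWrite) // funnel onto the single writer thread
}
{code}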
[jira] [Resolved] (SPARK-6580) Optimize LogisticRegressionModel.predictPoint
[ https://issues.apache.org/jira/browse/SPARK-6580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6580. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5249 [https://github.com/apache/spark/pull/5249] Optimize LogisticRegressionModel.predictPoint - Key: SPARK-6580 URL: https://issues.apache.org/jira/browse/SPARK-6580 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Yanbo Liang Priority: Minor Fix For: 1.4.0 LogisticRegressionModel.predictPoint could be optimized somewhat. There are several checks which could be moved outside loops, or even outside predictPoint into initialization of the model. Some include: {code} require(numFeatures == weightMatrix.size) val dataWithBiasSize = weightMatrix.size / (numClasses - 1) val weightsArray = weightMatrix match { ... if (dataMatrix.size + 1 == dataWithBiasSize) {... {code} Also, for multiclass, the two loops (over numClasses and margins) could be combined into one.
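A minimal sketch of the hoisting idea, not the actual MLlib patch: do the size checks and layout computation once at construction, leaving predictPoint with a single tight loop per class. The class and layout (one weight block per non-pivot class, with an optional trailing bias term) are assumptions for illustration.

{code}
// Sketch under an assumed layout: (numClasses - 1) weight blocks; each block
// may carry an appended bias term. All invariants are checked once, up front.
class FastLRModel(weights: Array[Double], numFeatures: Int, numClasses: Int) {
  private val stride = weights.length / (numClasses - 1)
  require(stride == numFeatures || stride == numFeatures + 1)
  private val withBias = stride == numFeatures + 1

  def predictPoint(x: Array[Double]): Double = {
    require(x.length == numFeatures)
    var bestClass = 0
    var maxMargin = 0.0 // class 0 is the pivot; it wins unless some margin is positive
    var i = 0
    while (i < numClasses - 1) { // one combined loop: margins computed and compared inline
      var margin = 0.0
      var j = 0
      while (j < numFeatures) { margin += x(j) * weights(i * stride + j); j += 1 }
      if (withBias) margin += weights(i * stride + numFeatures)
      if (margin > maxMargin) { maxMargin = margin; bestClass = i + 1 }
      i += 1
    }
    bestClass.toDouble
  }
}
{code}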
[jira] [Updated] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated SPARK-5682: Attachment: (was: Design Document of Encrypted Spark Shuffle_20150401.docx) Add encrypted shuffle in spark -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150402.docx Encrypted shuffle is enabled in Hadoop 2.6, which makes the process of shuffling data safer. This feature is necessary in Spark. AES is a specification for the encryption of electronic data; it has five common modes of operation, and CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; both are also used in Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption algorithms OpenSSL provides. Because UGI credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the Spark-on-YARN framework.
[jira] [Updated] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated SPARK-5682: Attachment: Design Document of Encrypted Spark Shuffle_20150402.docx
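For readers unfamiliar with the CTR mode the two codecs wrap, here is a minimal JCE round trip using only the plain JDK provider. It is independent of whatever codec API the design documents propose; the key and IV values are demo placeholders.

{code}
import javax.crypto.Cipher
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

// AES/CTR with the JDK provider, the primitive JceAesCtrCryptoCodec builds on.
val key = new SecretKeySpec(Array.fill[Byte](16)(0x01), "AES") // demo key only, never hard-code one
val iv  = new IvParameterSpec(Array.fill[Byte](16)(0x02))      // demo counter block

val enc = Cipher.getInstance("AES/CTR/NoPadding")
enc.init(Cipher.ENCRYPT_MODE, key, iv)
val cipherText = enc.doFinal("shuffle block bytes".getBytes("UTF-8"))

// CTR is symmetric: decrypting with the same key and counter block restores the plaintext.
val dec = Cipher.getInstance("AES/CTR/NoPadding")
dec.init(Cipher.DECRYPT_MODE, key, iv)
assert(new String(dec.doFinal(cipherText), "UTF-8") == "shuffle block bytes")
{code}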
[jira] [Resolved] (SPARK-6660) MLLibPythonAPI.pythonToJava doesn't recognize object arrays
[ https://issues.apache.org/jira/browse/SPARK-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6660. -- Resolution: Fixed Fix Version/s: 1.4.0, 1.3.1 Issue resolved by pull request 5318 [https://github.com/apache/spark/pull/5318] MLLibPythonAPI.pythonToJava doesn't recognize object arrays --- Key: SPARK-6660 URL: https://issues.apache.org/jira/browse/SPARK-6660 Project: Spark Issue Type: Bug Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.3.1, 1.4.0 {code} points = MLUtils.loadLabeledPoints(sc, ...) _to_java_object_rdd(points).count() {code} throws exception {code} --- Py4JJavaError Traceback (most recent call last) <ipython-input-22-5b481e99a111> in <module>() ----> 1 jrdd.count() /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args) 536 answer = self.gateway_client.send_command(command) 537 return_value = get_return_value(answer, self.gateway_client, --> 538 self.target_id, self.name) 539 540 for temp_arg in temp_args: /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 298 raise Py4JJavaError( 299 'An error occurred while calling {0}{1}{2}.\n'. --> 300 format(target_id, '.', name), value) 301 else: 302 raise Py4JError( Py4JJavaError: An error occurred while calling o510.count. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 18 in stage 114.0 failed 4 times, most recent failure: Lost task 18.3 in stage 114.0 (TID 1133, ip-10-0-130-35.us-west-2.compute.internal): java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1090) at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1087) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1472) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1006) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1006) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1497) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1497) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) {code}
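The ClassCastException comes from unconditionally casting each unpickled batch to java.util.ArrayList. A hedged sketch of a more tolerant deserialization follows; the method and variable names are mine for illustration, not the ones in pull request 5318.

{code}
import scala.collection.JavaConverters._

// Accept both shapes the pickler can hand back: a java.util.List for most
// batches, or a raw Object[] for object arrays, the case the old cast missed.
def batchToIterator(obj: Any): Iterator[Any] = obj match {
  case list: java.util.List[_] => list.asScala.iterator
  case arr: Array[_]           => arr.iterator
  case single                  => Iterator.single(single)
}
{code}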
[jira] [Commented] (SPARK-5989) Model import/export for LDAModel
[ https://issues.apache.org/jira/browse/SPARK-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392047#comment-14392047 ] Joseph K. Bradley commented on SPARK-5989: -- If there are other tasks on your plate, I would prioritize those ahead of this. Model import/export for LDAModel Key: SPARK-5989 URL: https://issues.apache.org/jira/browse/SPARK-5989 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Manoj Kumar Add save/load for LDAModel and its local and distributed variants.
[jira] [Commented] (SPARK-5989) Model import/export for LDAModel
[ https://issues.apache.org/jira/browse/SPARK-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392046#comment-14392046 ] Joseph K. Bradley commented on SPARK-5989: -- Yes, but this may be affected by this PR, which I aim to review very soon: [https://github.com/apache/spark/pull/4807]
[jira] [Reopened] (SPARK-6575) Add configuration to disable schema merging while converting metastore Parquet tables
[ https://issues.apache.org/jira/browse/SPARK-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reopened SPARK-6575: - Add configuration to disable schema merging while converting metastore Parquet tables - Key: SPARK-6575 URL: https://issues.apache.org/jira/browse/SPARK-6575 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker Fix For: 1.3.1, 1.4.0 Consider a metastore Parquet table that # doesn't have a schema evolution issue # has lots of data files and/or partitions In this case, driver-side schema merging can be both slow and unnecessary. It would be good to have a configuration to let the user disable schema merging when converting such a metastore Parquet table.
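If the patch lands as described, usage would look something like the sketch below. The exact property key is an assumption based on this ticket, so verify it against the released 1.3.1/1.4.0 documentation; the table name is hypothetical.

{code}
// Assumed knob from this ticket: skip driver-side Parquet schema merging when
// converting a metastore Parquet table without schema evolution.
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet.mergeSchema", "false")

// Subsequent lookups take the schema from the metastore instead of reading
// every data file's footer on the driver.
val orders = sqlContext.table("warehouse_orders") // hypothetical table name
orders.printSchema()
{code}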
[jira] [Updated] (SPARK-6667) hang while collecting in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-6667: -- Priority: Critical (was: Major)
[jira] [Commented] (SPARK-6660) MLLibPythonAPI.pythonToJava doesn't recognize object arrays
[ https://issues.apache.org/jira/browse/SPARK-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391720#comment-14391720 ] Apache Spark commented on SPARK-6660: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/5318
[jira] [Assigned] (SPARK-6660) MLLibPythonAPI.pythonToJava doesn't recognize object arrays
[ https://issues.apache.org/jira/browse/SPARK-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6660: --- Assignee: Xiangrui Meng (was: Apache Spark)
[jira] [Resolved] (SPARK-6576) DenseMatrix in PySpark should support indexing
[ https://issues.apache.org/jira/browse/SPARK-6576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6576. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5232 [https://github.com/apache/spark/pull/5232] DenseMatrix in PySpark should support indexing -- Key: SPARK-6576 URL: https://issues.apache.org/jira/browse/SPARK-6576 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Manoj Kumar Priority: Minor Fix For: 1.4.0
[jira] [Updated] (SPARK-6576) DenseMatrix in PySpark should support indexing
[ https://issues.apache.org/jira/browse/SPARK-6576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6576: - Assignee: Manoj Kumar
[jira] [Assigned] (SPARK-6663) Use Literal.create instead of constructor
[ https://issues.apache.org/jira/browse/SPARK-6663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6663: --- Assignee: Davies Liu (was: Apache Spark) Use Literal.create instead of constructor - Key: SPARK-6663 URL: https://issues.apache.org/jira/browse/SPARK-6663 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu In order to do type checking and conversion, we should use Literal.create() instead of the constructor to create a Literal with a DataType.
[jira] [Commented] (SPARK-6663) Use Literal.create instead of constructor
[ https://issues.apache.org/jira/browse/SPARK-6663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391861#comment-14391861 ] Apache Spark commented on SPARK-6663: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/5320
[jira] [Assigned] (SPARK-6663) Use Literal.create instead of constructor
[ https://issues.apache.org/jira/browse/SPARK-6663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6663: --- Assignee: Apache Spark (was: Davies Liu)
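In catalyst terms, the difference the ticket describes looks roughly like this. This is a sketch against the internal 1.x expression API, which is not a stable public surface; Literal.create(value, dataType) is the factory the ticket asks for.

{code}
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.IntegerType

// Literal.create checks the value against the given DataType and converts it
// to catalyst's internal representation; the bare constructor only infers a
// type from the runtime class, with no checking or conversion.
val checked  = Literal.create(1, IntegerType)
val inferred = Literal(1)
{code}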
[jira] [Assigned] (SPARK-6639) Create a new script to start multiple masters
[ https://issues.apache.org/jira/browse/SPARK-6639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6639: --- Assignee: Apache Spark Create a new script to start multiple masters - Key: SPARK-6639 URL: https://issues.apache.org/jira/browse/SPARK-6639 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 1.3.0 Environment: all Reporter: Tao Wang Assignee: Apache Spark Priority: Minor Labels: patch Original Estimate: 336h Remaining Estimate: 336h The start-slaves.sh script is able to read from the slaves file and start slave nodes on multiple boxes. However, in standalone mode, if I want to use multiple masters, I'll have to start masters on each individual box, and also need to provide the list of masters' hostname+port to each worker (start-slaves.sh only takes one master ip+port for now). I wonder, should we create a new script called start-masters.sh to read a conf/masters file? The start-slaves.sh script may also need to change a little bit so that the master list can be passed to worker nodes.
[jira] [Commented] (SPARK-6639) Create a new script to start multiple masters
[ https://issues.apache.org/jira/browse/SPARK-6639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392037#comment-14392037 ] Apache Spark commented on SPARK-6639: - User 'wangzhonnew' has created a pull request for this issue: https://github.com/apache/spark/pull/5323
[jira] [Assigned] (SPARK-6639) Create a new script to start multiple masters
[ https://issues.apache.org/jira/browse/SPARK-6639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6639: --- Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-6661) Python type errors should print type, not object
Joseph K. Bradley created SPARK-6661: Summary: Python type errors should print type, not object Key: SPARK-6661 URL: https://issues.apache.org/jira/browse/SPARK-6661 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor In MLlib PySpark, we sometimes test the type of an object and print an error if the object is of the wrong type. E.g.: [https://github.com/apache/spark/blob/f084c5de14eb10a6aba82a39e03e7877926ebb9e/python/pyspark/mllib/regression.py#L173] These checks should print the type, not the actual object. E.g., if the object cannot be converted to a string, then the check linked above will give a warning like this: {code} TypeError: not all arguments converted during string formatting {code} ...which is weird for the user. There may be other places in the codebase where this is an issue, so we need to check through and verify.
[jira] [Commented] (SPARK-6659) Spark SQL 1.3 cannot read a JSON file with only one record
[ https://issues.apache.org/jira/browse/SPARK-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391898#comment-14391898 ] Adnan Khan commented on SPARK-6659: --- {quote} Spark SQL 1.3 cannot read a JSON file with only one record. Here is my JSON file's content: \{"name":"milo","age",24\} {quote} that's invalid json. there's a colon missing between "age" and 24. i just tried it with valid json from a single record and it works. instead of {{df: org.apache.spark.sql.DataFrame = \[_corrupt_record: string\]}} you should see {{df: org.apache.spark.sql.DataFrame = \[age: bigint, name: string\]}} Spark SQL 1.3 cannot read a JSON file with only one record. Key: SPARK-6659 URL: https://issues.apache.org/jira/browse/SPARK-6659 Project: Spark Issue Type: Bug Reporter: luochenghui Dear friends: Spark SQL 1.3 cannot read a JSON file with only one record. Here is my JSON file's content: {"name":"milo","age",24} When I run Spark SQL in local mode, it throws an exception: org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns _corrupt_record; What I had done: 1. ./spark-shell 2. scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc) sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@5f3be6c8 scala> val df = sqlContext.jsonFile("/home/milo/person.json") 15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=0, maxMem=280248975 15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 159.9 KB, free 267.1 MB) 15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(22692) called with curMem=163705, maxMem=280248975 15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 267.1 MB) 15/03/19 22:11:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:35842 (size: 22.2 KB, free: 267.2 MB) 15/03/19 22:11:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0 15/03/19 22:11:45 INFO SparkContext: Created broadcast 0 from textFile at JSONRelation.scala:98 15/03/19 22:11:47 INFO FileInputFormat: Total input paths to process : 1 15/03/19 22:11:47 INFO SparkContext: Starting job: reduce at JsonRDD.scala:51 15/03/19 22:11:47 INFO DAGScheduler: Got job 0 (reduce at JsonRDD.scala:51) with 1 output partitions (allowLocal=false) 15/03/19 22:11:47 INFO DAGScheduler: Final stage: Stage 0(reduce at JsonRDD.scala:51) 15/03/19 22:11:47 INFO DAGScheduler: Parents of final stage: List() 15/03/19 22:11:47 INFO DAGScheduler: Missing parents: List() 15/03/19 22:11:47 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[3] at map at JsonRDD.scala:51), which has no missing parents 15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(3184) called with curMem=186397, maxMem=280248975 15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 267.1 MB) 15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(2251) called with curMem=189581, maxMem=280248975 15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.2 KB, free 267.1 MB) 15/03/19 22:11:47 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:35842 (size: 2.2 KB, free: 267.2 MB) 15/03/19 22:11:47 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0 15/03/19 22:11:47 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839 15/03/19 22:11:48 INFO DAGScheduler: Submitting 1 missing tasks
from Stage 0 (MapPartitionsRDD[3] at map at JsonRDD.scala:51) 15/03/19 22:11:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks 15/03/19 22:11:48 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1291 bytes) 15/03/19 22:11:48 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 15/03/19 22:11:48 INFO HadoopRDD: Input split: file:/home/milo/person.json:0+26 15/03/19 22:11:48 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 15/03/19 22:11:48 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 15/03/19 22:11:48 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 15/03/19 22:11:48 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 15/03/19 22:11:48 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 15/03/19 22:11:49 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2023 bytes result sent to driver 15/03/19 22:11:49 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1209 ms on localhost (1/1) 15/03/19 22:11:49
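To make Adnan's point above concrete, a one-record file parses fine once the record is valid JSON. This sketch assumes the same spark-shell session and file path as the report; the expected DataFrame signature is taken from Adnan's comment.

{code}
// /home/milo/person.json containing exactly one valid record:
//   {"name":"milo","age":24}
val df = sqlContext.jsonFile("/home/milo/person.json")
// df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
df.registerTempTable("person")
sqlContext.sql("SELECT name, age FROM person").show()
{code}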
[jira] [Created] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
Florian Verhein created SPARK-6664: -- Summary: Split Ordered RDD into multiple RDDs by keys (boundaries or intervals) Key: SPARK-6664 URL: https://issues.apache.org/jira/browse/SPARK-6664 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Florian Verhein I can't find this functionality (if I missed something, apologies!), but it would be very useful for evaluating ML models. Use case example: suppose you have pre-processed web logs for a few months, and now want to split them into a training set (where you train a model to predict some aspect of site accesses, perhaps per user) and an out-of-time test set (where you evaluate how well your model performs in the future). This example has just a single split, but in general you could want more for cross validation. You may also want to have multiple overlapping intervals. Specification: 1. Given an ordered RDD and an ordered sequence of n boundaries (i.e. keys), return n+1 RDDs such that the keys in the i-th RDD fall between the (i-1)-th and i-th boundaries. 2. More complex alternative (but similar under the hood): provide a sequence of possibly overlapping intervals, and return the RDDs containing values within those intervals. Implementation ideas / notes for 1: - The ordered RDDs are likely RangePartitioned (or there should be a simple way to find ranges from partitions in an ordered RDD) - Find the partitions containing the boundaries, and split them in two. - Construct the new RDDs from the original partitions (and any split ones) I suspect this could be done by launching only a few jobs to split the partitions containing the boundaries. Alternatively, it might be possible to decorate these partitions and use them in more than one RDD. I.e. let one of these partitions (for boundary i) be p. Apply two decorators p' and p'', where p' masks out values above the i-th boundary, and p'' masks out values below the i-th boundary. Any operations on these partitions apply only to values not masked out. Then assign p' to the i-th output RDD and p'' to the (i+1)-th output RDD. If I understand Spark correctly, this should not require any jobs. Not sure whether it's worth trying this optimisation. Implementation ideas / notes for 2: This is very similar, except that we have to handle entire partitions (or parts of them) belonging to more than one output RDD, since they are no longer mutually exclusive. But since RDDs are immutable(?), the decorator idea should still work? Thoughts?
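As a baseline for specification (1), here is a naive filter-based sketch; it is my own illustration, not a proposed implementation. It scans the parent once per output RDD, which is exactly the cost the range-partitioner and decorator ideas above try to avoid.

{code}
import org.apache.spark.rdd.RDD

// n sorted boundaries -> n + 1 RDDs; the i-th output holds keys k with
// boundaries(i-1) <= k < boundaries(i), with open ends at both extremes.
def splitByBoundaries[K: Ordering, V](rdd: RDD[(K, V)], boundaries: Seq[K]): Seq[RDD[(K, V)]] = {
  val ord = implicitly[Ordering[K]]
  val los = None +: boundaries.map(Option(_)) // lower bound of each interval
  val his = boundaries.map(Option(_)) :+ None // upper bound of each interval
  los.zip(his).map { case (lo, hi) =>
    rdd.filter { case (k, _) =>
      lo.forall(ord.lteq(_, k)) && hi.forall(ord.gt(_, k))
    }
  }
}
{code}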
[jira] [Created] (SPARK-6667) hang while collecting in PySpark
Davies Liu created SPARK-6667: - Summary: hang while collecting in PySpark Key: SPARK-6667 URL: https://issues.apache.org/jira/browse/SPARK-6667 Project: Spark Issue Type: Bug Components: PySpark Reporter: Davies Liu Assignee: Davies Liu PySpark tests hang while collecting:
[jira] [Reopened] (SPARK-6618) HiveMetastoreCatalog.lookupRelation should use fine-grained lock
[ https://issues.apache.org/jira/browse/SPARK-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reopened SPARK-6618: - HiveMetastoreCatalog.lookupRelation should use fine-grained lock Key: SPARK-6618 URL: https://issues.apache.org/jira/browse/SPARK-6618 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.3.1, 1.4.0 Right now, the entire HiveMetastoreCatalog.lookupRelation method holds a single lock (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L173), and the scope of the lock covers resolving data source tables (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L93). So lookupRelation can be extremely expensive when we are doing expensive operations like Parquet schema discovery. We should use a fine-grained lock for lookupRelation.
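A hedged sketch of the general pattern, with illustrative names rather than the actual HiveMetastoreCatalog code: hold the lock only around cache reads and writes, and run the expensive resolution outside it.

{code}
import scala.collection.mutable

// Double-checked cache lookup: the lock guards only the map, so one table's
// slow Parquet schema discovery no longer blocks every other lookupRelation.
class CatalogSketch[R](resolve: String => R) {
  private val lock  = new Object
  private val cache = mutable.Map.empty[String, R]

  def lookupRelation(table: String): R = {
    lock.synchronized(cache.get(table)) match {
      case Some(rel) => rel
      case None =>
        val rel = resolve(table) // expensive work happens without the lock;
        // two threads may race to resolve the same table, and the cache then
        // keeps whichever result lands first.
        lock.synchronized(cache.getOrElseUpdate(table, rel))
    }
  }
}
{code}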
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390950#comment-14390950 ] Vinay Shukla commented on SPARK-6646: - This use case can benefit from running Spark inside a Mobile App Server. An app server that takes care of horizontal issues such as security, networking, etc. will allow Spark to focus on the real hard problem of data processing in a lightning-fast manner. There is another idea of having Spark leverage [parallel quantum computing | http://people.csail.mit.edu/nhm/pqc.pdf], but I suppose that calls for another JIRA. Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark's project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark's mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html
[jira] [Created] (SPARK-6651) Delegate dense vector arithmetics to the underlying numpy array
Xiangrui Meng created SPARK-6651: Summary: Delegate dense vector arithmetics to the underlying numpy array Key: SPARK-6651 URL: https://issues.apache.org/jira/browse/SPARK-6651 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng It is convenient to delegate dense linear algebra operations to numpy.
[jira] [Commented] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message
[ https://issues.apache.org/jira/browse/SPARK-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391035#comment-14391035 ] Spiro Michaylov commented on SPARK-6587: I appreciate the comment, and clearly I was confused about a couple of things, but I wonder if there's still an interesting RFE here. My example was intended to internalize into case classes some really powerful Spark SQL behavior that I've observed when inferring schema for JSON: {code} val textConflict = sc.parallelize(Seq( """{"key":42}""", """{"key":"hello"}""", """{"key":false}""" ), 4) val jsonConflict = sqlContext.jsonRDD(textConflict) jsonConflict.printSchema() jsonConflict.registerTempTable("conflict") sqlContext.sql("SELECT * FROM conflict").show() {code} Which produces: {noformat} root |-- key: string (nullable = true) key 42 hello false {noformat} This behavior is IMO a *really* nice compromise: a type is inferred, it is approximate, so there are certain things you can't do in the query, but type information is still preserved when returning results from the query. I was trying to help the poster on StackOverflow to achieve similar behavior from case classes, and I thought a hierarchy was necessary. While I was clearly barking up the wrong tree, I wonder: a) Is it intended that these kinds of type conflicts be handled as elegantly when one is using case classes rather than the JSON parser? b) Is there already a way to do it that I failed to find? (Suspicion: no, but I've been wrong before ...) c) If respectively YES and NO, how should the RFE be phrased? Inferring schema for case class hierarchy fails with mysterious message --- Key: SPARK-6587 URL: https://issues.apache.org/jira/browse/SPARK-6587 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: At least Windows 8, Scala 2.11.2. Reporter: Spiro Michaylov (Don't know if this is a functionality bug, error reporting bug or an RFE ...)
I define the following hierarchy: {code} private abstract class MyHolder private case class StringHolder(s: String) extends MyHolder private case class IntHolder(i: Int) extends MyHolder private case class BooleanHolder(b: Boolean) extends MyHolder {code} and a top level case class: {code} private case class Thing(key: Integer, foo: MyHolder) {code} When I try to convert it: {code} val things = Seq( Thing(1, IntHolder(42)), Thing(2, StringHolder("hello")), Thing(3, BooleanHolder(false)) ) val thingsDF = sc.parallelize(things, 4).toDF() thingsDF.registerTempTable("things") val all = sqlContext.sql("SELECT * from things") {code} I get the following stack trace: {noformat} Exception in thread "main" scala.MatchError: sql.CaseClassSchemaProblem.MyHolder (of class scala.reflect.internal.Types$ClassNoArgsTypeRef) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157) at scala.collection.immutable.List.map(List.scala:276) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312) at org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250) at sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35) at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) {noformat} I wrote this to answer [a question on StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql] which uses a much simpler approach and suffers the same problem. Looking at what seems to me to be the
[jira] [Created] (SPARK-6652) SQLContext and HiveContext do not handle tricky names well
Max Seiden created SPARK-6652: - Summary: SQLContext and HiveContext do not handle tricky names well Key: SPARK-6652 URL: https://issues.apache.org/jira/browse/SPARK-6652 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Max Seiden h3. Summary There are cases where both the SQLContext and HiveContext fail to handle tricky names (containing UTF-8, tabs, newlines, etc.) well. For example, the following string: {noformat} val tricky = "Tricky-\u4E2D[x.][\,/\\n * ? é\n$(x)\t(':;#!^-Name" {noformat} causes the following exceptions during parsing and resolution (respectively). h5. SQLContext parse failure {noformat} // pseudocode val data = 0 until 100 val rdd = sc.parallelize(data) val schema = StructType(StructField(tricky, IntegerType, false) :: Nil) val schemaRDD = sqlContext.applySchema(rdd.map(i => Row(i)), schema) schemaRDD.registerAsTable(tricky) sqlContext.sql(s"select `$tricky` from `$tricky`") java.lang.RuntimeException: [1.33] failure: ``UNION'' expected but ErrorToken(``' expected but found) found select `Tricky-中[x.][,/\n * ? é ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174) at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31) at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83) at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:83) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:303) {noformat} h5.
HiveContext resolution failure {noformat} // pseudocode val data = 0 until 100 val rdd = sc.parallelize(data) val schema = StructType(StructField(tricky, IntegerType, false) :: Nil) val schemaRDD = sqlContext.applySchema(rdd.map(i => Row(i)), schema) schemaRDD.registerAsTable(tricky) sqlContext.sql(s"select `$tricky` from `$tricky`").collect() // the parse is ok in this case... 15/04/01 10:41:48 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead 15/04/01 10:41:48 INFO ParseDriver: Parsing command: select `Tricky-中[x.][,/\n * ? é $(x) (':;#!^-Name` from `Tricky-中[x.][,/\n * ? é $(x) (':;#!^-Name` 15/04/01 10:41:48 INFO ParseDriver: Parse Completed // but resolution fails org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: 'Tricky-中[x.][,/\n * ? é $(x) (':;#!^-Name, tree: 'Project ['Tricky-中[x.][,/\n * ? é $(x) (':;#!^-Name] Subquery tricky-中[x.][,/\n * ? é $(x) (':;#!^-name LogicalRDD [Tricky-中[x.][,/\n * ? é $(x) (':;#!^-Name#2], MappedRDD[16] at map at <console>:30 at
[jira] [Assigned] (SPARK-6650) ExecutorAllocationManager never stops
[ https://issues.apache.org/jira/browse/SPARK-6650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6650: --- Assignee: Apache Spark ExecutorAllocationManager never stops - Key: SPARK-6650 URL: https://issues.apache.org/jira/browse/SPARK-6650 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Marcelo Vanzin Assignee: Apache Spark {{ExecutorAllocationManager}} doesn't even have a stop() method. That means that when the owning SparkContext goes away, the internal thread it uses to schedule its activities remains alive. That means it constantly spams the logs and does who knows what else that could affect any future contexts that are allocated. It's particularly evil during unit tests, since it slows down everything else after the suite is run, leaving multiple threads behind.
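A sketch of the obvious remedy; the class name mirrors the report, but the body is my assumption, not the eventual patch: give the manager a thread it owns and expose a stop() for SparkContext to call.

{code}
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}

// Own the scheduling thread explicitly, with a lifecycle matching SparkContext's.
class ExecutorAllocationManagerSketch {
  private val scheduler: ScheduledExecutorService =
    Executors.newSingleThreadScheduledExecutor()

  def start(): Unit = {
    val task = new Runnable {
      override def run(): Unit = () // would run one scale-up/scale-down pass
    }
    scheduler.scheduleWithFixedDelay(task, 0, 100, TimeUnit.MILLISECONDS)
  }

  // Called from SparkContext.stop(); without this, the thread outlives the
  // context and keeps logging, which is exactly what the ticket observes.
  def stop(): Unit = scheduler.shutdownNow()
}
{code}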
[jira] [Created] (SPARK-6653) New configuration property to specify port for sparkYarnAM actor system
Manoj Samel created SPARK-6653: -- Summary: New configuration property to specify port for sparkYarnAM actor system Key: SPARK-6653 URL: https://issues.apache.org/jira/browse/SPARK-6653 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.0 Environment: Spark on YARN Reporter: Manoj Samel In the 1.3.0 code line, the sparkYarnAM actor system is started on a random port. See org.apache.spark.deploy.yarn ApplicationMaster.scala:282 actorSystem = AkkaUtils.createActorSystem("sparkYarnAM", Utils.localHostName, 0, conf = sparkConf, securityManager = securityMgr)._1 This may be an issue when ports between the Spark client and the YARN cluster are limited by a firewall and not all ports are open between the client and the YARN cluster. The proposal is to introduce a new property spark.am.actor.port and change the code to val port = sparkConf.getInt("spark.am.actor.port", 0) actorSystem = AkkaUtils.createActorSystem("sparkYarnAM", Utils.localHostName, port, conf = sparkConf, securityManager = securityMgr)._1
[jira] [Assigned] (SPARK-6655) We need to read the schema of a data source table stored in spark.sql.sources.schema property
[ https://issues.apache.org/jira/browse/SPARK-6655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6655: --- Assignee: Apache Spark (was: Yin Huai) We need to read the schema of a data source table stored in spark.sql.sources.schema property - Key: SPARK-6655 URL: https://issues.apache.org/jira/browse/SPARK-6655 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Yin Huai Assignee: Apache Spark Priority: Blocker Fix For: 1.3.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6655) We need to read the schema of a data source table stored in spark.sql.sources.schema property
[ https://issues.apache.org/jira/browse/SPARK-6655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6655: --- Assignee: Yin Huai (was: Apache Spark) We need to read the schema of a data source table stored in spark.sql.sources.schema property - Key: SPARK-6655 URL: https://issues.apache.org/jira/browse/SPARK-6655 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.3.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6655) We need to read the schema of a data source table stored in spark.sql.sources.schema property
[ https://issues.apache.org/jira/browse/SPARK-6655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391282#comment-14391282 ] Apache Spark commented on SPARK-6655: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/5313 We need to read the schema of a data source table stored in spark.sql.sources.schema property - Key: SPARK-6655 URL: https://issues.apache.org/jira/browse/SPARK-6655 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.3.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()
[ https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-5960: Target Version/s: 1.4.0 (was: 1.3.1) Allow AWS credentials to be passed to KinesisUtils.createStream() - Key: SPARK-5960 URL: https://issues.apache.org/jira/browse/SPARK-5960 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Chris Fregly Assignee: Chris Fregly While IAM roles are preferable, we're seeing a lot of cases where we need to pass AWS credentials when creating the KinesisReceiver. Notes: * Make sure we don't log the credentials anywhere * Maintain compatibility with existing KinesisReceiver-based code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
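As a concrete illustration of the request, a hedged sketch of the call shape (the two trailing credential parameters are the proposed addition and an assumption of this sketch, not a committed API; the other arguments follow the 1.x Kinesis ASL createStream):
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

val ssc = new StreamingContext(new SparkConf().setAppName("kinesis-demo"), Seconds(2))

// Read credentials from the environment so they never appear in code or logs.
val awsAccessKeyId = sys.env("AWS_ACCESS_KEY_ID")
val awsSecretKey = sys.env("AWS_SECRET_KEY")

val stream = KinesisUtils.createStream(
  ssc, "myKinesisStream", "https://kinesis.us-east-1.amazonaws.com",
  Seconds(2), InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2,
  awsAccessKeyId, awsSecretKey) // hypothetical parameters; IAM roles remain the default
{code}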
[jira] [Created] (SPARK-6654) Update Kinesis Streaming impls (both KCL-based and Direct) to use latest aws-java-sdk and kinesis-client-library
Chris Fregly created SPARK-6654: --- Summary: Update Kinesis Streaming impls (both KCL-based and Direct) to use latest aws-java-sdk and kinesis-client-library Key: SPARK-6654 URL: https://issues.apache.org/jira/browse/SPARK-6654 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Chris Fregly -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6656) Allow the application name to be passed in versus pulling from SparkContext.getAppName()
Chris Fregly created SPARK-6656: --- Summary: Allow the application name to be passed in versus pulling from SparkContext.getAppName() Key: SPARK-6656 URL: https://issues.apache.org/jira/browse/SPARK-6656 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Chris Fregly This is useful for the scenario where Kinesis Spark Streaming is being invoked from the Spark Shell. In this case, the application name in the SparkContext is pre-set to Spark Shell. This isn't a common or recommended use case, but it's best to make this configurable outside of SparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
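A one-line sketch of the requested behavior, with hypothetical names (not a committed API): an explicitly passed name should win, with the SparkContext name as the fallback.
{code}
import org.apache.spark.SparkContext

// Hypothetical helper: an explicit name wins; otherwise keep today's behavior
// of using the context's application name (e.g. "Spark Shell" in the shell).
def resolveKinesisAppName(explicitName: Option[String], sc: SparkContext): String =
  explicitName.getOrElse(sc.appName)
{code}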
[jira] [Updated] (SPARK-4184) Improve Spark Streaming documentation to address commonly-asked questions
[ https://issues.apache.org/jira/browse/SPARK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-4184: Target Version/s: 1.4.0 (was: 1.3.1) Improve Spark Streaming documentation to address commonly-asked questions -- Key: SPARK-4184 URL: https://issues.apache.org/jira/browse/SPARK-4184 Project: Spark Issue Type: Documentation Components: Streaming Reporter: Chris Fregly Labels: documentation, streaming Improve Streaming documentation including API descriptions, concurrency/thread safety, fault tolerance, replication, checkpointing, scalability, resource allocation and utilization, back pressure, and monitoring. Also, add a section to the Kinesis streaming guide describing how to use IAM roles with the Spark Kinesis Receiver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6655) We need to read the schema of a data source table stored in spark.sql.sources.schema property
Yin Huai created SPARK-6655: --- Summary: We need to read the schema of a data source table stored in spark.sql.sources.schema property Key: SPARK-6655 URL: https://issues.apache.org/jira/browse/SPARK-6655 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.3.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6651) Delegate dense vector arithmetics to the underlying numpy array
[ https://issues.apache.org/jira/browse/SPARK-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6651: --- Assignee: Apache Spark (was: Xiangrui Meng) Delegate dense vector arithmetics to the underlying numpy array Key: SPARK-6651 URL: https://issues.apache.org/jira/browse/SPARK-6651 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Apache Spark It is convenient to delegate dense linear algebra operations to numpy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6651) Delegate dense vector arithmetics to the underlying numpy array
[ https://issues.apache.org/jira/browse/SPARK-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391020#comment-14391020 ] Apache Spark commented on SPARK-6651: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/5312 Delegate dense vector arithmetics to the underlying numpy array Key: SPARK-6651 URL: https://issues.apache.org/jira/browse/SPARK-6651 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng It is convenient to delegate dense linear algebra operations to numpy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6651) Delegate dense vector arithmetics to the underlying numpy array
[ https://issues.apache.org/jira/browse/SPARK-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6651: --- Assignee: Xiangrui Meng (was: Apache Spark) Delegate dense vector arithmetics to the underlying numpy array Key: SPARK-6651 URL: https://issues.apache.org/jira/browse/SPARK-6651 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng It is convenient to delegate dense linear algebra operations to numpy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5972) Cache residuals for GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391521#comment-14391521 ] Manoj Kumar commented on SPARK-5972: [~josephkb] This should be done independently of evaluateEachIteration, right? (That is, evaluateEachIteration should not be used in the GradientBoostedTrees code that caches the error and residuals, since the model has not been trained yet.) Cache residuals for GradientBoostedTrees during training Key: SPARK-5972 URL: https://issues.apache.org/jira/browse/SPARK-5972 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor In gradient boosting, the current model's prediction is re-computed for each training instance on every iteration. The current residual (cumulative prediction of previously trained trees in the ensemble) should be cached. That could reduce both computation (only computing the prediction of the most recently trained tree) and communication (only sending the most recently trained tree to the workers). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6657) Fix Python doc build warnings
Joseph K. Bradley created SPARK-6657: Summary: Fix Python doc build warnings Key: SPARK-6657 URL: https://issues.apache.org/jira/browse/SPARK-6657 Project: Spark Issue Type: Documentation Components: Documentation, MLlib, PySpark, SQL, Streaming Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Reported by [~rxin] {code} /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:15: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:16: WARNING: Block quote ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:18: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:22: WARNING: Definition list ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:28: WARNING: Definition list ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:13: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:14: WARNING: Block quote ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:16: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:18: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.collect:1: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.orderBy:3: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.sort:3: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.take:1: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title underline too short. pyspark.streaming.kafka module /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title underline too short. pyspark.streaming.kafka module {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5972) Cache residuals for GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391536#comment-14391536 ] Joseph K. Bradley commented on SPARK-5972: -- They should be at least partly separate, in that evaluateEachIteration itself will not be used for this. But this JIRA and evaluateEachIteration might be able to share some code to avoid code duplication. Cache residuals for GradientBoostedTrees during training Key: SPARK-5972 URL: https://issues.apache.org/jira/browse/SPARK-5972 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor In gradient boosting, the current model's prediction is re-computed for each training instance on every iteration. The current residual (cumulative prediction of previously trained trees in the ensemble) should be cached. That could reduce both computation (only computing the prediction of the most recently trained tree) and communication (only sending the most recently trained tree to the workers). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
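To make the proposed optimization concrete, a hedged sketch in plain Scala (illustrative only, not Spark's GradientBoostedTrees code): cache the cumulative prediction per instance and update it with only the newest tree, so each iteration scores one tree instead of re-scoring the whole ensemble.
{code}
case class Instance(label: Double, features: Array[Double])
trait RegressionTree { def predict(features: Array[Double]): Double }

// fitTree stands in for training a regression tree on (features, target) pairs.
def boost(data: Seq[Instance], numIterations: Int, learningRate: Double,
          fitTree: Seq[(Array[Double], Double)] => RegressionTree): Seq[RegressionTree] = {
  // Cached cumulative prediction for every training instance.
  var cumPrediction = Array.fill(data.length)(0.0)
  val trees = Seq.newBuilder[RegressionTree]
  for (_ <- 0 until numIterations) {
    // Residuals come from the cache, not from re-scoring all previous trees.
    val targets = data.zip(cumPrediction).map { case (inst, p) => (inst.features, inst.label - p) }
    val tree = fitTree(targets)
    trees += tree
    // Update the cache with only the newest tree's contribution.
    cumPrediction = data.zip(cumPrediction).map { case (inst, p) =>
      p + learningRate * tree.predict(inst.features)
    }.toArray
  }
  trees.result()
}
{code}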
[jira] [Updated] (SPARK-6658) Incorrect DataFrame Documentation Type References
[ https://issues.apache.org/jira/browse/SPARK-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chet Mancini updated SPARK-6658: Priority: Trivial (was: Major) Incorrect DataFrame Documentation Type References - Key: SPARK-6658 URL: https://issues.apache.org/jira/browse/SPARK-6658 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Chet Mancini Priority: Trivial Labels: docuentation Original Estimate: 5m Remaining Estimate: 5m A few methods under DataFrame incorrectly refer to the receiver as an RDD in their documentation. * createJDBCTable * insertIntoJDBC * registerTempTable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6658) Incorrect DataFrame Documentation Type References
Chet Mancini created SPARK-6658: --- Summary: Incorrect DataFrame Documentation Type References Key: SPARK-6658 URL: https://issues.apache.org/jira/browse/SPARK-6658 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Chet Mancini A few methods under DataFrame incorrectly refer to the receiver as an RDD in their documentation. * createJDBCTable * insertIntoJDBC * registerTempTable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6658) Incorrect DataFrame Documentation Type References
[ https://issues.apache.org/jira/browse/SPARK-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chet Mancini updated SPARK-6658: Labels: documentation (was: docuentation) Incorrect DataFrame Documentation Type References - Key: SPARK-6658 URL: https://issues.apache.org/jira/browse/SPARK-6658 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Chet Mancini Priority: Trivial Labels: documentation Original Estimate: 5m Remaining Estimate: 5m A few methods under DataFrame incorrectly refer to the receiver as an RDD in their documentation. * createJDBCTable * insertIntoJDBC * registerTempTable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6659) Spark SQL 1.3 cannot read a json file that contains only a single record.
luochenghui created SPARK-6659: -- Summary: Spark SQL 1.3 cannot read a json file that contains only a single record Key: SPARK-6659 URL: https://issues.apache.org/jira/browse/SPARK-6659 Project: Spark Issue Type: Bug Reporter: luochenghui
Dear friends:
Spark SQL 1.3 cannot read a json file that contains only a single record. Here is my json file's content:
{noformat}
{name:milo,age,24}
{noformat}
When I run Spark SQL in local mode, it throws an exception:
org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns _corrupt_record;
What I had done:
1. ./spark-shell
2.
{noformat}
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@5f3be6c8
scala> val df = sqlContext.jsonFile("/home/milo/person.json")
15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=0, maxMem=280248975
15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 159.9 KB, free 267.1 MB)
15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(22692) called with curMem=163705, maxMem=280248975
15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 267.1 MB)
15/03/19 22:11:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:35842 (size: 22.2 KB, free: 267.2 MB)
15/03/19 22:11:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/03/19 22:11:45 INFO SparkContext: Created broadcast 0 from textFile at JSONRelation.scala:98
15/03/19 22:11:47 INFO FileInputFormat: Total input paths to process : 1
15/03/19 22:11:47 INFO SparkContext: Starting job: reduce at JsonRDD.scala:51
15/03/19 22:11:47 INFO DAGScheduler: Got job 0 (reduce at JsonRDD.scala:51) with 1 output partitions (allowLocal=false)
15/03/19 22:11:47 INFO DAGScheduler: Final stage: Stage 0(reduce at JsonRDD.scala:51)
15/03/19 22:11:47 INFO DAGScheduler: Parents of final stage: List()
15/03/19 22:11:47 INFO DAGScheduler: Missing parents: List()
15/03/19 22:11:47 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[3] at map at JsonRDD.scala:51), which has no missing parents
15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(3184) called with curMem=186397, maxMem=280248975
15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 267.1 MB)
15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(2251) called with curMem=189581, maxMem=280248975
15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.2 KB, free 267.1 MB)
15/03/19 22:11:47 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:35842 (size: 2.2 KB, free: 267.2 MB)
15/03/19 22:11:47 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/03/19 22:11:47 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839
15/03/19 22:11:48 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MapPartitionsRDD[3] at map at JsonRDD.scala:51)
15/03/19 22:11:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/03/19 22:11:48 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1291 bytes)
15/03/19 22:11:48 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/03/19 22:11:48 INFO HadoopRDD: Input split: file:/home/milo/person.json:0+26
15/03/19 22:11:48 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/03/19 22:11:48 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/03/19 22:11:48 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/03/19 22:11:48 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/03/19 22:11:48 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/03/19 22:11:49 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2023 bytes result sent to driver
15/03/19 22:11:49 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1209 ms on localhost (1/1)
15/03/19 22:11:49 INFO DAGScheduler: Stage 0 (reduce at JsonRDD.scala:51) finished in 1.308 s
15/03/19 22:11:49 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/03/19 22:11:49 INFO DAGScheduler: Job 0 finished: reduce at JsonRDD.scala:51, took 2.002429 s
df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
{noformat}
3.
{noformat}
scala> df.select("name").show()
15/03/19 22:12:44 INFO BlockManager: Removing broadcast 1
15/03/19 22:12:44 INFO BlockManager: Removing block broadcast_1_piece0
15/03/19 22:12:44 INFO MemoryStore: Block broadcast_1_piece0 of size 2251 dropped from memory (free 280059394)
15/03/19 22:12:44 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:35842 in memory (size: 2.2 KB,
{noformat}
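For context on the failure above: Spark's json reader expects one complete, valid JSON object per line and routes anything that fails to parse into the _corrupt_record column. The record shown in the report appears to be invalid JSON (note the `age,24` where `age:24` would be required; quoting may also have been stripped by the mail digest). A minimal sketch of a session that would succeed, assuming a corrected single-record file:
{code}
// person.json must contain one valid JSON object per line, e.g.:
// {"name":"milo","age":24}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.jsonFile("/home/milo/person.json")
df.printSchema()          // age: long, name: string -- no _corrupt_record column
df.select("name").show()  // resolves once the schema is inferred from valid JSON
{code}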
[jira] [Assigned] (SPARK-6658) Incorrect DataFrame Documentation Type References
[ https://issues.apache.org/jira/browse/SPARK-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6658: --- Assignee: Apache Spark Incorrect DataFrame Documentation Type References - Key: SPARK-6658 URL: https://issues.apache.org/jira/browse/SPARK-6658 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Chet Mancini Assignee: Apache Spark Priority: Trivial Labels: documentation Original Estimate: 5m Remaining Estimate: 5m A few methods under DataFrame incorrectly refer to the receiver as an RDD in their documentation. * createJDBCTable * insertIntoJDBC * registerTempTable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6658) Incorrect DataFrame Documentation Type References
[ https://issues.apache.org/jira/browse/SPARK-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391563#comment-14391563 ] Apache Spark commented on SPARK-6658: - User 'chetmancini' has created a pull request for this issue: https://github.com/apache/spark/pull/5316 Incorrect DataFrame Documentation Type References - Key: SPARK-6658 URL: https://issues.apache.org/jira/browse/SPARK-6658 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Chet Mancini Priority: Trivial Labels: documentation Original Estimate: 5m Remaining Estimate: 5m A few methods under DataFrame incorrectly refer to the receiver as an RDD in their documentation. * createJDBCTable * insertIntoJDBC * registerTempTable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6658) Incorrect DataFrame Documentation Type References
[ https://issues.apache.org/jira/browse/SPARK-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6658: --- Assignee: (was: Apache Spark) Incorrect DataFrame Documentation Type References - Key: SPARK-6658 URL: https://issues.apache.org/jira/browse/SPARK-6658 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Chet Mancini Priority: Trivial Labels: documentation Original Estimate: 5m Remaining Estimate: 5m A few methods under DataFrame incorrectly refer to the receiver as an RDD in their documentation. * createJDBCTable * insertIntoJDBC * registerTempTable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5989) Model import/export for LDAModel
[ https://issues.apache.org/jira/browse/SPARK-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390209#comment-14390209 ] Manoj Kumar edited comment on SPARK-5989 at 4/1/15 10:04 PM: - [~josephkb] Can this be assigned to me? Thanks! was (Author: mechcoder): Can this be assigned to me? Thanks! Model import/export for LDAModel Key: SPARK-5989 URL: https://issues.apache.org/jira/browse/SPARK-5989 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Add save/load for LDAModel and its local and distributed variants. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6657) Fix Python doc build warnings
[ https://issues.apache.org/jira/browse/SPARK-6657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6657: --- Assignee: Apache Spark (was: Joseph K. Bradley) Fix Python doc build warnings - Key: SPARK-6657 URL: https://issues.apache.org/jira/browse/SPARK-6657 Project: Spark Issue Type: Documentation Components: Documentation, MLlib, PySpark, SQL, Streaming Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Apache Spark Priority: Trivial Reported by [~rxin] {code} /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:15: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:16: WARNING: Block quote ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:18: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:22: WARNING: Definition list ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:28: WARNING: Definition list ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:13: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:14: WARNING: Block quote ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:16: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:18: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.collect:1: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.orderBy:3: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.sort:3: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.take:1: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title underline too short. pyspark.streaming.kafka module /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title underline too short. pyspark.streaming.kafka module {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6657) Fix Python doc build warnings
[ https://issues.apache.org/jira/browse/SPARK-6657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391593#comment-14391593 ] Apache Spark commented on SPARK-6657: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/5317 Fix Python doc build warnings - Key: SPARK-6657 URL: https://issues.apache.org/jira/browse/SPARK-6657 Project: Spark Issue Type: Documentation Components: Documentation, MLlib, PySpark, SQL, Streaming Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Reported by [~rxin] {code} /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:15: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:16: WARNING: Block quote ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:18: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:22: WARNING: Definition list ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:28: WARNING: Definition list ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:13: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:14: WARNING: Block quote ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:16: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:18: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.collect:1: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.orderBy:3: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.sort:3: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.take:1: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title underline too short. pyspark.streaming.kafka module /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title underline too short. pyspark.streaming.kafka module {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6657) Fix Python doc build warnings
[ https://issues.apache.org/jira/browse/SPARK-6657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6657: --- Assignee: Joseph K. Bradley (was: Apache Spark) Fix Python doc build warnings - Key: SPARK-6657 URL: https://issues.apache.org/jira/browse/SPARK-6657 Project: Spark Issue Type: Documentation Components: Documentation, MLlib, PySpark, SQL, Streaming Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Reported by [~rxin] {code} /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:15: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:16: WARNING: Block quote ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:18: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:22: WARNING: Definition list ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:28: WARNING: Definition list ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:13: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:14: WARNING: Block quote ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:16: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:18: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.collect:1: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.orderBy:3: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.sort:3: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.take:1: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title underline too short. pyspark.streaming.kafka module /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title underline too short. pyspark.streaming.kafka module {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6650) ExecutorAllocationManager never stops
Marcelo Vanzin created SPARK-6650: - Summary: ExecutorAllocationManager never stops Key: SPARK-6650 URL: https://issues.apache.org/jira/browse/SPARK-6650 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Marcelo Vanzin {{ExecutorAllocationManager}} doesn't even have a stop() method. That means that when the owning SparkContext goes away, the internal thread it uses to schedule its activities remains alive. That means it constantly spams the logs and does who knows what else that could affect any future contexts that are allocated. It's particularly evil during unit tests, since it slows down everything else after the suite is run, leaving multiple threads behind. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391335#comment-14391335 ] Deenar Toraskar commented on SPARK-6646: maybe Spark 2.0 should be branded i-Spark Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6642) Change the lambda weight to number of explicit ratings in implicit ALS
[ https://issues.apache.org/jira/browse/SPARK-6642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391412#comment-14391412 ] Apache Spark commented on SPARK-6642: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/5314 Change the lambda weight to number of explicit ratings in implicit ALS -- Key: SPARK-6642 URL: https://issues.apache.org/jira/browse/SPARK-6642 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Until SPARK-6637 is resolved, we should switch back to the 1.2 lambda weighting strategy to be consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6642) Change the lambda weight to number of explicit ratings in implicit ALS
[ https://issues.apache.org/jira/browse/SPARK-6642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6642: --- Assignee: Xiangrui Meng (was: Apache Spark) Change the lambda weight to number of explicit ratings in implicit ALS -- Key: SPARK-6642 URL: https://issues.apache.org/jira/browse/SPARK-6642 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Until SPARK-6637 is resolved, we should switch back to the 1.2 lambda weighting strategy to be consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6642) Change the lambda weight to number of explicit ratings in implicit ALS
[ https://issues.apache.org/jira/browse/SPARK-6642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6642: --- Assignee: Apache Spark (was: Xiangrui Meng) Change the lambda weight to number of explicit ratings in implicit ALS -- Key: SPARK-6642 URL: https://issues.apache.org/jira/browse/SPARK-6642 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Apache Spark Until SPARK-6637 is resolved, we should switch back to the 1.2 lambda weighting strategy to be consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
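For readers unfamiliar with the term, a rough illustration of what "lambda weighting" refers to (a hedged sketch of ALS-WR-style regularization, not Spark's solver): each user's or item's penalty is scaled by a per-factor count, and this ticket changes that count back to the number of explicit ratings, as in 1.2.
{code}
// Illustrative only: a weighted-lambda regularization term.
// factorNorms pairs each user/item factor's squared norm with the count used
// as its weight; per this ticket, that count is the number of explicit ratings.
def regularizationTerm(lambda: Double, factorNorms: Seq[(Double, Int)]): Double =
  lambda * factorNorms.map { case (normSq, numRatings) => numRatings * normSq }.sum
{code}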
[jira] [Commented] (SPARK-6373) Add SSL/TLS for the Netty based BlockTransferService
[ https://issues.apache.org/jira/browse/SPARK-6373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391452#comment-14391452 ] Jeffrey Turpin commented on SPARK-6373: --- Hey Aaron, Sorry for the delay... I have cleaned things up a bit and refactored the implementation to be more inline with our earlier conversation... Have a look at https://github.com/turp1twin/spark/commit/d976a7ab9b57e26fc180d649fd084a6acb9d027e and let me know your thoughts... Jeff Add SSL/TLS for the Netty based BlockTransferService - Key: SPARK-6373 URL: https://issues.apache.org/jira/browse/SPARK-6373 Project: Spark Issue Type: New Feature Components: Block Manager, Shuffle Affects Versions: 1.2.1 Reporter: Jeffrey Turpin Add the ability to allow for secure communications (SSL/TLS) for the Netty based BlockTransferService and the ExternalShuffleClient. This ticket will hopefully start the conversation around potential designs... Below is a reference to a WIP prototype which implements this functionality (prototype)... I have attempted to disrupt as little code as possible and tried to follow the current code structure (for the most part) in the areas I modified. I also studied how Hadoop achieves encrypted shuffle (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html) https://github.com/turp1twin/spark/commit/024b559f27945eb63068d1badf7f82e4e7c3621c -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
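As background on the general technique being discussed (a sketch under assumptions, not the linked prototype): in Netty, TLS is enabled by installing an SslHandler at the head of a channel's pipeline so that every inbound and outbound byte is encrypted. The snippet assumes a recent Netty 4.1 with SslContextBuilder.
{code}
import java.io.File
import io.netty.channel.socket.SocketChannel
import io.netty.handler.ssl.{SslContext, SslContextBuilder}

// Generic Netty pattern: the SslHandler must sit first in the pipeline so the
// encoder/decoder and framing handlers only ever see decrypted bytes.
def installServerTls(ch: SocketChannel, certChain: File, privateKey: File): Unit = {
  val sslCtx: SslContext = SslContextBuilder.forServer(certChain, privateKey).build()
  ch.pipeline().addFirst("ssl", sslCtx.newHandler(ch.alloc()))
}
{code}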
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391456#comment-14391456 ] Matei Zaharia commented on SPARK-6646: -- Not to rain on the parade here, but I worry that focusing on mobile phones is short-sighted. Does this design present a path forward for the Internet of Things as well? You'd want something that runs on Arduino, Raspberry Pi, etc. We already have MQTT input in Spark Streaming so we could consider using MQTT to replace Netty for shuffle as well. Has anybody benchmarked that? Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6651) Delegate dense vector arithmetics to the underlying numpy array
[ https://issues.apache.org/jira/browse/SPARK-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6651. -- Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Issue resolved by pull request 5312 [https://github.com/apache/spark/pull/5312] Delegate dense vector arithmetics to the underlying numpy array Key: SPARK-6651 URL: https://issues.apache.org/jira/browse/SPARK-6651 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.3.1, 1.4.0 It is convenient to delegate dense linear algebra operations to numpy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6658) Incorrect DataFrame Documentation Type References
[ https://issues.apache.org/jira/browse/SPARK-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chet Mancini resolved SPARK-6658. - Resolution: Implemented Incorrect DataFrame Documentation Type References - Key: SPARK-6658 URL: https://issues.apache.org/jira/browse/SPARK-6658 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Chet Mancini Priority: Trivial Labels: documentation Original Estimate: 5m Remaining Estimate: 5m A few methods under DataFrame incorrectly refer to the receiver as an RDD in their documentation. * createJDBCTable * insertIntoJDBC * registerTempTable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391623#comment-14391623 ] Neelesh Srinivas Salian commented on SPARK-2243: I hit this error. Simply closed the previous context. Any other workaround? Support multiple SparkContexts in the same JVM -- Key: SPARK-2243 URL: https://issues.apache.org/jira/browse/SPARK-2243 Project: Spark Issue Type: New Feature Components: Block Manager, Spark Core Affects Versions: 0.7.0, 1.0.0, 1.1.0 Reporter: Miguel Angel Fernandez Diaz We're developing a platform where we create several Spark contexts for carrying out different calculations. Is there any restriction when using several Spark contexts? We have two contexts, one for Spark calculations and another one for Spark Streaming jobs. The next error arises when we first execute a Spark calculation and, once the execution is finished, a Spark Streaming job is launched: {code} 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) 14/06/23 
16:40:08 WARN scheduler.TaskSetManager: Loss was due to java.io.FileNotFoundException java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391629#comment-14391629 ] Sean Owen commented on SPARK-2243: -- Sorry to be flippant but really the answer is to not make multiple SparkContexts. Simply run in separate JVMs, or share access to one SparkContext in the JVM. Support multiple SparkContexts in the same JVM -- Key: SPARK-2243 URL: https://issues.apache.org/jira/browse/SPARK-2243 Project: Spark Issue Type: New Feature Components: Block Manager, Spark Core Affects Versions: 0.7.0, 1.0.0, 1.1.0 Reporter: Miguel Angel Fernandez Diaz We're developing a platform where we create several Spark contexts for carrying out different calculations. Is there any restriction when using several Spark contexts? We have two contexts, one for Spark calculations and another one for Spark Streaming jobs. The next error arises when we first execute a Spark calculation and, once the execution is finished, a Spark Streaming job is launched: {code} 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/06/23 
16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to java.io.FileNotFoundException java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at
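A small sketch of the pattern recommended above (hypothetical helper, not a Spark 1.x API): keep a single SparkContext per JVM and share it, rather than constructing a second one.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// One context per JVM; batch jobs and a StreamingContext can both build on it.
object SharedSparkContext {
  @volatile private var instance: SparkContext = _

  def getOrCreate(conf: SparkConf): SparkContext = synchronized {
    if (instance == null) instance = new SparkContext(conf)
    instance
  }
}
{code}
A StreamingContext can then wrap the shared context (new StreamingContext(sc, Seconds(1))) instead of creating its own SparkContext, which avoids the broadcast failure shown above.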
[jira] [Assigned] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5556: --- Assignee: Pedro Rodriguez (was: Apache Spark) Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5556: --- Assignee: Apache Spark (was: Pedro Rodriguez) Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391636#comment-14391636 ] Tathagata Das commented on SPARK-6646: -- I vehemently disagree. I don't think we should choose names that subtly indicate Spark runs only on the iPhone. That is frankly not true. We want to embrace all platforms without any bias. Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391683#comment-14391683 ] Venkat Krishnamurthy commented on SPARK-6646: - I'm looking forward to the release that targets smart watches. It could have the pleasant side effect of making time stand still while executors crunch away in the background, obviating any need for performance tuning. Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6660) MLLibPythonAPI.pythonToJava doesn't recognize object arrays
Xiangrui Meng created SPARK-6660: Summary: MLLibPythonAPI.pythonToJava doesn't recognize object arrays Key: SPARK-6660 URL: https://issues.apache.org/jira/browse/SPARK-6660 Project: Spark Issue Type: Bug Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical
{code}
points = MLUtils.loadLabeledPoints(sc, ...)
_to_java_object_rdd(points).count()
{code}
throws the following exception:
{code}
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-22-5b481e99a111> in <module>()
----> 1 jrdd.count()

/home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539
    540         for temp_arg in temp_args:

/home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling o510.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 18 in stage 114.0 failed 4 times, most recent failure: Lost task 18.3 in stage 114.0 (TID 1133, ip-10-0-130-35.us-west-2.compute.internal): java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList
    at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1090)
    at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1087)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1472)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1006)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1006)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1497)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1497)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
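The failing cast is visible in the trace: SerDe.pythonToJava assumes every unpickled batch is a java.util.ArrayList, but these batches unpickle as Java object arrays ([Ljava.lang.Object;). A plain-Python analogy of the bug and of the defensive fix; this is an illustration only, not Spark's actual SerDe code:
{code}
def count_batch_fragile(batch):
    # Mirrors the hard ArrayList cast: only one concrete container works.
    if not isinstance(batch, list):
        raise TypeError("expected list, got %s" % type(batch).__name__)
    return len(batch)

def count_batch_robust(batch):
    # Accept any sequence (list, tuple, array) and normalize it first,
    # the same shape of fix as teaching SerDe to recognize object arrays.
    return len(list(batch))

print(count_batch_fragile([1, 2, 3]))   # fine: a list
print(count_batch_robust((1, 2, 3)))    # fine: tuple normalized
try:
    count_batch_fragile((1, 2, 3))      # fails, like the ClassCastException
except TypeError as e:
    print("fragile path failed:", e)
{code}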
[jira] [Resolved] (SPARK-6578) Outbound channel in network library is not thread-safe, can lead to fetch failures
[ https://issues.apache.org/jira/browse/SPARK-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-6578. Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Outbound channel in network library is not thread-safe, can lead to fetch failures -- Key: SPARK-6578 URL: https://issues.apache.org/jira/browse/SPARK-6578 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Blocker Fix For: 1.3.1, 1.4.0 There is a very narrow race in the outbound channel of the network library. While netty guarantees that the inbound channel is thread-safe, the same is not true for the outbound channel: multiple threads can be writing and running the pipeline at the same time. This leads to an issue with MessageEncoder and the optimization it performs for zero-copy of file data: since a single RPC can be broken into multiple buffers (for example, when replying to a chunk request), if you have multiple threads writing these RPCs then they can be mixed up in the final socket. That breaks framing and will cause the receiving side to not understand the messages. Patch coming up shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
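The race described here is easy to demonstrate outside netty. A toy Python sketch of the framing hazard, not the network library's code: two writer threads each emit one logical message as several buffers; without serialized writes the buffers can interleave on the shared stream and framing breaks, which is why the fix funnels all outbound writes through a single writer.
{code}
import io
import threading

stream = io.BytesIO()      # stands in for the shared outbound socket
lock = threading.Lock()

def send_rpc(tag, serialize):
    # One logical RPC written as several buffers (header, body, trailer),
    # mirroring how MessageEncoder splits a message for zero-copy transfer.
    parts = [b"<" + tag, b"payload", tag + b">"]
    if serialize:
        with lock:                 # one writer at a time: frames stay intact
            for p in parts:
                stream.write(p)
    else:
        for p in parts:            # unsynchronized: buffers from different
            stream.write(p)        # RPCs may interleave, corrupting framing

def run(serialize, writers=20):
    stream.seek(0)
    stream.truncate()
    threads = [threading.Thread(target=send_rpc, args=(tag, serialize))
               for tag in (b"A", b"B") for _ in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stream.getvalue()

print(run(serialize=True)[:48])    # always clean <A...A> / <B...B> frames
print(run(serialize=False)[:48])   # may show interleaved, unparseable frames
{code}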