[jira] [Updated] (SPARK-6391) Update Tachyon version compatibility documentation

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6391:
-
Target Version/s: 1.4.0
   Fix Version/s: (was: 1.4.0)

[~haoyuan] We set Fix Version when the issue is Resolved. At best, set Target 
Version.

 Update Tachyon version compatibility documentation
 --

 Key: SPARK-6391
 URL: https://issues.apache.org/jira/browse/SPARK-6391
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.3.0
Reporter: Calvin Jia

 Tachyon v0.6 has an API change in the client, so it would be helpful to document 
 the Tachyon-Spark compatibility across versions.






[jira] [Updated] (SPARK-6391) Update Tachyon version compatibility documentation

2015-03-28 Thread Haoyuan Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haoyuan Li updated SPARK-6391:
--
Fix Version/s: 1.4.0

 Update Tachyon version compatibility documentation
 --

 Key: SPARK-6391
 URL: https://issues.apache.org/jira/browse/SPARK-6391
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.3.0
Reporter: Calvin Jia
 Fix For: 1.4.0


 Tachyon v0.6 has an API change in the client, so it would be helpful to document 
 the Tachyon-Spark compatibility across versions.






[jira] [Commented] (SPARK-6299) ClassNotFoundException in standalone mode when running groupByKey with class defined in REPL.

2015-03-28 Thread Chip Senkbeil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385516#comment-14385516
 ] 

Chip Senkbeil commented on SPARK-6299:
--

FYI, we had the same issue on Mesos with 1.2.1 when the class was defined 
through the REPL, so it is not limited to standalone mode.

 ClassNotFoundException in standalone mode when running groupByKey with class 
 defined in REPL.
 -

 Key: SPARK-6299
 URL: https://issues.apache.org/jira/browse/SPARK-6299
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.2.1, 1.3.0
Reporter: Kevin (Sangwoo) Kim
Assignee: Kevin (Sangwoo) Kim
 Fix For: 1.3.1, 1.4.0


 Anyone can reproduce this issue with the code below
 (it runs fine in local mode but throws an exception on clusters; it also runs fine in Spark 1.1.1):
 {code}
 case class ClassA(value: String)
 val rdd = sc.parallelize(List(("k1", ClassA("v1")), ("k1", ClassA("v2"))))
 rdd.groupByKey.collect
 {code}
 {code}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 162 
 in stage 1.0 failed 4 times, most recent failure: Lost task 162.3 in stage 
 1.0 (TID 1027, ip-172-16-182-27.ap-northeast-1.compute.internal): 
 java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$UserRelationshipRow
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:274)
 at 
 org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
 at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
 at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
 at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
 at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
 at 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
 at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
 at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:91)
 at 
 org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44)
 at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

[jira] [Commented] (SPARK-6391) Update Tachyon version compatibility documentation

2015-03-28 Thread Haoyuan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385584#comment-14385584
 ] 

Haoyuan Li commented on SPARK-6391:
---

Thanks [~sowen].

 Update Tachyon version compatibility documentation
 --

 Key: SPARK-6391
 URL: https://issues.apache.org/jira/browse/SPARK-6391
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.3.0
Reporter: Calvin Jia

 Tachyon v0.6 has an API change in the client, so it would be helpful to document 
 the Tachyon-Spark compatibility across versions.






[jira] [Created] (SPARK-6589) SQLUserDefinedType failed in spark-shell

2015-03-28 Thread Benyi Wang (JIRA)
Benyi Wang created SPARK-6589:
-

 Summary: SQLUserDefinedType failed in spark-shell
 Key: SPARK-6589
 URL: https://issues.apache.org/jira/browse/SPARK-6589
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
 Environment: CDH 5.3.2
Reporter: Benyi Wang


{{DataType.fromJson}} will fail in spark-shell if the schema includes a UDT. It 
works when running in an application. 

As a result, I cannot read a Parquet file that includes a UDT field. 
{{DataType.fromCaseClass}} does not support UDT.

I can load the class, which shows that my UDT is on the classpath.
{code}
scala> Class.forName("com.bwang.MyTestUDT")
res6: Class[_] = class com.bwang.MyTestUDT
{code}

But DataType fails:
{code}
scala> DataType.fromJson(json)

java.lang.ClassNotFoundException: com.bwang.MyTestUDT
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at 
org.apache.spark.sql.catalyst.types.DataType$.parseDataType(dataTypes.scala:77)
{code}

The reason is that {{DataType.fromJson}} tries to load {{udtClass}} using this code:
{code}
case JSortedObject(
    ("class", JString(udtClass)),
    ("pyClass", _),
    ("sqlType", _),
    ("type", JString("udt"))) =>
  Class.forName(udtClass).newInstance().asInstanceOf[UserDefinedType[_]]
  }
{code}

Unfortunately, my UDT is loaded by {{SparkIMain$TranslatingClassLoader}}, but 
DataType is loaded by {{Launcher$AppClassLoader}}.

{code}
scala> DataType.getClass.getClassLoader
res2: ClassLoader = sun.misc.Launcher$AppClassLoader@6876fb1b

scala> this.getClass.getClassLoader
res3: ClassLoader = 
org.apache.spark.repl.SparkIMain$TranslatingClassLoader@63d36b29
{code}






[jira] [Assigned] (SPARK-5124) Standardize internal RPC interface

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5124:
---

Assignee: Shixiong Zhu  (was: Apache Spark)

 Standardize internal RPC interface
 --

 Key: SPARK-5124
 URL: https://issues.apache.org/jira/browse/SPARK-5124
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Shixiong Zhu
 Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf


 In Spark we use Akka as the RPC layer. It would be great if we could 
 standardize the internal RPC interface to facilitate testing. This will also 
 provide the foundation for trying other RPC implementations in the future.
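For illustration only, a minimal sketch of what such a standardized interface could look like; the trait and method names below are hypothetical and are not taken from the attached design drafts:
{code}
// Hypothetical sketch of a standardized RPC abstraction (names are illustrative only).
trait RpcEndpoint {
  def receive(message: Any): Unit          // handle fire-and-forget messages
  def receiveAndReply(message: Any): Any   // handle request/response messages
}

trait RpcEnv {
  def setupEndpoint(name: String, endpoint: RpcEndpoint): Unit  // register an endpoint
  def send(name: String, message: Any): Unit                    // deliver a one-way message
}
{code}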






[jira] [Assigned] (SPARK-5124) Standardize internal RPC interface

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5124:
---

Assignee: Apache Spark  (was: Shixiong Zhu)

 Standardize internal RPC interface
 --

 Key: SPARK-5124
 URL: https://issues.apache.org/jira/browse/SPARK-5124
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Apache Spark
 Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf


 In Spark we use Akka as the RPC layer. It would be great if we could 
 standardize the internal RPC interface to facilitate testing. This will also 
 provide the foundation for trying other RPC implementations in the future.






[jira] [Assigned] (SPARK-5494) SparkSqlSerializer Ignores KryoRegistrators

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5494:
---

Assignee: Apache Spark

 SparkSqlSerializer Ignores KryoRegistrators
 ---

 Key: SPARK-5494
 URL: https://issues.apache.org/jira/browse/SPARK-5494
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Hamel Ajay Kothari
Assignee: Apache Spark

 We should make SparkSqlSerializer call {{super.newKryo}} before doing any of 
 its custom setup in order to make sure it picks up custom 
 KryoRegistrators.
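For illustration, a hedged sketch of the requested change (this is not the actual SparkSqlSerializer source, and the extra registration shown is only an example):
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

// Sketch: obtain the Kryo instance from super.newKryo() so any user-supplied
// spark.kryo.registrator is applied before SQL-specific registrations are added.
class SqlKryoSerializer(conf: SparkConf) extends KryoSerializer(conf) {
  override def newKryo(): Kryo = {
    val kryo = super.newKryo()                     // honors custom KryoRegistrators
    kryo.register(classOf[java.math.BigDecimal])   // example of an additional registration
    kryo
  }
}
{code}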






[jira] [Assigned] (SPARK-5494) SparkSqlSerializer Ignores KryoRegistrators

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5494:
---

Assignee: (was: Apache Spark)

 SparkSqlSerializer Ignores KryoRegistrators
 ---

 Key: SPARK-5494
 URL: https://issues.apache.org/jira/browse/SPARK-5494
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Hamel Ajay Kothari

 We should make SparkSqlSerializer call {{super.newKryo}} before doing any of 
 its custom setup in order to make sure it picks up custom 
 KryoRegistrators.






[jira] [Updated] (SPARK-5946) Add Python API for Kafka direct stream

2015-03-28 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-5946:
-
Target Version/s: 1.4.0

 Add Python API for Kafka direct stream
 --

 Key: SPARK-5946
 URL: https://issues.apache.org/jira/browse/SPARK-5946
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Streaming
Affects Versions: 1.3.0
Reporter: Saisai Shao

 Add the Python API for the Kafka direct stream. Currently this only adds the 
 {{createDirectStream}} API, not the {{createRDD}} API, since that needs some Python 
 wrappers around Java objects; it will be improved according to the comments.






[jira] [Assigned] (SPARK-6369) InsertIntoHiveTable should use logic from SparkHadoopWriter

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6369:
---

Assignee: Apache Spark  (was: Cheng Lian)

 InsertIntoHiveTable should use logic from SparkHadoopWriter
 ---

 Key: SPARK-6369
 URL: https://issues.apache.org/jira/browse/SPARK-6369
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Apache Spark
Priority: Blocker

 Right now it is possible that we will corrupt the output if there is a race 
 between competing speculative tasks.






[jira] [Assigned] (SPARK-6369) InsertIntoHiveTable should use logic from SparkHadoopWriter

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6369:
---

Assignee: Cheng Lian  (was: Apache Spark)

 InsertIntoHiveTable should use logic from SparkHadoopWriter
 ---

 Key: SPARK-6369
 URL: https://issues.apache.org/jira/browse/SPARK-6369
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Blocker

 Right now it is possible that we will corrupt the output if there is a race 
 between competing speculative tasks.






[jira] [Commented] (SPARK-6592) API of Row trait should be presented in Scala doc

2015-03-28 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385616#comment-14385616
 ] 

Nan Zhu commented on SPARK-6592:


also cc: [~lian cheng] [~marmbrus]

 API of Row trait should be presented in Scala doc
 -

 Key: SPARK-6592
 URL: https://issues.apache.org/jira/browse/SPARK-6592
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Nan Zhu

 Currently, the API of the Row trait is not presented in the Scaladoc, even though 
 it is used frequently. 
 The reason is that we ignore all files under catalyst in 
 SparkBuild.scala when generating the Scaladoc 
 (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369).
 What's the best approach to fix this? [~rxin]






[jira] [Commented] (SPARK-6592) API of Row trait should be presented in Scala doc

2015-03-28 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385617#comment-14385617
 ] 

Reynold Xin commented on SPARK-6592:


Can you try changing that line to 

spark/sql/catalyst?

Then it should filter out only the catalyst package, but not the catalyst 
module.
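For illustration, a hedged sketch of the kind of filter change being suggested; the surrounding SparkBuild.scala code is abridged and the exact expression in Spark may differ:
{code}
// Sketch only: exclude the catalyst *package* sources from unidoc, not the whole module.
unidocAllSources in (ScalaUnidoc, unidoc) := {
  (unidocAllSources in (ScalaUnidoc, unidoc)).value
    .map(_.filterNot(_.getCanonicalPath.contains("spark/sql/catalyst")))
}
{code}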


 API of Row trait should be presented in Scala doc
 -

 Key: SPARK-6592
 URL: https://issues.apache.org/jira/browse/SPARK-6592
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Nan Zhu

 Currently, the API of the Row trait is not presented in the Scaladoc, even though 
 it is used frequently. 
 The reason is that we ignore all files under catalyst in 
 SparkBuild.scala when generating the Scaladoc 
 (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369).
 What's the best approach to fix this? [~rxin]






[jira] [Created] (SPARK-6591) Python data source load options should auto convert common types into strings

2015-03-28 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-6591:
--

 Summary: Python data source load options should auto convert 
common types into strings
 Key: SPARK-6591
 URL: https://issues.apache.org/jira/browse/SPARK-6591
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Reporter: Reynold Xin
Assignee: Davies Liu


See the discussion at : https://github.com/databricks/spark-csv/pull/39

If the caller invokes
{code}
sqlContext.load("com.databricks.spark.csv", path="cars.csv", header=True)
{code}

We should automatically turn header into "true" in string form.

We should do this for booleans and numeric values.

cc [~yhuai]






[jira] [Updated] (SPARK-6591) Python data source load options should auto convert common types into strings

2015-03-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6591:
---
Labels: DataFrame DataSource  (was: )

 Python data source load options should auto convert common types into strings
 -

 Key: SPARK-6591
 URL: https://issues.apache.org/jira/browse/SPARK-6591
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Reporter: Reynold Xin
Assignee: Davies Liu
  Labels: DataFrame, DataSource

 See the discussion at : https://github.com/databricks/spark-csv/pull/39
 If the caller invokes
 {code}
 sqlContext.load("com.databricks.spark.csv", path="cars.csv", header=True)
 {code}
 We should automatically turn header into "true" in string form.
 We should do this for booleans and numeric values.
 cc [~yhuai]






[jira] [Created] (SPARK-6592) API of Row trait should be presented in Scala doc

2015-03-28 Thread Nan Zhu (JIRA)
Nan Zhu created SPARK-6592:
--

 Summary: API of Row trait should be presented in Scala doc
 Key: SPARK-6592
 URL: https://issues.apache.org/jira/browse/SPARK-6592
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Nan Zhu


Currently, the API of the Row trait is not presented in the Scaladoc, even though 
it is used frequently.

The reason is that we ignore all files under catalyst in 
SparkBuild.scala when generating the Scaladoc 
(https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369).

What's the best approach to fix this? [~rxin]






[jira] [Commented] (SPARK-2973) Use LocalRelation for all ExecutedCommands, avoid job for take/collect()

2015-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385606#comment-14385606
 ] 

Apache Spark commented on SPARK-2973:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/5247

 Use LocalRelation for all ExecutedCommands, avoid job for take/collect()
 

 Key: SPARK-2973
 URL: https://issues.apache.org/jira/browse/SPARK-2973
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Aaron Davidson
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.2.0


 Right now, sql("show tables").collect() will start a Spark job, which shows up 
 in the UI. There should be a way to get these results without that step.






[jira] [Updated] (SPARK-6575) Add configuration to disable schema merging while converting metastore Parquet tables

2015-03-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6575:
--
Description: 
Consider a metastore Parquet table that
# doesn't have schema evolution issue
# has lots of data files and/or partitions

In this case, driver-side schema merging can be both slow and unnecessary. It would be 
good to have a configuration to let the user disable schema merging when 
converting such a metastore Parquet table.

  was:
Consider a metastore Parquet table that
# doesn't have schema evolution issue
# has lots of data files and/or partitions

In this case, driver schema merging can be both slow and unnecessary. Would be 
good to have a configuration to let the use disable schema merging when 
coverting such a metastore Parquet table.


 Add configuration to disable schema merging while converting metastore 
 Parquet tables
 -

 Key: SPARK-6575
 URL: https://issues.apache.org/jira/browse/SPARK-6575
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian

 Consider a metastore Parquet table that
 # doesn't have schema evolution issue
 # has lots of data files and/or partitions
 In this case, driver-side schema merging can be both slow and unnecessary. It would 
 be good to have a configuration to let the user disable schema merging when 
 converting such a metastore Parquet table.
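For illustration, a hedged usage sketch; the configuration key shown is an assumed name for whatever this issue introduces, so check the merged patch for the final key and default:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Assumed key name -- illustrative only; "false" would skip driver-side schema merging
// when converting a metastore Parquet table.
val sc = new SparkContext(new SparkConf().setAppName("no-schema-merging").setMaster("local[*]"))
val sqlContext = new HiveContext(sc)
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet.mergeSchema", "false")
{code}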






[jira] [Updated] (SPARK-6590) Make DataFrame.where accept a string conditionExpr

2015-03-28 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6590:

Priority: Minor  (was: Major)

 Make DataFrame.where accept a string conditionExpr
 --

 Key: SPARK-6590
 URL: https://issues.apache.org/jira/browse/SPARK-6590
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Minor

 In our docs, we say that where is an alias of filter. However, where does not 
 support a string conditionExpr.
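For illustration, a hedged sketch of the intended behavior; {{people}} is an assumed DataFrame, and the two calls should be equivalent once where accepts a string conditionExpr:
{code}
// `people` is an assumed DataFrame with an `age` column (illustrative only).
val adults1 = people.filter("age > 21")   // already supported in 1.3
val adults2 = people.where("age > 21")    // what this issue adds; delegates to filter
{code}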






[jira] [Created] (SPARK-6590) Make DataFrame.where accept a string conditionExpr

2015-03-28 Thread Yin Huai (JIRA)
Yin Huai created SPARK-6590:
---

 Summary: Make DataFrame.where accept a string conditionExpr
 Key: SPARK-6590
 URL: https://issues.apache.org/jira/browse/SPARK-6590
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yin Huai
Assignee: Yin Huai


In our docs, we say that where is an alias of filter. However, where does not support 
a string conditionExpr.






[jira] [Commented] (SPARK-6589) SQLUserDefinedType failed in spark-shell

2015-03-28 Thread Benyi Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385499#comment-14385499
 ] 

Benyi Wang commented on SPARK-6589:
---

I found a workaround for this issue, but I still think DataType should use a 
better way to find the correct class loader.
{code}
# put the UDT jar on SPARK_CLASSPATH so that Launcher$AppClassLoader can find it
export SPARK_CLASSPATH=myUDT.jar

spark-shell --jars myUDT.jar ...
{code}
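For illustration, a hedged sketch of the direction suggested above (not the actual Spark fix): resolve the UDT class through the thread context class loader instead of relying on Class.forName's caller class loader:
{code}
// Sketch only: prefer the current thread's context class loader when one is set,
// and fall back to this class's own loader otherwise.
def loadUdtClass(udtClass: String): Class[_] = {
  val loader = Option(Thread.currentThread().getContextClassLoader)
    .getOrElse(getClass.getClassLoader)
  Class.forName(udtClass, true, loader)
}
{code}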

 SQLUserDefinedType failed in spark-shell
 

 Key: SPARK-6589
 URL: https://issues.apache.org/jira/browse/SPARK-6589
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
 Environment: CDH 5.3.2
Reporter: Benyi Wang

 {{DataType.fromJson}} will fail in spark-shell if the schema includes a UDT. 
 It works when running in an application. 
 As a result, I cannot read a Parquet file that includes a UDT field. 
 {{DataType.fromCaseClass}} does not support UDT.
 I can load the class, which shows that my UDT is on the classpath.
 {code}
 scala> Class.forName("com.bwang.MyTestUDT")
 res6: Class[_] = class com.bwang.MyTestUDT
 {code}
 But DataType fails:
 {code}
 scala> DataType.fromJson(json)
 java.lang.ClassNotFoundException: com.bwang.MyTestUDT
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:190)
 at 
 org.apache.spark.sql.catalyst.types.DataType$.parseDataType(dataTypes.scala:77)
 {code}
 The reason is that {{DataType.fromJson}} tries to load {{udtClass}} using this code:
 {code}
 case JSortedObject(
     ("class", JString(udtClass)),
     ("pyClass", _),
     ("sqlType", _),
     ("type", JString("udt"))) =>
   Class.forName(udtClass).newInstance().asInstanceOf[UserDefinedType[_]]
   }
 {code}
 Unfortunately, my UDT is loaded by {{SparkIMain$TranslatingClassLoader}}, but 
 DataType is loaded by {{Launcher$AppClassLoader}}.
 {code}
 scala> DataType.getClass.getClassLoader
 res2: ClassLoader = sun.misc.Launcher$AppClassLoader@6876fb1b
 scala> this.getClass.getClassLoader
 res3: ClassLoader = 
 org.apache.spark.repl.SparkIMain$TranslatingClassLoader@63d36b29
 {code}






[jira] [Created] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame

2015-03-28 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-6586:
--

 Summary: Add the capability of retrieving original logical plan of 
DataFrame
 Key: SPARK-6586
 URL: https://issues.apache.org/jira/browse/SPARK-6586
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor


In order to fix a bug, since #5217 {{DataFrame}} uses the analyzed plan 
instead of the logical plan. However, by doing that we can't know the logical plan 
of a {{DataFrame}}, and it might still be useful and important to retrieve the 
original logical plan in some use cases.

In this PR, we introduce the capability of retrieving the original logical plan of a 
{{DataFrame}}.

The approach is to add an {{analyzed}} variable to {{LogicalPlan}}. Once the 
{{Analyzer}} finishes analysis, it sets {{analyzed}} of the {{LogicalPlan}} to 
{{true}}. In {{QueryExecution}}, we keep the original logical plan inside the 
analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to 
recursively replace the analyzed logical plan with the original logical plan and 
retrieve it.

Besides the capability of retrieving the original logical plan, this modification 
also avoids re-analyzing a plan that is already analyzed.
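For illustration, a simplified, self-contained model of the idea described above (this is not Catalyst code; the class names are hypothetical):
{code}
// Each plan node carries an `analyzed` flag; the analyzed node keeps a link to the plan
// it was derived from, and originalPlan recursively unwraps back to the pre-analysis plan.
sealed trait Plan { def analyzed: Boolean }

case class ParsedPlan(sql: String) extends Plan {
  val analyzed = false
}

case class AnalyzedPlan(child: Plan) extends Plan {
  val analyzed = true
  def originalPlan: Plan = child match {
    case a: AnalyzedPlan => a.originalPlan   // skip over nested analyzed wrappers
    case p               => p
  }
}

def analyze(plan: Plan): Plan =
  if (plan.analyzed) plan        // avoid re-analyzing an already analyzed plan
  else AnalyzedPlan(plan)
{code}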
 







[jira] [Commented] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame

2015-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385261#comment-14385261
 ] 

Apache Spark commented on SPARK-6586:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/5241

 Add the capability of retrieving original logical plan of DataFrame
 ---

 Key: SPARK-6586
 URL: https://issues.apache.org/jira/browse/SPARK-6586
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor

 In order to fix a bug, since #5217 {{DataFrame}} uses the analyzed plan 
 instead of the logical plan. However, by doing that we can't know the logical 
 plan of a {{DataFrame}}, and it might still be useful and important to 
 retrieve the original logical plan in some use cases.
 In this PR, we introduce the capability of retrieving the original logical plan 
 of a {{DataFrame}}.
 The approach is to add an {{analyzed}} variable to {{LogicalPlan}}. Once the 
 {{Analyzer}} finishes analysis, it sets {{analyzed}} of the {{LogicalPlan}} to 
 {{true}}. In {{QueryExecution}}, we keep the original logical plan inside the 
 analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to 
 recursively replace the analyzed logical plan with the original logical plan and 
 retrieve it.
 Besides the capability of retrieving the original logical plan, this modification 
 also avoids re-analyzing a plan that is already analyzed.
  






[jira] [Assigned] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6586:
---

Assignee: Apache Spark

 Add the capability of retrieving original logical plan of DataFrame
 ---

 Key: SPARK-6586
 URL: https://issues.apache.org/jira/browse/SPARK-6586
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Assignee: Apache Spark
Priority: Minor

 In order to fix a bug, since #5217 {{DataFrame}} uses the analyzed plan 
 instead of the logical plan. However, by doing that we can't know the logical 
 plan of a {{DataFrame}}, and it might still be useful and important to 
 retrieve the original logical plan in some use cases.
 In this PR, we introduce the capability of retrieving the original logical plan 
 of a {{DataFrame}}.
 The approach is to add an {{analyzed}} variable to {{LogicalPlan}}. Once the 
 {{Analyzer}} finishes analysis, it sets {{analyzed}} of the {{LogicalPlan}} to 
 {{true}}. In {{QueryExecution}}, we keep the original logical plan inside the 
 analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to 
 recursively replace the analyzed logical plan with the original logical plan and 
 retrieve it.
 Besides the capability of retrieving the original logical plan, this modification 
 also avoids re-analyzing a plan that is already analyzed.
  






[jira] [Resolved] (SPARK-4941) Yarn cluster mode does not upload all needed jars to driver node (Spark 1.2.0)

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4941.
--
Resolution: Cannot Reproduce

OK, we can reopen this if typos etc. are ruled out and it is reproducible against 
at least 1.3.0.

 Yarn cluster mode does not upload all needed jars to driver node (Spark 1.2.0)
 --

 Key: SPARK-4941
 URL: https://issues.apache.org/jira/browse/SPARK-4941
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Gurpreet Singh

 I am specifying additional jars and a config XML file with the --jars and --files 
 options to be uploaded to the driver in the following spark-submit command. 
 However, they are not getting uploaded.
 This results in the job failing. It was working with the Spark 1.0.2 build.
 Spark build being used: spark-1.2.0.tgz
 
 $SPARK_HOME/bin/spark-submit \
 --class com.ebay.inc.scala.testScalaXML \
 --driver-class-path 
 /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1--2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop/share/hadoop/common/lib/hadoop--0.1--2.jar:/apache/hive/lib/mysql-connector-java-5.0.8-bin.jar:/apache/hadoop/share/hadoop/common/lib/guava-11.0.2.jar
  \
 --master yarn \
 --deploy-mode cluster \
 --num-executors 3 \
 --driver-memory 1G  \
 --executor-memory 1G \
 /export/home/b_incdata_rw/gurpreetsingh/jar/testscalaxml_2.11-1.0.jar 
 /export/home/b_incdata_rw/gurpreetsingh/sqlFramework.xml next_gen_linking \
 --queue hdmi-spark \
 --jars 
 /export/home/b_incdata_rw/gurpreetsingh/jar/datanucleus-api-jdo-3.2.1.jar,/export/home/b_incdata_rw/gurpreetsingh/jar/datanucleus-core-3.2.2.jar,/export/home/b_incdata_rw/gurpreetsingh/jar/datanucleus-rdbms-3.2.1.jar,/apache/hive/lib/mysql-connector-java-5.0.8-bin.jar,/apache/hadoop/share/hadoop/common/lib/hadoop--0.1--2.jar,/apache/hadoop/share/hadoop/common/lib/hadoop-lzo-0.6.0.jar,/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1--2.jar\
 --files 
 /export/home/b_incdata_rw/gurpreetsingh/spark-1.0.2-bin-2.4.1/conf/hive-site.xml
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath
 14/12/22 23:00:17 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
 to rm2
 14/12/22 23:00:17 INFO yarn.Client: Requesting a new application from cluster 
 with 2026 NodeManagers
 14/12/22 23:00:17 INFO yarn.Client: Verifying our application has not 
 requested more than the maximum memory capability of the cluster (16384 MB 
 per container)
 14/12/22 23:00:17 INFO yarn.Client: Will allocate AM container, with 1408 MB 
 memory including 384 MB overhead
 14/12/22 23:00:17 INFO yarn.Client: Setting up container launch context for 
 our AM
 14/12/22 23:00:17 INFO yarn.Client: Preparing resources for our AM container
 14/12/22 23:00:18 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 14/12/22 23:00:18 WARN hdfs.BlockReaderLocal: The short-circuit local reads 
 feature cannot be used because libhadoop cannot be loaded.
 14/12/22 23:00:21 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 
 6623380 for b_incdata_rw on 10.115.201.75:8020
 14/12/22 23:00:21 INFO yarn.Client: 
 Uploading resource 
 file:/home/b_incdata_rw/gurpreetsingh/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar
  - 
 hdfs://-nn.vip.xxx.com:8020/user/b_incdata_rw/.sparkStaging/application_1419242629195_8432/spark-assembly-1.2.0-hadoop2.4.0.jar
 14/12/22 23:00:24 INFO yarn.Client: Uploading resource 
 file:/export/home/b_incdata_rw/gurpreetsingh/jar/firstsparkcode_2.11-1.0.jar 
 - 
 hdfs://-nn.vip.xxx.com:8020:8020/user/b_incdata_rw/.sparkStaging/application_1419242629195_8432/firstsparkcode_2.11-1.0.jar
 14/12/22 23:00:25 INFO yarn.Client: Setting up the launch environment for our 
 AM container






[jira] [Resolved] (SPARK-6552) expose start-slave.sh to user and update outdated doc

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6552.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5205
[https://github.com/apache/spark/pull/5205]

 expose start-slave.sh to user and update outdated doc
 -

 Key: SPARK-6552
 URL: https://issues.apache.org/jira/browse/SPARK-6552
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Documentation
Reporter: Tao Wang
Priority: Minor
 Fix For: 1.4.0


 It would be better to expose start-slave.sh to users to allow starting a 
 worker on a single node.
 Since the documentation describes starting a worker in the foreground, I 
 also changed it to the background way (using start-slave.sh).






[jira] [Commented] (SPARK-6571) MatrixFactorizationModel created by load fails on predictAll

2015-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385414#comment-14385414
 ] 

Apache Spark commented on SPARK-6571:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/5243

 MatrixFactorizationModel created by load fails on predictAll
 

 Key: SPARK-6571
 URL: https://issues.apache.org/jira/browse/SPARK-6571
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Charles Hayden
Assignee: Xiangrui Meng

 This code, adapted from the documentation, fails when using a loaded model.
 from pyspark.mllib.recommendation import ALS, Rating, MatrixFactorizationModel
 r1 = (1, 1, 1.0)
 r2 = (1, 2, 2.0)
 r3 = (2, 1, 2.0)
 ratings = sc.parallelize([r1, r2, r3])
 model = ALS.trainImplicit(ratings, 1, seed=10)
 print '(2, 2)', model.predict(2, 2)
 #0.43...
 testset = sc.parallelize([(1, 2), (1, 1)])
 print 'all', model.predictAll(testset).collect()
 #[Rating(user=1, product=1, rating=1.0...), Rating(user=1, product=2, 
 rating=1.9...)]
 import os, tempfile
 path = tempfile.mkdtemp()
 model.save(sc, path)
 sameModel = MatrixFactorizationModel.load(sc, path)
 print '(2, 2)', sameModel.predict(2,2)
 sameModel.predictAll(testset).collect()
 This gives
 (2, 2) 0.443547642944
 all [Rating(user=1, product=1, rating=1.1538351103381217), Rating(user=1, 
 product=2, rating=0.7153473708381739)]
 (2, 2) 0.443547642944
 ---------------------------------------------------------------------------
 Py4JError                                 Traceback (most recent call last)
 <ipython-input-18-af6612bed9d0> in <module>()
      19 sameModel = MatrixFactorizationModel.load(sc, path)
      20 print '(2, 2)', sameModel.predict(2,2)
 ---> 21 sameModel.predictAll(testset).collect()
      22 
 /home/ubuntu/spark/python/pyspark/mllib/recommendation.pyc in predictAll(self, user_product)
     104         assert len(first) == 2, "user_product should be RDD of (user, product)"
     105         user_product = user_product.map(lambda (u, p): (int(u), int(p)))
 --> 106         return self.call("predict", user_product)
     107 
     108     def userFeatures(self):
 /home/ubuntu/spark/python/pyspark/mllib/common.pyc in call(self, name, *a)
     134     def call(self, name, *a):
     135         """Call method of java_model"""
 --> 136         return callJavaFunc(self._sc, getattr(self._java_model, name), *a)
     137 
     138 
 /home/ubuntu/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func, *args)
     111     """ Call Java Function """
     112     args = [_py2java(sc, a) for a in args]
 --> 113     return _java2py(sc, func(*args))
     114 
     115 
 /home/ubuntu/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
     536         answer = self.gateway_client.send_command(command)
     537         return_value = get_return_value(answer, self.gateway_client,
 --> 538                 self.target_id, self.name)
     539 
     540         for temp_arg in temp_args:
 /home/ubuntu/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
     302                 raise Py4JError(
     303                     'An error occurred while calling {0}{1}{2}. Trace:\n{3}\n'.
 --> 304                     format(target_id, '.', name, value))
     305             else:
     306                 raise Py4JError(
 Py4JError: An error occurred while calling o450.predict. Trace:
 py4j.Py4JException: Method predict([class org.apache.spark.api.java.JavaRDD]) 
 does not exist
   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
   at py4j.Gateway.invoke(Gateway.java:252)
   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
   at py4j.commands.CallCommand.execute(CallCommand.java:79)
   at py4j.GatewayConnection.run(GatewayConnection.java:207)
   at java.lang.Thread.run(Thread.java:744)






[jira] [Updated] (SPARK-6581) Metadata is missing when saving parquet file using hadoop 1.0.4

2015-03-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6581:
--
Target Version/s: 1.4.0

 Metadata is missing when saving parquet file using hadoop 1.0.4
 ---

 Key: SPARK-6581
 URL: https://issues.apache.org/jira/browse/SPARK-6581
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: hadoop 1.0.4
Reporter: Pei-Lun Lee

 When saving a parquet file with {code}df.save("foo", "parquet"){code}
 it generates only _common_metadata while _metadata is missing:
 {noformat}
 -rwxrwxrwx  1 peilunlee  staff0 Mar 27 11:29 _SUCCESS*
 -rwxrwxrwx  1 peilunlee  staff  250 Mar 27 11:29 _common_metadata*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
 -rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*
 {noformat}
 If saving with {code}df.save("foo", "parquet", SaveMode.Overwrite){code}, both 
 _metadata and _common_metadata are missing:
 {noformat}
 -rwxrwxrwx  1 peilunlee  staff0 Mar 27 11:29 _SUCCESS*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
 -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
 -rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*
 {noformat}






[jira] [Updated] (SPARK-6570) Spark SQL arrays: explode() fails and cannot save array type to Parquet

2015-03-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6570:
--
Target Version/s: 1.4.0

 Spark SQL arrays: explode() fails and cannot save array type to Parquet
 -

 Key: SPARK-6570
 URL: https://issues.apache.org/jira/browse/SPARK-6570
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jon Chase

 {code}
 @Rule
 public TemporaryFolder tmp = new TemporaryFolder();

 @Test
 public void testPercentileWithExplode() throws Exception {
     StructType schema = DataTypes.createStructType(Lists.newArrayList(
             DataTypes.createStructField("col1", DataTypes.StringType, false),
             DataTypes.createStructField("col2s",
                     DataTypes.createArrayType(DataTypes.IntegerType, true), true)
     ));

     JavaRDD<Row> rowRDD = sc.parallelize(Lists.newArrayList(
             RowFactory.create("test", new int[]{1, 2, 3})
     ));

     DataFrame df = sql.createDataFrame(rowRDD, schema);
     df.registerTempTable("df");
     df.printSchema();

     List<int[]> ints = sql.sql("select col2s from df").javaRDD()
             .map(row -> (int[]) row.get(0)).collect();
     assertEquals(1, ints.size());
     assertArrayEquals(new int[]{1, 2, 3}, ints.get(0));

     // fails: lateral view explode does not work:
     // java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
     List<Integer> explodedInts = sql.sql(
             "select col2 from df lateral view explode(col2s) splode as col2").javaRDD()
             .map(row -> row.getInt(0)).collect();
     assertEquals(3, explodedInts.size());
     assertEquals(Lists.newArrayList(1, 2, 3), explodedInts);

     // fails: java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
     df.saveAsParquetFile(tmp.getRoot().getAbsolutePath() + "/parquet");
     DataFrame loadedDf = sql.load(tmp.getRoot().getAbsolutePath() + "/parquet");
     loadedDf.registerTempTable("loadedDf");

     List<int[]> moreInts = sql.sql("select col2s from loadedDf").javaRDD()
             .map(row -> (int[]) row.get(0)).collect();
     assertEquals(1, moreInts.size());
     assertArrayEquals(new int[]{1, 2, 3}, moreInts.get(0));
 }
 {code}
 {code}
 root
  |-- col1: string (nullable = false)
  |-- col2s: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 ERROR org.apache.spark.executor.Executor Exception in task 7.0 in stage 1.0 
 (TID 15)
 java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
   at 
 org.apache.spark.sql.catalyst.expressions.Explode.eval(generators.scala:125) 
 ~[spark-catalyst_2.10-1.3.0.jar:1.3.0]
   at 
 org.apache.spark.sql.execution.Generate$$anonfun$2$$anonfun$apply$1.apply(Generate.scala:70)
  ~[spark-sql_2.10-1.3.0.jar:1.3.0]
   at 
 org.apache.spark.sql.execution.Generate$$anonfun$2$$anonfun$apply$1.apply(Generate.scala:69)
  ~[spark-sql_2.10-1.3.0.jar:1.3.0]
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
 ~[scala-library-2.10.4.jar:na]
 {code}






[jira] [Updated] (SPARK-6529) Word2Vec transformer

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6529:
-
Fix Version/s: (was: 1.4.0)

 Word2Vec transformer
 

 Key: SPARK-6529
 URL: https://issues.apache.org/jira/browse/SPARK-6529
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xusen Yin








[jira] [Updated] (SPARK-6209) ExecutorClassLoader can leak connections after failing to load classes from the REPL class server

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6209:
-
Fix Version/s: (was: 1.3.1)
   (was: 1.4.0)

 ExecutorClassLoader can leak connections after failing to load classes from 
 the REPL class server
 -

 Key: SPARK-6209
 URL: https://issues.apache.org/jira/browse/SPARK-6209
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.3, 1.1.2, 1.2.1, 1.3.0, 1.4.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Critical

 ExecutorClassLoader does not ensure proper cleanup of network connections 
 that it opens.  If it fails to load a class, it may leak partially-consumed 
 InputStreams that are connected to the REPL's HTTP class server, causing that 
 server to exhaust its thread pool, which can cause the entire job to hang.
 Here is a simple reproduction:
 With
 {code}
 ./bin/spark-shell --master local-cluster[8,8,512] 
 {code}
 run the following command:
 {code}
 sc.parallelize(1 to 1000, 1000).map { x =>
   try {
     Class.forName("some.class.that.does.not.Exist")
   } catch {
     case e: Exception => // do nothing
   }
   x
 }.count()
 {code}
 This job will run 253 tasks, then will completely freeze without any errors 
 or failed tasks.
 It looks like the driver has 253 threads blocked in socketRead0() calls:
 {code}
 [joshrosen ~]$ jstack 16765 | grep socketRead0 | wc
  253 759   14674
 {code}
 e.g.
 {code}
 qtp1287429402-13 daemon prio=5 tid=0x7f868a1c nid=0x5b03 runnable 
 [0x0001159bd000]
java.lang.Thread.State: RUNNABLE
 at java.net.SocketInputStream.socketRead0(Native Method)
 at java.net.SocketInputStream.read(SocketInputStream.java:152)
 at java.net.SocketInputStream.read(SocketInputStream.java:122)
 at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
 at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
 at 
 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
 at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1044)
 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280)
 at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
 at 
 org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
 at 
 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
 at 
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
 at 
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
 at java.lang.Thread.run(Thread.java:745) 
 {code}
 Jstack on the executors shows blocking in loadClass / findClass, where a 
 single thread is RUNNABLE and waiting to hear back from the driver and other 
 executor threads are BLOCKED on object monitor synchronization at 
 Class.forName0().
 Remotely triggering a GC on a hanging executor allows the job to progress and 
 complete more tasks before hanging again.  If I repeatedly trigger GC on all 
 of the executors, then the job runs to completion:
 {code}
 jps | grep CoarseGra | cut -d ' ' -f 1 | xargs -I {} -n 1 -P100 jcmd {} GC.run
 {code}
 The culprit is a {{catch}} block that ignores all exceptions and performs no 
 cleanup: 
 https://github.com/apache/spark/blob/v1.2.0/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L94
 This bug has been present since Spark 1.0.0, but I suspect that we haven't 
 seen it before because it's pretty hard to reproduce. Triggering this error 
 requires a job with tasks that trigger ClassNotFoundExceptions yet are still 
 able to run to completion.  It also requires that executors are able to leak 
 enough open connections to exhaust the class server's Jetty thread pool 
 limit, which requires that there are a large number of tasks (253+) and 
 either a large number of executors or a very low amount of GC pressure on 
 those executors (since GC will cause the leaked connections to be closed).
 The fix here is pretty simple: add proper resource cleanup to this class.
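For illustration, a hedged sketch of what "proper resource cleanup" means here (not the actual patch): always close the stream obtained from the class server, even when reading or defining the class fails:
{code}
import java.io.InputStream

// Sketch only: the caller supplies how to open and how to consume the stream; the
// finally block guarantees the underlying HTTP connection is released either way.
def readAndClose[T](open: () => InputStream)(read: InputStream => T): T = {
  val in = open()
  try {
    read(in)        // may throw, e.g. when the class bytes are missing or truncated
  } finally {
    in.close()
  }
}
{code}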






[jira] [Updated] (SPARK-6350) Make mesosExecutorCores configurable in mesos fine-grained mode

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6350:
-
Fix Version/s: (was: 1.3.1)
   (was: 1.4.0)

 Make mesosExecutorCores configurable in mesos fine-grained mode
 -

 Key: SPARK-6350
 URL: https://issues.apache.org/jira/browse/SPARK-6350
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Jongyoul Lee
Assignee: Jongyoul Lee
Priority: Minor

 When Spark runs in Mesos fine-grained mode, the Mesos slave launches an executor 
 with a number of CPUs and an amount of memory. However, the number of executor 
 cores is always CPU_PER_TASKS, the same as spark.task.cpus. If I set that value 
 to 5 for running an intensive task, the Mesos executor always consumes 5 cores 
 even without any running task. This wastes resources. We should make the executor 
 cores a configuration variable.
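For illustration, a hedged usage sketch; the configuration key name below is assumed for the setting this issue proposes:
{code}
import org.apache.spark.SparkConf

// Assumed key name -- illustrative only; 0.5 would reserve half a core for the
// executor itself in fine-grained mode instead of spark.task.cpus cores.
val conf = new SparkConf()
  .setAppName("fine-grained-example")
  .set("spark.mesos.mesosExecutorCores", "0.5")
{code}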






[jira] [Updated] (SPARK-6530) ChiSqSelector transformer

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6530:
-
Fix Version/s: (was: 1.4.0)

 ChiSqSelector transformer
 -

 Key: SPARK-6530
 URL: https://issues.apache.org/jira/browse/SPARK-6530
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xusen Yin








[jira] [Updated] (SPARK-6528) IDF transformer

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6528:
-
Fix Version/s: (was: 1.4.0)

 IDF transformer
 ---

 Key: SPARK-6528
 URL: https://issues.apache.org/jira/browse/SPARK-6528
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xusen Yin








[jira] [Updated] (SPARK-6194) collect() in PySpark will cause memory leak in JVM

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6194:
-
Fix Version/s: (was: 1.3.1)
   (was: 1.4.0)
   (was: 1.2.2)

 collect() in PySpark will cause memory leak in JVM
 --

 Key: SPARK-6194
 URL: https://issues.apache.org/jira/browse/SPARK-6194
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical

 It could be reproduced  by:
 {code}
 for i in range(40):
 sc.parallelize(range(5000), 10).flatMap(lambda i: range(1)).collect()
 {code}
 It will fail after 2 or 3 jobs, but runs completely successfully if I add
 `gc.collect()` after each job.
 We could call _detach() on the JavaList returned by collect
 in Java; I will send out a PR for this.
 Reported by Michael and commented by Josh:
 On Thu, Mar 5, 2015 at 2:39 PM, Josh Rosen joshro...@databricks.com wrote:
  Based on Py4J's Memory Model page
  (http://py4j.sourceforge.net/advanced_topics.html#py4j-memory-model):
 
  Because Java objects on the Python side are involved in a circular
  reference (JavaObject and JavaMember reference each other), these objects
  are not immediately garbage collected once the last reference to the object
  is removed (but they are guaranteed to be eventually collected if the 
  Python
  garbage collector runs before the Python program exits).
 
 
 
  In doubt, users can always call the detach function on the Python gateway
  to explicitly delete a reference on the Java side. A call to gc.collect()
  also usually works.
 
 
  Maybe we should be manually calling detach() when the Python-side has
  finished consuming temporary objects from the JVM.  Do you have a small
  workload / configuration that reproduces the OOM which we can use to test a
  fix?  I don't think that I've seen this issue in the past, but this might be
  because we mistook Java OOMs as being caused by collecting too much data
  rather than due to memory leaks.
 
  On Thu, Mar 5, 2015 at 10:41 AM, Michael Nazario mnaza...@palantir.com
  wrote:
 
  Hi Josh,
 
  I have a question about how PySpark does memory management in the Py4J
  bridge between the Java driver and the Python driver. I was wondering if
  there have been any memory problems in this system because the Python
  garbage collector does not collect circular references immediately and Py4J
  has circular references in each object it receives from Java.
 
  When I dug through the PySpark code, I seemed to find that most RDD
  actions return by calling collect. In collect, you end up calling the Java
  RDD collect and getting an iterator from that. Would this be a possible
  cause for a Java driver OutOfMemoryException because there are resources in
  Java which do not get freed up immediately?
 
  I have also seen that trying to take a lot of values from a dataset twice
  in a row can cause the Java driver to OOM (while just once works). Are 
  there
  some other memory considerations that are relevant in the driver?
 
  Thanks,
  Michael



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6323:
-
Fix Version/s: (was: 1.4.0)

 Large rank matrix factorization with Nonlinear loss and constraints
 ---

 Key: SPARK-6323
 URL: https://issues.apache.org/jira/browse/SPARK-6323
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently ml.recommendation.ALS is optimized for gram matrix generation which 
 scales to modest ranks. The problems that we can solve are in the normal 
 equation/quadratic form: 0.5x'Hx + c'x + g(z)
 g(z) can be one of the constraints from Breeze proximal library:
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala
 In this PR we will re-use the ml.recommendation.ALS design and come up with 
 ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent 
 changes, it's straightforward to do now!
 ALM will be capable of solving the following problems: min f ( x ) + g ( z )
 1. Loss function f ( x ) can be LeastSquareLoss and LoglikelihoodLoss. Most 
 likely we will re-use the Gradient interfaces already defined and implement 
 LoglikelihoodLoss
 2. Constraints g ( z ) supported are the same as above, except that we don't 
 yet support affine + bounds (Aeq x = beq, lb <= x <= ub). Most likely we 
 don't need that for ML applications
 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which 
 in turn uses projection based solver (SPG) or proximal solvers (ADMM) based 
 on convergence speed.
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala
 4. The factors will be SparseVector so that we keep shuffle size in check. 
 For example we will run with 10K ranks but we will force factors to be 
 100-sparse.
 This is closely related to Sparse LDA 
 https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
 are not using graph representation here.
 As we do scaling experiments, we will understand which flow is more suited as 
 ratings get denser (my understanding is that since we already scaled ALS to 2 
 billion ratings and we will keep sparsity in check, the same 2 billion flow 
 will scale to 10K ranks as well)...
 This JIRA is intended to extend the capabilities of ml recommendation to 
 generalized loss function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6582) Support ssl for this AvroSink in Spark Streaming External

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6582:
-
Component/s: Streaming

(Components please)

 Support ssl for this AvroSink in Spark Streaming External
 -

 Key: SPARK-6582
 URL: https://issues.apache.org/jira/browse/SPARK-6582
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: SaintBacchus

 AvroSink already supports *ssl*, so it's better to also support *ssl* in 
 the Spark Streaming external Flume module. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6060) List type missing for catalyst's package.scala

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6060:
-
Fix Version/s: (was: 1.3.0)

 List type missing for catalyst's package.scala
 --

 Key: SPARK-6060
 URL: https://issues.apache.org/jira/browse/SPARK-6060
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: Linux zeno 3.18.5 #1 SMP Sun Feb 1 23:51:17 CET 2015 
 ppc64 GNU/Linux,
 java version 1.7.0_65
 OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-2)
 OpenJDK Zero VM (build 24.65-b04, interpreted mode),
 sbt launcher version 0.13.7
Reporter: Stephan Drescher
Priority: Minor
  Labels: build, error

 Used command line: 
 build/sbt -mem 1024 -Pyarn -Phive -Dhadoop.version=2.4.0 -Pbigtop-dist 
 -DskipTests assembly
 Output:
 [error]  while compiling: 
 /home/spark/Developer/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/types/package.scala
 [error] during phase: jvm
 [error]  library version: version 2.10.4
 [error] compiler version: version 2.10.4
 [error]   reconstructed args: -bootclasspath 
 /usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/resources.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/rt.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/jsse.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/jce.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/charsets.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/rhino.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/jfr.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/classes:/home/spark/.sbt/boot/scala-2.10.4/lib/scala-library.jar
  -deprecation -classpath 
 

[jira] [Updated] (SPARK-5880) Change log level of batch pruning string in InMemoryColumnarTableScan from Info to Debug

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5880:
-
Fix Version/s: (was: 1.3.0)

 Change log level of batch pruning string in InMemoryColumnarTableScan from 
 Info to Debug
 

 Key: SPARK-5880
 URL: https://issues.apache.org/jira/browse/SPARK-5880
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.1, 1.3.0
Reporter: Nitin Goyal
Priority: Trivial

 In InMemoryColumnarTableScan, we build a string of the statistics of all the 
 columns and log it at INFO level whenever batch pruning happens. We take a 
 performance hit when there are a large number of batches and a good number 
 of columns and almost every batch gets pruned.
 We can make the string evaluate lazily and change the log level to DEBUG.
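 A minimal sketch of the lazy-evaluation part (generic Scala, not the actual 
 InMemoryColumnarTableScan patch): take the message as a by-name parameter so 
 the statistics string is only built when DEBUG logging is enabled.
 {code}
 // illustrative only: msg is by-name, so the expensive string is never built
 // unless debug logging is actually on
 class LazyLogger(debugEnabled: Boolean) {
   def logDebug(msg: => String): Unit = if (debugEnabled) println(s"DEBUG: $msg")
 }

 val logger = new LazyLogger(debugEnabled = false)
 def statsString(): String = { Thread.sleep(100); "col1: [min=.., max=..]" }  // stand-in for the per-batch column stats
 logger.logDebug(statsString())   // no cost while DEBUG is disabled
 {code}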



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5684) Key not found exception is thrown in case location of added partition to a parquet table is different than a path containing the partition values

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5684:
-
Fix Version/s: (was: 1.3.0)

 Key not found exception is thrown in case location of added partition to a 
 parquet table is different than a path containing the partition values
 -

 Key: SPARK-5684
 URL: https://issues.apache.org/jira/browse/SPARK-5684
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.1.1, 1.2.0
Reporter: Yash Datta

 Create a partitioned parquet table : 
 create table test_table (dummy string) partitioned by (timestamp bigint) 
 stored as parquet;
 Add a partition to the table and specify a different location:
 alter table test_table add partition (timestamp=9) location 
 '/data/pth/different'
 Run a simple select * query and 
 we get an exception:
 15/02/09 08:27:25 ERROR thriftserver.SparkSQLDriver: Failed in [select * from 
 db4_mi2mi_binsrc1_default limit 5]
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 
 (TID 21, localhost): java
 .util.NoSuchElementException: key not found: timestamp
 at scala.collection.MapLike$class.default(MapLike.scala:228)
 at scala.collection.AbstractMap.default(Map.scala:58)
 at scala.collection.MapLike$class.apply(MapLike.scala:141)
 at scala.collection.AbstractMap.apply(Map.scala:58)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:141)
 at 
 org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:128)
 at 
 org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:247)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 This happens because the parquet path handling assumes that (key=value) patterns 
 are present in the partition location, which is not always the case!
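 A small sketch of the assumption just described (illustrative only, not the 
 ParquetTableOperations code): partition values are recovered by splitting path 
 segments on '=', which yields nothing for a partition added with a custom 
 location, so the later lookup of timestamp fails.
 {code}
 // illustrative: recover partition key/values from a key=value style path
 def partitionValues(path: String): Map[String, String] =
   path.split("/").filter(_.contains("=")).map { seg =>
     val Array(k, v) = seg.split("=", 2)
     k -> v
   }.toMap

 partitionValues("/warehouse/test_table/timestamp=9/part-0.parquet")
 // Map(timestamp -> 9)
 partitionValues("/data/pth/different/part-0.parquet")
 // Map() -- looking up timestamp here throws NoSuchElementException
 {code}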



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4558) History Server waits ~10s before starting up

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4558.
--
Resolution: Duplicate

 History Server waits ~10s before starting up
 

 Key: SPARK-4558
 URL: https://issues.apache.org/jira/browse/SPARK-4558
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or
Priority: Minor

 After you call `sbin/start-history-server.sh`, it waits about 10s before 
 actually starting up. I suspect this is a subtle bug related to log checking.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2253) [Core] Disable partial aggregation automatically when reduction factor is low

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2253:
-
Fix Version/s: (was: 1.3.0)

 [Core] Disable partial aggregation automatically when reduction factor is low
 -

 Key: SPARK-2253
 URL: https://issues.apache.org/jira/browse/SPARK-2253
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin

 Once we have seen enough rows in partial aggregation without observing 
 any reduction, the Aggregator should just turn off partial aggregation. This 
 reduces memory usage for high-cardinality aggregations.
 This one is for Spark core. There is another ticket tracking this for SQL. 
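 A minimal sketch of the proposed heuristic (illustrative only, not the actual 
 Aggregator code): after a sample of rows, compare the size of the combiner map 
 to the number of rows seen and stop combining when the reduction factor is low.
 {code}
 // illustrative sketch of turning off partial aggregation when reduction is low
 def partialAggregate[K, V](rows: Iterator[(K, V)], merge: (V, V) => V,
     sampleSize: Int = 100000, minReduction: Double = 0.5): Iterator[(K, V)] = {
   val map = scala.collection.mutable.HashMap.empty[K, V]
   val passThrough = scala.collection.mutable.ArrayBuffer.empty[(K, V)]
   var seen = 0L
   var combining = true
   rows.foreach { case (k, v) =>
     if (combining) {
       map(k) = map.get(k).map(merge(_, v)).getOrElse(v)
       seen += 1
       // too many distinct keys relative to rows seen: give up on combining
       if (seen == sampleSize && map.size > seen * minReduction) combining = false
     } else {
       passThrough += ((k, v))   // emit remaining rows unaggregated
     }
   }
   map.iterator ++ passThrough.iterator
 }
 {code}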



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6128) Update Spark Streaming Guide for Spark 1.3

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6128.
--
Resolution: Fixed

 Update Spark Streaming Guide for Spark 1.3
 --

 Key: SPARK-6128
 URL: https://issues.apache.org/jira/browse/SPARK-6128
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical
 Fix For: 1.3.0


 Things to update
 - New Kafka Direct API
 - Python Kafka API
 - Add joins to streaming guide



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6582) Support ssl for this AvroSink in Spark Streaming External

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6582:
-
Fix Version/s: (was: 1.4.0)

 Support ssl for this AvroSink in Spark Streaming External
 -

 Key: SPARK-6582
 URL: https://issues.apache.org/jira/browse/SPARK-6582
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: SaintBacchus

 AvroSink already supports *ssl*, so it's better to also support *ssl* in 
 the Spark Streaming external Flume module. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6194) collect() in PySpark will cause memory leak in JVM

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6194:
-
Target Version/s: 1.3.0, 1.0.3, 1.1.2, 1.2.2  (was: 1.0.3, 1.1.2, 1.2.2, 
1.3.0)
   Fix Version/s: 1.4.0
  1.3.1
  1.2.2

Same, restored Fix versions. I fixed my query now.

 collect() in PySpark will cause memory leak in JVM
 --

 Key: SPARK-6194
 URL: https://issues.apache.org/jira/browse/SPARK-6194
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical
 Fix For: 1.2.2, 1.3.1, 1.4.0


 It could be reproduced  by:
 {code}
 for i in range(40):
 sc.parallelize(range(5000), 10).flatMap(lambda i: range(1)).collect()
 {code}
 It will fail after 2 or 3 jobs, but runs completely successfully if I add
 `gc.collect()` after each job.
 We could call _detach() on the JavaList returned by collect
 in Java; I will send out a PR for this.
 Reported by Michael and commented by Josh:
 On Thu, Mar 5, 2015 at 2:39 PM, Josh Rosen joshro...@databricks.com wrote:
  Based on Py4J's Memory Model page
  (http://py4j.sourceforge.net/advanced_topics.html#py4j-memory-model):
 
  Because Java objects on the Python side are involved in a circular
  reference (JavaObject and JavaMember reference each other), these objects
  are not immediately garbage collected once the last reference to the object
  is removed (but they are guaranteed to be eventually collected if the 
  Python
  garbage collector runs before the Python program exits).
 
 
 
  In doubt, users can always call the detach function on the Python gateway
  to explicitly delete a reference on the Java side. A call to gc.collect()
  also usually works.
 
 
  Maybe we should be manually calling detach() when the Python-side has
  finished consuming temporary objects from the JVM.  Do you have a small
  workload / configuration that reproduces the OOM which we can use to test a
  fix?  I don't think that I've seen this issue in the past, but this might be
  because we mistook Java OOMs as being caused by collecting too much data
  rather than due to memory leaks.
 
  On Thu, Mar 5, 2015 at 10:41 AM, Michael Nazario mnaza...@palantir.com
  wrote:
 
  Hi Josh,
 
  I have a question about how PySpark does memory management in the Py4J
  bridge between the Java driver and the Python driver. I was wondering if
  there have been any memory problems in this system because the Python
  garbage collector does not collect circular references immediately and Py4J
  has circular references in each object it receives from Java.
 
  When I dug through the PySpark code, I seemed to find that most RDD
  actions return by calling collect. In collect, you end up calling the Java
  RDD collect and getting an iterator from that. Would this be a possible
  cause for a Java driver OutOfMemoryException because there are resources in
  Java which do not get freed up immediately?
 
  I have also seen that trying to take a lot of values from a dataset twice
  in a row can cause the Java driver to OOM (while just once works). Are 
  there
  some other memory considerations that are relevant in the driver?
 
  Thanks,
  Michael



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6209) ExecutorClassLoader can leak connections after failing to load classes from the REPL class server

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6209:
-
Fix Version/s: 1.4.0
   1.3.1

Oops, my bulk change shouldn't have caught this one. I see why it is unresolved 
but has Fix versions.

 ExecutorClassLoader can leak connections after failing to load classes from 
 the REPL class server
 -

 Key: SPARK-6209
 URL: https://issues.apache.org/jira/browse/SPARK-6209
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.3, 1.1.2, 1.2.1, 1.3.0, 1.4.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Critical
 Fix For: 1.3.1, 1.4.0


 ExecutorClassLoader does not ensure proper cleanup of network connections 
 that it opens.  If it fails to load a class, it may leak partially-consumed 
 InputStreams that are connected to the REPL's HTTP class server, causing that 
 server to exhaust its thread pool, which can cause the entire job to hang.
 Here is a simple reproduction:
 With
 {code}
 ./bin/spark-shell --master local-cluster[8,8,512] 
 {code}
 run the following command:
 {code}
 sc.parallelize(1 to 1000, 1000).map { x =>
   try {
     Class.forName("some.class.that.does.not.Exist")
   } catch {
     case e: Exception => // do nothing
   }
   x
 }.count()
 {code}
 This job will run 253 tasks, then will completely freeze without any errors 
 or failed tasks.
 It looks like the driver has 253 threads blocked in socketRead0() calls:
 {code}
 [joshrosen ~]$ jstack 16765 | grep socketRead0 | wc
  253 759   14674
 {code}
 e.g.
 {code}
 qtp1287429402-13 daemon prio=5 tid=0x7f868a1c nid=0x5b03 runnable 
 [0x0001159bd000]
java.lang.Thread.State: RUNNABLE
 at java.net.SocketInputStream.socketRead0(Native Method)
 at java.net.SocketInputStream.read(SocketInputStream.java:152)
 at java.net.SocketInputStream.read(SocketInputStream.java:122)
 at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
 at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
 at 
 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
 at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1044)
 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280)
 at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
 at 
 org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
 at 
 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
 at 
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
 at 
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
 at java.lang.Thread.run(Thread.java:745) 
 {code}
 Jstack on the executors shows blocking in loadClass / findClass, where a 
 single thread is RUNNABLE and waiting to hear back from the driver and other 
 executor threads are BLOCKED on object monitor synchronization at 
 Class.forName0().
 Remotely triggering a GC on a hanging executor allows the job to progress and 
 complete more tasks before hanging again.  If I repeatedly trigger GC on all 
 of the executors, then the job runs to completion:
 {code}
 jps | grep CoarseGra | cut -d ' ' -f 1 | xargs -I {} -n 1 -P100 jcmd {} GC.run
 {code}
 The culprit is a {{catch}} block that ignores all exceptions and performs no 
 cleanup: 
 https://github.com/apache/spark/blob/v1.2.0/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L94
 This bug has been present since Spark 1.0.0, but I suspect that we haven't 
 seen it before because it's pretty hard to reproduce. Triggering this error 
 requires a job with tasks that trigger ClassNotFoundExceptions yet are still 
 able to run to completion.  It also requires that executors are able to leak 
 enough open connections to exhaust the class server's Jetty thread pool 
 limit, which requires that there are a large number of tasks (253+) and 
 either a large number of executors or a very low amount of GC pressure on 
 those executors (since GC will cause the leaked connections to be closed).
 The fix here is pretty simple: add proper resource cleanup to this class.
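 As a hedged sketch of what that cleanup means (illustrative only, not the 
 actual patch), the stream obtained from the class server should be closed in a 
 finally block even when reading the class bytes fails:
 {code}
 // illustrative only: always close the stream, even when the read fails
 def readClassBytes(open: () => java.io.InputStream): Option[Array[Byte]] = {
   val in = open()
   try {
     val out = new java.io.ByteArrayOutputStream()
     val buf = new Array[Byte](8192)
     var n = in.read(buf)
     while (n != -1) { out.write(buf, 0, n); n = in.read(buf) }
     Some(out.toByteArray)
   } catch {
     case _: Exception => None   // still swallow the error as before ...
   } finally {
     in.close()                  // ... but never leak the connection
   }
 }
 {code}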



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6006) Optimize count distinct in case of high cardinality columns

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6006:
-
Fix Version/s: (was: 1.3.0)

 Optimize count distinct in case of high cardinality columns
 ---

 Key: SPARK-6006
 URL: https://issues.apache.org/jira/browse/SPARK-6006
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.1, 1.2.1
Reporter: Yash Datta
Priority: Minor

 When there are a lot of distinct values, count distinct becomes too slow 
 since it tries to hash all partial results into one map. It can be improved by 
 creating buckets/partial maps in an intermediate stage, where the same key from 
 multiple partial maps of the first stage hashes to the same bucket. Later we can 
 sum the sizes of these buckets to get the total distinct count.
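 As a hedged illustration of the same two-stage idea at the user level (the 
 table and column names below are placeholders, and this is a workaround rather 
 than the proposed planner change): let the shuffle spread the distinct keys 
 across partitions first, then count them.
 {code}
 // SQL form: distinct keys are hash-partitioned into buckets before counting
 val distinctCount = sqlContext.sql(
   "SELECT COUNT(*) FROM (SELECT DISTINCT high_card_col FROM events) tmp")
 // RDD form of the same idea
 val total = sc.textFile("/placeholder/events").distinct().count()
 {code}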



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5720) `Create Table Like` in HiveContext need support `like registered temporary table`

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5720:
-
Fix Version/s: (was: 1.3.0)

 `Create Table Like` in HiveContext need support `like registered temporary 
 table`
 -

 Key: SPARK-5720
 URL: https://issues.apache.org/jira/browse/SPARK-5720
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Li Sheng
   Original Estimate: 72h
  Remaining Estimate: 72h





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5192) Parquet fails to parse schema contains '\r'

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5192:
-
Fix Version/s: (was: 1.3.0)

 Parquet fails to parse schema contains '\r'
 ---

 Key: SPARK-5192
 URL: https://issues.apache.org/jira/browse/SPARK-5192
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
 Environment: windows7 + Intellj idea 13.0.2 
Reporter: cen yuhai
Priority: Minor

 I think this is actually a bug in parquet. When I debugged 'ParquetTestData', 
 I found an exception as below. So I downloaded the source of MessageTypeParser; 
 the function 'isWhitespace' does not check for '\r':
 private boolean isWhitespace(String t) {
   return t.equals(" ") || t.equals("\t") || t.equals("\n");
 }
 So I replaced all '\r' to work around this issue:
   val subTestSchema =
     """
       message myrecord {
         optional boolean myboolean;
         optional int64 mylong;
       }
     """.replaceAll("\r", "")
 at line 0: message myrecord {
   at 
 parquet.schema.MessageTypeParser.asRepetition(MessageTypeParser.java:203)
   at parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:101)
   at 
 parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:96)
   at parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:89)
   at 
 parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:79)
   at 
 org.apache.spark.sql.parquet.ParquetTestData$.writeFile(ParquetTestData.scala:221)
   at 
 org.apache.spark.sql.parquet.ParquetQuerySuite.beforeAll(ParquetQuerySuite.scala:92)
   at 
 org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
   at 
 org.apache.spark.sql.parquet.ParquetQuerySuite.beforeAll(ParquetQuerySuite.scala:85)
   at 
 org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
   at 
 org.apache.spark.sql.parquet.ParquetQuerySuite.run(ParquetQuerySuite.scala:85)
   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5264) Support `drop temporary table [if exists]` DDL command

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5264:
-
Fix Version/s: (was: 1.3.0)

 Support `drop temporary table [if exists]` DDL command 
 ---

 Key: SPARK-5264
 URL: https://issues.apache.org/jira/browse/SPARK-5264
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.3.0
Reporter: Li Sheng
Priority: Minor
   Original Estimate: 72h
  Remaining Estimate: 72h

 Support the `drop table` DDL command, 
 i.e. DROP [TEMPORARY] TABLE [IF EXISTS] tbl_name
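 For concreteness, the requested usage would look like the following (illustrative 
 only; this is the syntax proposed above, not something supported in 1.3.0):
 {code}
 // proposed usage from this JIRA; hypothetical until the feature lands
 sqlContext.sql("DROP TEMPORARY TABLE IF EXISTS tbl_name")
 {code}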



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4752) Classifier based on artificial neural network

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4752:
-
Fix Version/s: (was: 1.3.0)

 Classifier based on artificial neural network
 -

 Key: SPARK-4752
 URL: https://issues.apache.org/jira/browse/SPARK-4752
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Alexander Ulanov
   Original Estimate: 168h
  Remaining Estimate: 168h

 Implement classifier based on artificial neural network (ANN). Requirements:
 1) Use the existing artificial neural network implementation 
 https://issues.apache.org/jira/browse/SPARK-2352, 
 https://github.com/apache/spark/pull/1290
 2) Extend MLlib ClassificationModel trait, 
 3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training,
 4) Be able to return the ANN model



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5362) Gradient and Optimizer to support generic output (instead of label) and data batches

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5362:
-
Fix Version/s: (was: 1.3.0)

 Gradient and Optimizer to support generic output (instead of label) and data 
 batches
 

 Key: SPARK-5362
 URL: https://issues.apache.org/jira/browse/SPARK-5362
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Alexander Ulanov
   Original Estimate: 24h
  Remaining Estimate: 24h

 Currently, the Gradient and Optimizer interfaces support data in the form of 
 RDD[(Double, Vector)], which refers to label and features. This limits their 
 application to classification problems. For example, an artificial neural 
 network demands a Vector as output (instead of label: Double). Moreover, the 
 current interface does not support data batches. I propose to replace label: 
 Double with output: Vector. It enables passing generic output instead of a 
 label and also passing data and output batches stored in corresponding 
 vectors.
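 A hedged sketch of the proposed interface change (the trait name below is 
 illustrative only, not an existing MLlib API): the label: Double slot becomes 
 an output: Vector, which also leaves room for packing batches into vectors.
 {code}
 import org.apache.spark.mllib.linalg.Vector

 // illustrative sketch of the proposal; MLlib's real Gradient takes label: Double
 trait VectorOutputGradient extends Serializable {
   /** Compute (gradient, loss) for one example whose target is a Vector. */
   def compute(data: Vector, output: Vector, weights: Vector): (Vector, Double)
 }
 {code}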



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5522) Accelerate the History Server start

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5522.
--
Resolution: Fixed

Looks resolved by https://github.com/apache/spark/pull/4525 but just never got 
marked as such.

 Accelerate the History Server start
 ---

 Key: SPARK-5522
 URL: https://issues.apache.org/jira/browse/SPARK-5522
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Web UI
Affects Versions: 1.0.0
Reporter: Liangliang Gu
Assignee: Liangliang Gu
 Fix For: 1.4.0


 When starting the history server, all the log files will be fetched and 
 parsed in order to get the applications' metadata, e.g. App Name, Start Time, 
 Duration, etc. In our production cluster, there are 2600 log files (160G) 
 in HDFS and it takes 3 hours to restart the history server, which is a little 
 bit too long for us.
 It would be better if the history server could show logs with missing 
 information during start-up and fill in the missing information after fetching 
 and parsing each log file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6470) Allow Spark apps to put YARN node labels in their requests

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6470:
---

Assignee: Sandy Ryza  (was: Apache Spark)

 Allow Spark apps to put YARN node labels in their requests
 --

 Key: SPARK-6470
 URL: https://issues.apache.org/jira/browse/SPARK-6470
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Sandy Ryza
Assignee: Sandy Ryza





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6470) Allow Spark apps to put YARN node labels in their requests

2015-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385363#comment-14385363
 ] 

Apache Spark commented on SPARK-6470:
-

User 'sryza' has created a pull request for this issue:
https://github.com/apache/spark/pull/5242

 Allow Spark apps to put YARN node labels in their requests
 --

 Key: SPARK-6470
 URL: https://issues.apache.org/jira/browse/SPARK-6470
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Sandy Ryza
Assignee: Sandy Ryza





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6470) Allow Spark apps to put YARN node labels in their requests

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6470:
---

Assignee: Apache Spark  (was: Sandy Ryza)

 Allow Spark apps to put YARN node labels in their requests
 --

 Key: SPARK-6470
 URL: https://issues.apache.org/jira/browse/SPARK-6470
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Sandy Ryza
Assignee: Apache Spark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message

2015-03-28 Thread Spiro Michaylov (JIRA)
Spiro Michaylov created SPARK-6587:
--

 Summary: Inferring schema for case class hierarchy fails with 
mysterious message
 Key: SPARK-6587
 URL: https://issues.apache.org/jira/browse/SPARK-6587
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: At least Windows 8, Scala 2.11.2.  
Reporter: Spiro Michaylov


(Don't know if this is a functionality bug, error reporting bug or an RFE ...)

I define the following hierarchy:

{code}
private abstract class MyHolder
private case class StringHolder(s: String) extends MyHolder
private case class IntHolder(i: Int) extends MyHolder
private case class BooleanHolder(b: Boolean) extends MyHolder
{code}

and a top level case class:

{code}
private case class Thing(key: Integer, foo: MyHolder)
{code}

When I try to convert it:

{code}
val things = Seq(
  Thing(1, IntHolder(42)),
  Thing(2, StringHolder("hello")),
  Thing(3, BooleanHolder(false))
)
val thingsDF = sc.parallelize(things, 4).toDF()

thingsDF.registerTempTable("things")

val all = sqlContext.sql("SELECT * from things")
{code}

I get the following stack trace:

{quote}
Exception in thread "main" scala.MatchError: 
sql.CaseClassSchemaProblem.MyHolder (of class 
scala.reflect.internal.Types$ClassNoArgsTypeRef)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157)
at scala.collection.immutable.List.map(List.scala:276)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312)
at 
org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250)
at sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35)
at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
{quote}

I wrote this to answer [a question on 
StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql]
 which uses a much simpler approach and suffers the same problem.

Looking at what seems to me to be the [relevant unit test 
suite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/ScalaReflectionRelationSuite.scala]
 I see that this case is not covered.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-03-28 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das updated SPARK-2426:

Affects Version/s: (was: 1.3.0)
   1.4.0

 Quadratic Minimization for MLlib ALS
 

 Key: SPARK-2426
 URL: https://issues.apache.org/jira/browse/SPARK-2426
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
Assignee: Debasish Das
   Original Estimate: 504h
  Remaining Estimate: 504h

 Current ALS supports least squares and nonnegative least squares.
 I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
 the following ALS problems:
 1. ALS with bounds
 2. ALS with L1 regularization
 3. ALS with Equality constraint and bounds
 Initial runtime comparisons are presented at Spark Summit. 
 http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
 Based on Xiangrui's feedback I am currently comparing the ADMM based 
 Quadratic Minimization solvers with IPM based QpSolvers and the default 
 ALS/NNLS. I will keep updating the runtime comparison results.
 For integration the detailed plan is as follows:
 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
 2. Integrate QuadraticMinimizer in mllib ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6588) Private VPC's and subnets currently don't work with the Spark ec2 script

2015-03-28 Thread Michelangelo D'Agostino (JIRA)
Michelangelo D'Agostino created SPARK-6588:
--

 Summary: Private VPC's and subnets currently don't work with the 
Spark ec2 script
 Key: SPARK-6588
 URL: https://issues.apache.org/jira/browse/SPARK-6588
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.3.0
Reporter: Michelangelo D'Agostino
Priority: Minor


The spark_ec2.py script currently references the ip_address and public_dns_name 
attributes of an instance.  On private networks, these fields aren't set, so we 
have problems.

The solution, which I've just finished coding up, is to introduce a 
--private-ips flag that instead refers to the private_ip_address attribute in 
both cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6588) Private VPC's and subnets currently don't work with the Spark ec2 script

2015-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6588.
--
Resolution: Duplicate

Also SPARK-5246, SPARK-6220. Have a look at the existing JIRAs and see if you 
can resolve one of them to this effect.

 Private VPC's and subnets currently don't work with the Spark ec2 script
 

 Key: SPARK-6588
 URL: https://issues.apache.org/jira/browse/SPARK-6588
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.3.0
Reporter: Michelangelo D'Agostino
Priority: Minor

 The spark_ec2.py script currently references the ip_address and 
 public_dns_name attributes of an instance.  On private networks, these fields 
 aren't set, so we have problems.
 The solution, which I've just finished coding up, is to introduce a 
 --private-ips flag that instead refers to the private_ip_address attribute in 
 both cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6588) Private VPC's and subnets currently don't work with the Spark ec2 script

2015-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385435#comment-14385435
 ] 

Apache Spark commented on SPARK-6588:
-

User 'mdagost' has created a pull request for this issue:
https://github.com/apache/spark/pull/5244

 Private VPC's and subnets currently don't work with the Spark ec2 script
 

 Key: SPARK-6588
 URL: https://issues.apache.org/jira/browse/SPARK-6588
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.3.0
Reporter: Michelangelo D'Agostino
Priority: Minor

 The spark_ec2.py script currently references the ip_address and 
 public_dns_name attributes of an instance.  On private networks, these fields 
 aren't set, so we have problems.
 The solution, which I've just finished coding up, is to introduce a 
 --private-ips flag that instead refers to the private_ip_address attribute in 
 both cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5894) Add PolynomialMapper

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5894:
---

Assignee: Apache Spark

 Add PolynomialMapper
 

 Key: SPARK-5894
 URL: https://issues.apache.org/jira/browse/SPARK-5894
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng
Assignee: Apache Spark

 `PolynomialMapper` takes a vector column and outputs a vector column with 
 polynomial feature mapping.
 {code}
 val poly = new PolynomialMapper()
   .setInputCol("features")
   .setDegree(2)
   .setOutputCols("polyFeatures")
 {code}
 It should handle the output feature names properly. Maybe we can make a 
 better name for it instead of calling it `PolynomialMapper`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5894) Add PolynomialMapper

2015-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385446#comment-14385446
 ] 

Apache Spark commented on SPARK-5894:
-

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/5245

 Add PolynomialMapper
 

 Key: SPARK-5894
 URL: https://issues.apache.org/jira/browse/SPARK-5894
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng

 `PolynomialMapper` takes a vector column and outputs a vector column with 
 polynomial feature mapping.
 {code}
 val poly = new PolynomialMapper()
   .setInputCol("features")
   .setDegree(2)
   .setOutputCols("polyFeatures")
 {code}
 It should handle the output feature names properly. Maybe we can make a 
 better name for it instead of calling it `PolynomialMapper`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5894) Add PolynomialMapper

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5894:
---

Assignee: (was: Apache Spark)

 Add PolynomialMapper
 

 Key: SPARK-5894
 URL: https://issues.apache.org/jira/browse/SPARK-5894
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng

 `PolynomialMapper` takes a vector column and outputs a vector column with 
 polynomial feature mapping.
 {code}
 val poly = new PolynomialMapper()
   .setInputCol("features")
   .setDegree(2)
   .setOutputCols("polyFeatures")
 {code}
 It should handle the output feature names properly. Maybe we can make a 
 better name for it instead of calling it `PolynomialMapper`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6582) Support ssl for this AvroSink in Spark Streaming External

2015-03-28 Thread SaintBacchus (JIRA)
SaintBacchus created SPARK-6582:
---

 Summary: Support ssl for this AvroSink in Spark Streaming External
 Key: SPARK-6582
 URL: https://issues.apache.org/jira/browse/SPARK-6582
 Project: Spark
  Issue Type: Improvement
Reporter: SaintBacchus
 Fix For: 1.4.0


AvroSink already supports *ssl*, so it's better to also support *ssl* in the 
Spark Streaming external Flume module. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6583) Support aggregated function in order by

2015-03-28 Thread Yadong Qi (JIRA)
Yadong Qi created SPARK-6583:


 Summary: Support aggregated function in order by
 Key: SPARK-6583
 URL: https://issues.apache.org/jira/browse/SPARK-6583
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yadong Qi






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6584) Provide ExecutorPrefixTaskLocation to support the rdd which can be aware of partition's executor location.

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6584:
---

Assignee: Apache Spark

 Provide ExecutorPrefixTaskLocation to support the rdd which can be aware of 
 partition's executor  location.
 ---

 Key: SPARK-6584
 URL: https://issues.apache.org/jira/browse/SPARK-6584
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: SaintBacchus
Assignee: Apache Spark

 The function *RDD.getPreferredLocations* can only express host-level 
 preferred locations.
 If an *RDD* wants its partitions to be scheduled onto a specific executor (such 
 as a BlockRDD), Spark can do nothing for this.
 So I want to provide *ExecutorPrefixTaskLocation* to support RDDs that 
 can be aware of a partition's executor location. This mechanism can avoid data 
 transfer in the case of many executors on the same host.
 I think it's very useful especially for *SparkStreaming*, since the 
 *Receiver* saves data into the *BlockManager* and then becomes a BlockRDD.
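 As a hedged sketch of the idea (illustrative only; *ExecutorPrefixTaskLocation* 
 is this proposal's name, not an existing public API), an RDD can already 
 override getPreferredLocations with hostnames; the proposal would allow it to 
 name a specific executor instead:
 {code}
 import org.apache.spark.{Partition, SparkContext, TaskContext}
 import org.apache.spark.rdd.RDD

 // prefs(i) is the preferred location for partition i; today only a hostname
 // is honoured, the proposal would also honour an executor-level location
 class ExecutorAwareRDD(sc: SparkContext, prefs: Seq[String])
     extends RDD[Int](sc, Nil) {
   private case class P(index: Int) extends Partition
   override def getPartitions: Array[Partition] =
     Array.tabulate[Partition](prefs.size)(i => P(i))
   override def compute(split: Partition, ctx: TaskContext): Iterator[Int] =
     Iterator(split.index)
   override def getPreferredLocations(split: Partition): Seq[String] =
     Seq(prefs(split.index))
 }
 {code}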



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6584) Provide ExecutorPrefixTaskLocation to support the rdd which can be aware of partition's executor location.

2015-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385221#comment-14385221
 ] 

Apache Spark commented on SPARK-6584:
-

User 'SaintBacchus' has created a pull request for this issue:
https://github.com/apache/spark/pull/5240

 Provide ExecutorPrefixTaskLocation to support the rdd which can be aware of 
 partition's executor  location.
 ---

 Key: SPARK-6584
 URL: https://issues.apache.org/jira/browse/SPARK-6584
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: SaintBacchus

 The function *RDD.getPreferredLocations* can only express host-level 
 preferred locations.
 If an *RDD* wants its partitions to be scheduled onto a specific executor (such 
 as a BlockRDD), Spark can do nothing for this.
 So I want to provide *ExecutorPrefixTaskLocation* to support RDDs that 
 can be aware of a partition's executor location. This mechanism can avoid data 
 transfer in the case of many executors on the same host.
 I think it's very useful especially for *SparkStreaming*, since the 
 *Receiver* saves data into the *BlockManager* and then becomes a BlockRDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org




[jira] [Assigned] (SPARK-6584) Provide ExecutorPrefixTaskLocation to support the rdd which can be aware of partition's executor location.

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6584:
---

Assignee: (was: Apache Spark)

 Provide ExecutorPrefixTaskLocation to support the rdd which can be aware of 
 partition's executor  location.
 ---

 Key: SPARK-6584
 URL: https://issues.apache.org/jira/browse/SPARK-6584
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: SaintBacchus

 The function *RDD.getPreferredLocations* can only express host-level 
 preferred locations.
 If an *RDD* wants its partitions to be scheduled onto a specific executor (such 
 as a BlockRDD), Spark can do nothing for this.
 So I want to provide *ExecutorPrefixTaskLocation* to support RDDs that 
 can be aware of a partition's executor location. This mechanism can avoid data 
 transfer in the case of many executors on the same host.
 I think it's very useful especially for *SparkStreaming*, since the 
 *Receiver* saves data into the *BlockManager* and then becomes a BlockRDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6464) Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6464:
---

Assignee: (was: Apache Spark)

 Add a new transformation of rdd named processCoalesce which was  particularly 
 to deal with the small and cached rdd
 ---

 Key: SPARK-6464
 URL: https://issues.apache.org/jira/browse/SPARK-6464
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: SaintBacchus
 Attachments: screenshot-1.png


 Nowadays, the *coalesce* transformation is often used to expand or reduce 
 the number of partitions in order to get good performance.
 But *coalesce* can't make sure that a child partition will be executed on 
 the same executor as its parent partitions, and this can lead to a large amount 
 of network transfer.
 In the scenario mentioned in the title, +small and cached rdd+, we 
 want to coalesce all the partitions on the same executor into one partition 
 and make sure the child partition is executed on that executor. This can 
 avoid network transfer, reduce the scheduling of tasks, and also 
 free the cpu cores to do other jobs. 
 In this scenario, our performance improved 20% compared to before.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6464) Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6464:
---

Assignee: Apache Spark

 Add a new transformation of rdd named processCoalesce which was  particularly 
 to deal with the small and cached rdd
 ---

 Key: SPARK-6464
 URL: https://issues.apache.org/jira/browse/SPARK-6464
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: SaintBacchus
Assignee: Apache Spark
 Attachments: screenshot-1.png


 Nowadays, the *coalesce* transformation is often used to expand or reduce 
 the number of partitions in order to get good performance.
 But *coalesce* can't make sure that a child partition will be executed on 
 the same executor as its parent partitions, and this can lead to a large amount 
 of network transfer.
 In the scenario mentioned in the title, +small and cached rdd+, we 
 want to coalesce all the partitions on the same executor into one partition 
 and make sure the child partition is executed on that executor. This can 
 avoid network transfer, reduce the scheduling of tasks, and also 
 free the cpu cores to do other jobs. 
 In this scenario, our performance improved 20% compared to before.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5338) Support cluster mode with Mesos

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5338:
---

Assignee: (was: Apache Spark)

 Support cluster mode with Mesos
 ---

 Key: SPARK-5338
 URL: https://issues.apache.org/jira/browse/SPARK-5338
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen

 Currently using Spark with Mesos, the only supported deployment is client 
 mode.
 It is also useful to have a cluster mode deployment that can be shared and 
 long running.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6584) Provide ExecutorPrefixTaskLocation to support RDDs that are aware of a partition's executor location.

2015-03-28 Thread SaintBacchus (JIRA)
SaintBacchus created SPARK-6584:
---

 Summary: Provide ExecutorPrefixTaskLocation to support RDDs that 
are aware of a partition's executor location.
 Key: SPARK-6584
 URL: https://issues.apache.org/jira/browse/SPARK-6584
 Project: Spark
  Issue Type: Sub-task
Affects Versions: 1.4.0
Reporter: SaintBacchus


The function *RDD.getPreferredLocations* can only express host-level preferred 
locations.
If an *RDD* needs its partitions scheduled onto a specific executor (such as a 
BlockRDD), Spark currently cannot do this.
So I want to provide *ExecutorPrefixTaskLocation* to support RDDs that are 
aware of each partition's executor location. This mechanism can avoid data 
transfer when many executors run on the same host.
I think it is especially useful for *SparkStreaming*, since the *Receiver* 
saves data into the *BlockManager* and the blocks then become a BlockRDD.
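A minimal sketch of the extension point this proposal targets, assuming a 
custom RDD; the executor-level location string and ExecutorPrefixTaskLocation 
are assumptions from this proposal, not an existing Spark API:
{code}
// Sketch only: shows where executor-aware preferred locations would plug in.
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class ExecutorAwareRDD(sc: SparkContext, numParts: Int)
  extends RDD[Int](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    (0 until numParts).map { i =>
      new Partition { override def index: Int = i }
    }.toArray

  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator(split.index)

  // Today this can only name hosts; the proposal would let it also name a
  // specific executor, e.g. a string like "executor_host1_3" (hypothetical).
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq("host1")
}
{code}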



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5338) Support cluster mode with Mesos

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5338:
---

Assignee: Apache Spark

 Support cluster mode with Mesos
 ---

 Key: SPARK-5338
 URL: https://issues.apache.org/jira/browse/SPARK-5338
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen
Assignee: Apache Spark

 Currently using Spark with Mesos, the only supported deployment is client 
 mode.
 It is also useful to have a cluster mode deployment that can be shared and 
 long running.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6585) FileServerSuite.test (HttpFileServer should not work with SSL when the server is untrusted) fails in some environments.

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6585:
---

Assignee: Apache Spark

 FileServerSuite.test (HttpFileServer should not work with SSL when the 
 server is untrusted) fails in some environments.
 -

 Key: SPARK-6585
 URL: https://issues.apache.org/jira/browse/SPARK-6585
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.3.0
Reporter: June
Assignee: Apache Spark
Priority: Minor

 On my test machine, the FileServerSuite test case (HttpFileServer should not 
 work with SSL when the server is untrusted) throws SSLException rather than 
 SSLHandshakeException. I suggest changing the test to catch SSLException to 
 improve the test case's robustness; a sketch of the suggested change follows 
 the stack trace below.
 [info] - HttpFileServer should not work with SSL when the server is untrusted 
 *** FAILED *** (69 milliseconds)
 [info]   Expected exception javax.net.ssl.SSLHandshakeException to be thrown, 
 but javax.net.ssl.SSLException was thrown. (FileServerSuite.scala:231)
 [info]   org.scalatest.exceptions.TestFailedException:
 [info]   at 
 org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496)
 [info]   at 
 org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
 [info]   at org.scalatest.Assertions$class.intercept(Assertions.scala:1004)
 [info]   at org.scalatest.FunSuite.intercept(FunSuite.scala:1555)
 [info]   at 
 org.apache.spark.FileServerSuite$$anonfun$15.apply$mcV$sp(FileServerSuite.scala:231)
 [info]   at 
 org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224)
 [info]   at 
 org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224)
 [info]   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
 [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
 [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
 [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
 [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
 [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
 [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
 [info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
 [info]   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
 [info]   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
 [info]   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
 [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
 [info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
 [info]   at 
 org.apache.spark.FileServerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(FileServerSuite.scala:34)
 [info]   at 
 org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
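 A minimal sketch of the suggested change (ScalaTest), assuming the shape of 
 the existing test and a fileTransferTest helper in FileServerSuite; since 
 SSLHandshakeException is a subclass of SSLException, intercepting the broader 
 type also covers environments that throw the more specific one:
 {code}
 import javax.net.ssl.SSLException

 test("HttpFileServer should not work with SSL when the server is untrusted") {
   // ... set up the SparkContext against the untrusted SSL server as in the
   // existing suite (details omitted) ...
   intercept[SSLException] {
     fileTransferTest(sc)  // assumed helper from the existing suite
   }
 }
 {code}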



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5277) SparkSqlSerializer does not register user specified KryoRegistrators

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5277:
---

Assignee: Apache Spark

 SparkSqlSerializer does not register user specified KryoRegistrators 
 -

 Key: SPARK-5277
 URL: https://issues.apache.org/jira/browse/SPARK-5277
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1, 1.3.0
Reporter: Max Seiden
Assignee: Apache Spark

 Although the SparkSqlSerializer class extends the KryoSerializer in core, 
 its overridden newKryo() does not call super.newKryo(). This results in 
 inconsistent serializer behaviors depending on whether a KryoSerializer 
 instance or a SparkSqlSerializer instance is used. This may also be related 
 to the TODO in KryoResourcePool, which uses KryoSerializer instead of 
 SparkSqlSerializer due to yet-to-be-investigated test failures.
 An example of the divergence in behavior: The Exchange operator creates a new 
 SparkSqlSerializer instance (with an empty conf; another issue) when it is 
 constructed, whereas the GENERIC ColumnType pulls a KryoSerializer out of the 
 resource pool (see above). The result is that the serialized in-memory 
 columns are created using the user provided serializers / registrators, while 
 serialization during exchange does not.
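 A minimal sketch of the kind of fix described above, assuming the rough shape 
 of SparkSqlSerializer at the time; this is not the actual patch:
 {code}
 import com.esotericsoftware.kryo.Kryo
 import org.apache.spark.SparkConf
 import org.apache.spark.serializer.KryoSerializer

 class SparkSqlSerializer(conf: SparkConf) extends KryoSerializer(conf) {
   override def newKryo(): Kryo = {
     // Start from super.newKryo() so user-specified registrators
     // (spark.kryo.registrator) are applied, then add SQL-specific classes.
     val kryo = super.newKryo()
     // kryo.register(classOf[SomeSqlInternalClass])  // hypothetical example
     kryo
   }
 }
 {code}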



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5277) SparkSqlSerializer does not register user specified KryoRegistrators

2015-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385172#comment-14385172
 ] 

Apache Spark commented on SPARK-5277:
-

User 'mhseiden' has created a pull request for this issue:
https://github.com/apache/spark/pull/5237

 SparkSqlSerializer does not register user specified KryoRegistrators 
 -

 Key: SPARK-5277
 URL: https://issues.apache.org/jira/browse/SPARK-5277
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1, 1.3.0
Reporter: Max Seiden

 Although the SparkSqlSerializer class extends the KryoSerializer in core, 
 its overridden newKryo() does not call super.newKryo(). This results in 
 inconsistent serializer behaviors depending on whether a KryoSerializer 
 instance or a SparkSqlSerializer instance is used. This may also be related 
 to the TODO in KryoResourcePool, which uses KryoSerializer instead of 
 SparkSqlSerializer due to yet-to-be-investigated test failures.
 An example of the divergence in behavior: The Exchange operator creates a new 
 SparkSqlSerializer instance (with an empty conf; another issue) when it is 
 constructed, whereas the GENERIC ColumnType pulls a KryoSerializer out of the 
 resource pool (see above). The result is that the serialized in-memory 
 columns are created using the user provided serializers / registrators, while 
 serialization during exchange does not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5277) SparkSqlSerializer does not register user specified KryoRegistrators

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5277:
---

Assignee: (was: Apache Spark)

 SparkSqlSerializer does not register user specified KryoRegistrators 
 -

 Key: SPARK-5277
 URL: https://issues.apache.org/jira/browse/SPARK-5277
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1, 1.3.0
Reporter: Max Seiden

 Although the SparkSqlSerializer class extends the KryoSerializer in core, 
 its overridden newKryo() does not call super.newKryo(). This results in 
 inconsistent serializer behaviors depending on whether a KryoSerializer 
 instance or a SparkSqlSerializer instance is used. This may also be related 
 to the TODO in KryoResourcePool, which uses KryoSerializer instead of 
 SparkSqlSerializer due to yet-to-be-investigated test failures.
 An example of the divergence in behavior: The Exchange operator creates a new 
 SparkSqlSerializer instance (with an empty conf; another issue) when it is 
 constructed, whereas the GENERIC ColumnType pulls a KryoSerializer out of the 
 resource pool (see above). The result is that the serialized in-memory 
 columns are created using the user provided serializers / registrators, while 
 serialization during exchange does not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6577) SparseMatrix should be supported in PySpark

2015-03-28 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385183#comment-14385183
 ] 

Joseph K. Bradley commented on SPARK-6577:
--

Good point.  [~mengxr], do we want to require scipy and add UDTs which handle 
numpy and scipy dense and sparse vectors and matrices?  Or do we want to add 
our own SparseMatrix?

 SparseMatrix should be supported in PySpark
 ---

 Key: SPARK-6577
 URL: https://issues.apache.org/jira/browse/SPARK-6577
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Manoj Kumar





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6585) FileServerSuite.test (HttpFileServer should not work with SSL when the server is untrusted) fails in some environments.

2015-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6585:
---

Assignee: (was: Apache Spark)

 FileServerSuite.test (HttpFileServer should not work with SSL when the 
 server is untrusted) fails in some environments.
 -

 Key: SPARK-6585
 URL: https://issues.apache.org/jira/browse/SPARK-6585
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.3.0
Reporter: June
Priority: Minor

 On my test machine, the FileServerSuite test case (HttpFileServer should not 
 work with SSL when the server is untrusted) throws SSLException rather than 
 SSLHandshakeException. I suggest changing the test to catch SSLException to 
 improve the test case's robustness.
 [info] - HttpFileServer should not work with SSL when the server is untrusted 
 *** FAILED *** (69 milliseconds)
 [info]   Expected exception javax.net.ssl.SSLHandshakeException to be thrown, 
 but javax.net.ssl.SSLException was thrown. (FileServerSuite.scala:231)
 [info]   org.scalatest.exceptions.TestFailedException:
 [info]   at 
 org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496)
 [info]   at 
 org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
 [info]   at org.scalatest.Assertions$class.intercept(Assertions.scala:1004)
 [info]   at org.scalatest.FunSuite.intercept(FunSuite.scala:1555)
 [info]   at 
 org.apache.spark.FileServerSuite$$anonfun$15.apply$mcV$sp(FileServerSuite.scala:231)
 [info]   at 
 org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224)
 [info]   at 
 org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224)
 [info]   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
 [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
 [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
 [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
 [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
 [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
 [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
 [info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
 [info]   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
 [info]   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
 [info]   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
 [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
 [info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
 [info]   at 
 org.apache.spark.FileServerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(FileServerSuite.scala:34)
 [info]   at 
 org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5277) SparkSqlSerializer does not register user specified KryoRegistrators

2015-03-28 Thread Max Seiden (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Seiden updated SPARK-5277:
--
Affects Version/s: (was: 1.2.0)
   1.2.1
   1.3.0

 SparkSqlSerializer does not register user specified KryoRegistrators 
 -

 Key: SPARK-5277
 URL: https://issues.apache.org/jira/browse/SPARK-5277
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1, 1.3.0
Reporter: Max Seiden

 Although the SparkSqlSerializer class extends the KryoSerializer in core, 
 its overridden newKryo() does not call super.newKryo(). This results in 
 inconsistent serializer behaviors depending on whether a KryoSerializer 
 instance or a SparkSqlSerializer instance is used. This may also be related 
 to the TODO in KryoResourcePool, which uses KryoSerializer instead of 
 SparkSqlSerializer due to yet-to-be-investigated test failures.
 An example of the divergence in behavior: The Exchange operator creates a new 
 SparkSqlSerializer instance (with an empty conf; another issue) when it is 
 constructed, whereas the GENERIC ColumnType pulls a KryoSerializer out of the 
 resource pool (see above). The result is that the serialized in-memory 
 columns are created using the user provided serializers / registrators, while 
 serialization during exchange does not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6464) Add a new RDD transformation named processCoalesce, particularly for small and cached RDDs

2015-03-28 Thread SaintBacchus (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SaintBacchus updated SPARK-6464:

Affects Version/s: (was: 1.3.0)
   1.4.0

 Add a new RDD transformation named processCoalesce, particularly for small 
 and cached RDDs
 ---

 Key: SPARK-6464
 URL: https://issues.apache.org/jira/browse/SPARK-6464
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: SaintBacchus
 Attachments: screenshot-1.png


 The transformation *coalesce* is commonly used to expand or reduce the number 
 of partitions in order to get good performance.
 But *coalesce* cannot guarantee that a child partition will be executed on 
 the same executor as its parent partitions, which can lead to a large amount 
 of network transfer.
 In the scenario mentioned in the title, a +small and cached rdd+, we want to 
 coalesce all partitions on the same executor into one partition and make sure 
 the child partition is executed on that executor. This avoids network 
 transfer, reduces task scheduling overhead, and frees CPU cores for other 
 work.
 In this scenario, our performance improved by 20%.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6585) FileServerSuite.test (HttpFileServer should not work with SSL when the server is untrusted) fails in some environments.

2015-03-28 Thread June (JIRA)
June created SPARK-6585:
---

 Summary: FileServerSuite.test (HttpFileServer should not work 
with SSL when the server is untrusted) fails in some environments.
 Key: SPARK-6585
 URL: https://issues.apache.org/jira/browse/SPARK-6585
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.3.0
Reporter: June
Priority: Minor


On my test machine, the FileServerSuite test case (HttpFileServer should not 
work with SSL when the server is untrusted) throws SSLException rather than 
SSLHandshakeException. I suggest changing the test to catch SSLException to 
improve the test case's robustness.

[info] - HttpFileServer should not work with SSL when the server is untrusted 
*** FAILED *** (69 milliseconds)
[info]   Expected exception javax.net.ssl.SSLHandshakeException to be thrown, 
but javax.net.ssl.SSLException was thrown. (FileServerSuite.scala:231)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496)
[info]   at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info]   at org.scalatest.Assertions$class.intercept(Assertions.scala:1004)
[info]   at org.scalatest.FunSuite.intercept(FunSuite.scala:1555)
[info]   at 
org.apache.spark.FileServerSuite$$anonfun$15.apply$mcV$sp(FileServerSuite.scala:231)
[info]   at 
org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224)
[info]   at 
org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224)
[info]   at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
[info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
[info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
[info]   at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
[info]   at 
org.apache.spark.FileServerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(FileServerSuite.scala:34)
[info]   at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6577) SparseMatrix should be supported in PySpark

2015-03-28 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385223#comment-14385223
 ] 

Manoj Kumar commented on SPARK-6577:


Ah, I just noticed that SciPy is an optional dependency. In any case, I believe 
that having an if _have_scipy / else clause would lead to more lines of code 
to maintain. We could either make SciPy a hard dependency, which would mean 
SparseMatrix would be a wrapper around SciPy's CSR routines, or we could just 
implement our own methods.



 SparseMatrix should be supported in PySpark
 ---

 Key: SPARK-6577
 URL: https://issues.apache.org/jira/browse/SPARK-6577
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Manoj Kumar





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org