[jira] [Created] (SPARK-6603) SQLContext.registerFunction -> SQLContext.udf.register

2015-03-30 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-6603:
--

 Summary: SQLContext.registerFunction -> SQLContext.udf.register
 Key: SPARK-6603
 URL: https://issues.apache.org/jira/browse/SPARK-6603
 Project: Spark
  Issue Type: Improvement
Reporter: Reynold Xin
Assignee: Davies Liu


We didn't change the Python implementation to use the udf.register API. Maybe 
the best strategy is to deprecate SQLContext.registerFunction and just add 
SQLContext.udf.register.
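
For reference, a minimal Scala sketch of the SQLContext.udf.register shape that the Python API would mirror (the function name and query below are illustrative only):

{code}
import org.apache.spark.sql.SQLContext

def registerExample(sqlContext: SQLContext): Unit = {
  // Scala already goes through the udf handle; the proposal mirrors this in Python.
  sqlContext.udf.register("strLen", (s: String) => s.length)
  // The UDF is then usable from SQL, e.g.:
  //   sqlContext.sql("SELECT strLen(name) FROM people")
}
{code}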








[jira] [Created] (SPARK-6604) Specify ip of python server socket

2015-03-30 Thread Weizhong (JIRA)
Weizhong created SPARK-6604:
---

 Summary: Specify ip of python server socket
 Key: SPARK-6604
 URL: https://issues.apache.org/jira/browse/SPARK-6604
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Weizhong
Priority: Minor


The driver currently starts a server socket bound to a wildcard IP; binding it 
to 127.0.0.0 would be more reasonable, since the socket is only used by the 
local Python process.
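
A minimal Scala sketch of the difference, using plain JDK sockets (the exact PySpark call site is not shown here, and the sketch binds to 127.0.0.1 as the loopback host address):

{code}
import java.net.{InetAddress, ServerSocket}

// Current behaviour: an ephemeral port bound to the wildcard address (0.0.0.0),
// reachable from other hosts.
val wildcardServer = new ServerSocket(0)

// Proposed behaviour: bind only to the loopback interface, since the socket is
// only ever used by the local Python process.
val loopbackServer = new ServerSocket(0, 1, InetAddress.getByName("127.0.0.1"))

println(s"wildcard: ${wildcardServer.getInetAddress}, loopback: ${loopbackServer.getInetAddress}")
wildcardServer.close()
loopbackServer.close()
{code}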






[jira] [Commented] (SPARK-6605) Same transformation in DStream leads to different result

2015-03-30 Thread SaintBacchus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386324#comment-14386324
 ] 

SaintBacchus commented on SPARK-6605:
-

Hi, [~tdas] can you have a look at this problem? Is this acceptable for 
streaming users?

 Same transformation in DStream leads to different result
 

 Key: SPARK-6605
 URL: https://issues.apache.org/jira/browse/SPARK-6605
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: SaintBacchus
 Fix For: 1.4.0


 The transformation *reduceByKeyAndWindow* has two implementations: one uses 
 *WindowedDStream* and the other uses *ReducedWindowedDStream*.
 The results are normally the same, except when an empty window occurs.
 Taking word count as an example, if no data arrives for a period longer than 
 the window duration, the first *reduceByKeyAndWindow* produces no elements, 
 but the second produces many elements with zero values.






[jira] [Assigned] (SPARK-6604) Specify ip of python server socket

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6604:
---

Assignee: (was: Apache Spark)

 Specify ip of python server socket
 --

 Key: SPARK-6604
 URL: https://issues.apache.org/jira/browse/SPARK-6604
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Weizhong
Priority: Minor

 The driver currently starts a server socket bound to a wildcard IP; binding it 
 to 127.0.0.0 would be more reasonable, since the socket is only used by the 
 local Python process.






[jira] [Commented] (SPARK-5895) Add VectorSlicer

2015-03-30 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386296#comment-14386296
 ] 

Xusen Yin commented on SPARK-5895:
--

Is it possible to select by type or value? Like in Pandas:

data = raw_data.select_dtypes(include=[np.float64, np.int64])

Say, I want to select all numeric types such as int, double, or float, or I 
want to select all columns that contain NaN.



 Add VectorSlicer
 

 Key: SPARK-5895
 URL: https://issues.apache.org/jira/browse/SPARK-5895
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng

 `VectorSlicer` takes a vector column and outputs a vector column with a subset 
 of features.
 {code}
 val vs = new VectorSlicer()
   .setInputCol("user")
   .setSelectedFeatures("age", "salary")
   .setOutputCol("usefulUserFeatures")
 {code}
 We should allow specifying selected features by indices and by names. It 
 should preserve the output names.






[jira] [Commented] (SPARK-6594) Spark Streaming can't receive data from kafka

2015-03-30 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386313#comment-14386313
 ] 

Saisai Shao commented on SPARK-6594:


Hi [~q79969786], would you please paste more of the warning logs? From your 
code it is hard to tell why Spark Streaming cannot receive data. Also, would 
you please let us know which version of Spark you're using?

I suspect this issue is most likely a misuse of Spark Streaming and Kafka 
rather than a bug; lots of people use Kafka with Spark Streaming, so a real 
bug would have been reported quickly.

 Spark Streaming can't receive data from kafka
 -

 Key: SPARK-6594
 URL: https://issues.apache.org/jira/browse/SPARK-6594
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.1
 Environment: kafka_2.10-0.8.1.1 + Spark-1.2.1
Reporter: q79969786

 I use KafkaUtils to receive data from Kafka in my Spark Streaming application 
 as follows:
 Map<String, Integer> topicorder = new HashMap<String, Integer>();
 topicorder.put("order", Integer.valueOf(readThread));
 JavaPairReceiverInputDStream<String, String> jPRIDSOrder = 
 KafkaUtils.createStream(jssc, zkQuorum, group, topicorder);
 It worked well at first, but after I submitted this application several times, 
 Spark Streaming can't receive data anymore (Kafka itself works well).






[jira] [Assigned] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6600:
---

Assignee: (was: Apache Spark)

 Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway  
 --

 Key: SPARK-6600
 URL: https://issues.apache.org/jira/browse/SPARK-6600
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Florian Verhein

 Use case: a user has set up the Hadoop HDFS NFS gateway service on their 
 spark_ec2.py-launched cluster and wants to mount it on their local machine.
 This requires the following ports to be opened in the incoming rule set for 
 MASTER, for both UDP and TCP: 111, 2049, 4242.
 (I have tried this and it works.)
 Note that this issue *does not* cover the implementation of an HDFS NFS 
 gateway module in the spark-ec2 project. See the linked issue.
 Reference:
 https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html






[jira] [Assigned] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6600:
---

Assignee: Apache Spark

 Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway  
 --

 Key: SPARK-6600
 URL: https://issues.apache.org/jira/browse/SPARK-6600
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Florian Verhein
Assignee: Apache Spark

 Use case: a user has set up the Hadoop HDFS NFS gateway service on their 
 spark_ec2.py-launched cluster and wants to mount it on their local machine.
 This requires the following ports to be opened in the incoming rule set for 
 MASTER, for both UDP and TCP: 111, 2049, 4242.
 (I have tried this and it works.)
 Note that this issue *does not* cover the implementation of an HDFS NFS 
 gateway module in the spark-ec2 project. See the linked issue.
 Reference:
 https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html






[jira] [Commented] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway

2015-03-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386267#comment-14386267
 ] 

Apache Spark commented on SPARK-6600:
-

User 'florianverhein' has created a pull request for this issue:
https://github.com/apache/spark/pull/5257

 Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway  
 --

 Key: SPARK-6600
 URL: https://issues.apache.org/jira/browse/SPARK-6600
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Florian Verhein

 Use case: a user has set up the Hadoop HDFS NFS gateway service on their 
 spark_ec2.py-launched cluster and wants to mount it on their local machine.
 This requires the following ports to be opened in the incoming rule set for 
 MASTER, for both UDP and TCP: 111, 2049, 4242.
 (I have tried this and it works.)
 Note that this issue *does not* cover the implementation of an HDFS NFS 
 gateway module in the spark-ec2 project. See the linked issue.
 Reference:
 https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html






[jira] [Commented] (SPARK-6604) Specify ip of python server socket

2015-03-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386268#comment-14386268
 ] 

Apache Spark commented on SPARK-6604:
-

User 'Sephiroth-Lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5256

 Specify ip of python server socket
 --

 Key: SPARK-6604
 URL: https://issues.apache.org/jira/browse/SPARK-6604
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Weizhong
Priority: Minor

 The driver currently starts a server socket bound to a wildcard IP; binding it 
 to 127.0.0.0 would be more reasonable, since the socket is only used by the 
 local Python process.






[jira] [Assigned] (SPARK-6604) Specify ip of python server socket

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6604:
---

Assignee: Apache Spark

 Specify ip of python server socket
 --

 Key: SPARK-6604
 URL: https://issues.apache.org/jira/browse/SPARK-6604
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Weizhong
Assignee: Apache Spark
Priority: Minor

 The driver currently starts a server socket bound to a wildcard IP; binding it 
 to 127.0.0.0 would be more reasonable, since the socket is only used by the 
 local Python process.






[jira] [Created] (SPARK-6605) Same transformation in DStream leads to different result

2015-03-30 Thread SaintBacchus (JIRA)
SaintBacchus created SPARK-6605:
---

 Summary: Same transformation in DStream leads to different result
 Key: SPARK-6605
 URL: https://issues.apache.org/jira/browse/SPARK-6605
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: SaintBacchus
 Fix For: 1.4.0


The transformation *reduceByKeyAndWindow* has two implementations: one uses 
*WindowedDStream* and the other uses *ReducedWindowedDStream*.
The results are normally the same, except when an empty window occurs.
Taking word count as an example, if no data arrives for a period longer than 
the window duration, the first *reduceByKeyAndWindow* produces no elements, 
but the second produces many elements with zero values.
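
A minimal Scala sketch of the two variants being compared; `pairs` is an assumed DStream[(String, Int)] and the durations are illustrative (the inverse-function variant also requires checkpointing to be enabled):

{code}
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

def compare(pairs: DStream[(String, Int)]): Unit = {
  // Variant 1: windows the stream and re-reduces it from scratch each interval;
  // an empty window simply produces no elements.
  val counts1 = pairs.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

  // Variant 2: incremental variant with an inverse reduce function, backed by
  // ReducedWindowedDStream; keys whose counts drop to zero can remain in the
  // window state unless a filter function removes them.
  val counts2 = pairs.reduceByKeyAndWindow(_ + _, _ - _, Seconds(30), Seconds(10))

  counts1.print()
  counts2.print()
}
{code}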






[jira] [Comment Edited] (SPARK-5895) Add VectorSlicer

2015-03-30 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386365#comment-14386365
 ] 

Xusen Yin edited comment on SPARK-5895 at 3/30/15 8:14 AM:
---

I have another concern here.

We cannot reveal each column name of a `Vector`. Given the selected features 
"age" and "salary", how do we select these two columns from a vector? One 
solution is to give it a list of column names, say `setColumnNames(List[String])`. 
But a more natural way to solve it is to add the list of column names in the 
`VectorAssembler`. 


was (Author: yinxusen):
I have another concern here.

We can not reveal each column name of a `Vector`. Given the selected features 
age and salary, how to select these two columns from a vector? One solution 
is giving it a list of column names, say, `setColumnNames(List[String])`. But a 
more natural way to solve it is adding the list of column names in 
`VectorAssembler`. 

 Add VectorSlicer
 

 Key: SPARK-5895
 URL: https://issues.apache.org/jira/browse/SPARK-5895
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng

 `VectorSlicer` takes a vector column and outputs a vector column with a subset 
 of features.
 {code}
 val vs = new VectorSlicer()
   .setInputCol("user")
   .setSelectedFeatures("age", "salary")
   .setOutputCol("usefulUserFeatures")
 {code}
 We should allow specifying selected features by indices and by names. It 
 should preserve the output names.






[jira] [Commented] (SPARK-5894) Add PolynomialMapper

2015-03-30 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386346#comment-14386346
 ] 

Xusen Yin commented on SPARK-5894:
--

[~mengxr] [~josephkb] Do you have time to check it? Thanks!

 Add PolynomialMapper
 

 Key: SPARK-5894
 URL: https://issues.apache.org/jira/browse/SPARK-5894
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng

 `PolynomialMapper` takes a vector column and outputs a vector column with 
 polynomial feature mapping.
 {code}
 val poly = new PolynomialMapper()
   .setInputCol("features")
   .setDegree(2)
   .setOutputCols("polyFeatures")
 {code}
 It should handle the output feature names properly. Maybe we can find a 
 better name for it instead of calling it `PolynomialMapper`.






[jira] [Commented] (SPARK-6606) Accumulator deserialized twice because the NarrowCoGroupSplitDep contains rdd object.

2015-03-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386355#comment-14386355
 ] 

Apache Spark commented on SPARK-6606:
-

User 'suyanNone' has created a pull request for this issue:
https://github.com/apache/spark/pull/5259

 Accumulator deserialized twice because the NarrowCoGroupSplitDep contains rdd 
 object.
 -

 Key: SPARK-6606
 URL: https://issues.apache.org/jira/browse/SPARK-6606
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0, 1.3.0
Reporter: SuYan

 1. With code like the example below, the accumulator is deserialized twice.
 First:
 {code}
 task = ser.deserialize[Task[Any]](taskBytes, 
   Thread.currentThread.getContextClassLoader)
 {code}
 Second:
 {code}
 val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
   ByteBuffer.wrap(taskBinary.value), 
   Thread.currentThread.getContextClassLoader)
 {code}
 The first deserialization is not what is expected, because a ResultTask or 
 ShuffleMapTask carries a partition object.
 In the class
 {code}
 CoGroupedRDD[K](@transient var rdds: Seq[RDD[_ <: Product2[K, _]]], part: 
   Partitioner)
 {code}
 the CoGroupPartition may contain a CoGroupSplitDep:
 {code}
 NarrowCoGroupSplitDep(
     rdd: RDD[_],
     splitIndex: Int,
     var split: Partition
   ) extends CoGroupSplitDep {
 {code}
 That *NarrowCoGroupSplitDep* pulls the rdd object into the partition, which 
 causes the first deserialization.
 Example:
 {code}
 val acc1 = sc.accumulator(0, "test1")
 val acc2 = sc.accumulator(0, "test2")
 val rdd1 = sc.parallelize((1 to 10).toSeq, 3)
 val rdd2 = sc.parallelize((1 to 10).toSeq, 3)
 val combine1 = rdd1.map { case a => (a, 1) }.combineByKey(a => {
     acc1 += 1
     a
   }, (a: Int, b: Int) => {
     a + b
   },
   (a: Int, b: Int) => {
     a + b
   }, new HashPartitioner(3), mapSideCombine = false)
 val combine2 = rdd2.map { case a => (a, 1) }.combineByKey(
   a => {
     acc2 += 1
     a
   },
   (a: Int, b: Int) => {
     a + b
   },
   (a: Int, b: Int) => {
     a + b
   }, new HashPartitioner(3), mapSideCombine = false)
 combine1.cogroup(combine2, new HashPartitioner(3)).count()
 {code}






[jira] [Assigned] (SPARK-6606) Accumulator deserialized twice because the NarrowCoGroupSplitDep contains rdd object.

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6606:
---

Assignee: Apache Spark

 Accumulator deserialized twice because the NarrowCoGroupSplitDep contains rdd 
 object.
 -

 Key: SPARK-6606
 URL: https://issues.apache.org/jira/browse/SPARK-6606
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0, 1.3.0
Reporter: SuYan
Assignee: Apache Spark

 1. With code like the example below, the accumulator is deserialized twice.
 First:
 {code}
 task = ser.deserialize[Task[Any]](taskBytes, 
   Thread.currentThread.getContextClassLoader)
 {code}
 Second:
 {code}
 val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
   ByteBuffer.wrap(taskBinary.value), 
   Thread.currentThread.getContextClassLoader)
 {code}
 The first deserialization is not what is expected, because a ResultTask or 
 ShuffleMapTask carries a partition object.
 In the class
 {code}
 CoGroupedRDD[K](@transient var rdds: Seq[RDD[_ <: Product2[K, _]]], part: 
   Partitioner)
 {code}
 the CoGroupPartition may contain a CoGroupSplitDep:
 {code}
 NarrowCoGroupSplitDep(
     rdd: RDD[_],
     splitIndex: Int,
     var split: Partition
   ) extends CoGroupSplitDep {
 {code}
 That *NarrowCoGroupSplitDep* pulls the rdd object into the partition, which 
 causes the first deserialization.
 Example:
 {code}
 val acc1 = sc.accumulator(0, "test1")
 val acc2 = sc.accumulator(0, "test2")
 val rdd1 = sc.parallelize((1 to 10).toSeq, 3)
 val rdd2 = sc.parallelize((1 to 10).toSeq, 3)
 val combine1 = rdd1.map { case a => (a, 1) }.combineByKey(a => {
     acc1 += 1
     a
   }, (a: Int, b: Int) => {
     a + b
   },
   (a: Int, b: Int) => {
     a + b
   }, new HashPartitioner(3), mapSideCombine = false)
 val combine2 = rdd2.map { case a => (a, 1) }.combineByKey(
   a => {
     acc2 += 1
     a
   },
   (a: Int, b: Int) => {
     a + b
   },
   (a: Int, b: Int) => {
     a + b
   }, new HashPartitioner(3), mapSideCombine = false)
 combine1.cogroup(combine2, new HashPartitioner(3)).count()
 {code}






[jira] [Resolved] (SPARK-6594) Spark Streaming can't receive data from kafka

2015-03-30 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-6594.

Resolution: Not a Problem

 Spark Streaming can't receive data from kafka
 -

 Key: SPARK-6594
 URL: https://issues.apache.org/jira/browse/SPARK-6594
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.1
 Environment: kafka_2.10-0.8.1.1 + Spark-1.2.1
Reporter: q79969786

 I use KafkaUtils to receive data from Kafka in my Spark Streaming application 
 as follows:
 Map<String, Integer> topicorder = new HashMap<String, Integer>();
 topicorder.put("order", Integer.valueOf(readThread));
 JavaPairReceiverInputDStream<String, String> jPRIDSOrder = 
 KafkaUtils.createStream(jssc, zkQuorum, group, topicorder);
 It worked well at first, but after I submitted this application several times, 
 Spark Streaming can't receive data anymore (Kafka itself works well).






[jira] [Commented] (SPARK-6528) IDF transformer

2015-03-30 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386460#comment-14386460
 ] 

Xusen Yin commented on SPARK-6528:
--

[~josephkb] Please assign it to me.

 IDF transformer
 ---

 Key: SPARK-6528
 URL: https://issues.apache.org/jira/browse/SPARK-6528
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xusen Yin








[jira] [Commented] (SPARK-6401) Unable to load an old API input format in Spark streaming

2015-03-30 Thread Thomas F. (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386491#comment-14386491
 ] 

Thomas F. commented on SPARK-6401:
--

How do we proceed with this issue?

Since DStream already has saveAsHadoopFiles for the old OutputFormat and 
saveAsNewAPIHadoopFiles for the new OutputFormat, do we rename 
StreamingContext.fileStream() to StreamingContext.newAPIHadoopFileStream (taking 
the new InputFormat) and then add hadoopFileStream (taking the old InputFormat), 
so that streaming is completely aligned with Spark Core for Hadoop input/output?

Regards.
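
A minimal Scala sketch of the proposed naming (the `hadoopFileStream` method is hypothetical at this point; only the new-API `fileStream` exists today):

{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.{TextInputFormat => NewTextInputFormat}
import org.apache.spark.streaming.StreamingContext

def example(ssc: StreamingContext): Unit = {
  // Today: only new-API (org.apache.hadoop.mapreduce) input formats are accepted.
  val newApi = ssc.fileStream[LongWritable, Text, NewTextInputFormat]("hdfs:///input")
  newApi.map(_._2.toString).print()

  // Proposal: an old-API (org.apache.hadoop.mapred) counterpart, aligned with
  // SparkContext.hadoopFile vs. SparkContext.newAPIHadoopFile. Hypothetical:
  // val oldApi = ssc.hadoopFileStream[LongWritable, Text,
  //   org.apache.hadoop.mapred.TextInputFormat]("hdfs:///input")
}
{code}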


 Unable to load an old API input format in Spark streaming
 

 Key: SPARK-6401
 URL: https://issues.apache.org/jira/browse/SPARK-6401
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Rémy DUBOIS
Priority: Minor

 The fileStream method of the JavaStreamingContext class does not allow using 
 an old-API InputFormat.
 This feature exists in the Spark batch API but not in Streaming.






[jira] [Created] (SPARK-6606) Accumulator deserialized twice because the NarrowCoGroupSplitDep contains rdd object.

2015-03-30 Thread SuYan (JIRA)
SuYan created SPARK-6606:


 Summary: Accumulator deserialized twice because the 
NarrowCoGroupSplitDep contains rdd object.
 Key: SPARK-6606
 URL: https://issues.apache.org/jira/browse/SPARK-6606
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0, 1.2.0
Reporter: SuYan


1. With code like the example below, the accumulator is deserialized twice.
First:
{code}
task = ser.deserialize[Task[Any]](taskBytes, 
  Thread.currentThread.getContextClassLoader)
{code}
Second:
{code}
val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
  ByteBuffer.wrap(taskBinary.value), 
  Thread.currentThread.getContextClassLoader)
{code}
The first deserialization is not what is expected, because a ResultTask or 
ShuffleMapTask carries a partition object.

In the class
{code}
CoGroupedRDD[K](@transient var rdds: Seq[RDD[_ <: Product2[K, _]]], part: 
  Partitioner)
{code}
the CoGroupPartition may contain a CoGroupSplitDep:
{code}
NarrowCoGroupSplitDep(
    rdd: RDD[_],
    splitIndex: Int,
    var split: Partition
  ) extends CoGroupSplitDep {
{code}

That *NarrowCoGroupSplitDep* pulls the rdd object into the partition, which 
causes the first deserialization.

Example:
{code}
val acc1 = sc.accumulator(0, "test1")
val acc2 = sc.accumulator(0, "test2")
val rdd1 = sc.parallelize((1 to 10).toSeq, 3)
val rdd2 = sc.parallelize((1 to 10).toSeq, 3)
val combine1 = rdd1.map { case a => (a, 1) }.combineByKey(a => {
    acc1 += 1
    a
  }, (a: Int, b: Int) => {
    a + b
  },
  (a: Int, b: Int) => {
    a + b
  }, new HashPartitioner(3), mapSideCombine = false)

val combine2 = rdd2.map { case a => (a, 1) }.combineByKey(
  a => {
    acc2 += 1
    a
  },
  (a: Int, b: Int) => {
    a + b
  },
  (a: Int, b: Int) => {
    a + b
  }, new HashPartitioner(3), mapSideCombine = false)

combine1.cogroup(combine2, new HashPartitioner(3)).count()
{code}






[jira] [Commented] (SPARK-6258) Python MLlib API missing items: Clustering

2015-03-30 Thread Hrishikesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386394#comment-14386394
 ] 

Hrishikesh commented on SPARK-6258:
---

Hi [~josephkb]
I am a newbie to spark and I would like to contribute. Could you assign this 
ticket to me?

 Python MLlib API missing items: Clustering
 --

 Key: SPARK-6258
 URL: https://issues.apache.org/jira/browse/SPARK-6258
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 KMeans
 * setEpsilon
 * setInitializationSteps
 KMeansModel
 * computeCost
 * k
 GaussianMixture
 * setInitialModel
 GaussianMixtureModel
 * k
 Completely missing items which should be fixed in separate JIRAs (which have 
 been created and linked to the umbrella JIRA)
 * LDA
 * PowerIterationClustering
 * StreamingKMeans






[jira] [Commented] (SPARK-6401) Unable to load an old API input format in Spark streaming

2015-03-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386499#comment-14386499
 ] 

Sean Owen commented on SPARK-6401:
--

Since it's technically an API change to streaming I'd look for a nod from 
[~tdas] before proceeding.

 Unable to load an old API input format in Spark streaming
 

 Key: SPARK-6401
 URL: https://issues.apache.org/jira/browse/SPARK-6401
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Rémy DUBOIS
Priority: Minor

 The fileStream method of the JavaStreamingContext class does not allow using 
 an old-API InputFormat.
 This feature exists in the Spark batch API but not in Streaming.






[jira] [Commented] (SPARK-5895) Add VectorSlicer

2015-03-30 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386365#comment-14386365
 ] 

Xusen Yin commented on SPARK-5895:
--

I have another concern here.

We cannot reveal each column name of a `Vector`. Given the selected features 
"age" and "salary", how do we select these two columns from a vector? One 
solution is to give it a list of column names, say `setColumnNames(List[String])`. 
But a more natural way to solve it is to add the list of column names in 
`VectorAssembler`. 

 Add VectorSlicer
 

 Key: SPARK-5895
 URL: https://issues.apache.org/jira/browse/SPARK-5895
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng

 `VectorSlicer` takes a vector column and outputs a vector column with a subset 
 of features.
 {code}
 val vs = new VectorSlicer()
   .setInputCol("user")
   .setSelectedFeatures("age", "salary")
   .setOutputCol("usefulUserFeatures")
 {code}
 We should allow specifying selected features by indices and by names. It 
 should preserve the output names.






[jira] [Comment Edited] (SPARK-6594) Spark Streaming can't receive data from kafka

2015-03-30 Thread q79969786 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386380#comment-14386380
 ] 

q79969786 edited comment on SPARK-6594 at 3/30/15 8:25 AM:
---

thanks, it worked after restart kafka


was (Author: q79969786):
thanks, it worked after reboot kafka

 Spark Streaming can't receive data from kafka
 -

 Key: SPARK-6594
 URL: https://issues.apache.org/jira/browse/SPARK-6594
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.1
 Environment: kafka_2.10-0.8.1.1 + Spark-1.2.1
Reporter: q79969786

 I use KafkaUtils to receive data from Kafka in my Spark Streaming application 
 as follows:
 Map<String, Integer> topicorder = new HashMap<String, Integer>();
 topicorder.put("order", Integer.valueOf(readThread));
 JavaPairReceiverInputDStream<String, String> jPRIDSOrder = 
 KafkaUtils.createStream(jssc, zkQuorum, group, topicorder);
 It worked well at first, but after I submitted this application several times, 
 Spark Streaming can't receive data anymore (Kafka itself works well).






[jira] [Commented] (SPARK-6594) Spark Streaming can't receive data from kafka

2015-03-30 Thread q79969786 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386380#comment-14386380
 ] 

q79969786 commented on SPARK-6594:
--

thanks, it worked after reboot kafka

 Spark Streaming can't receive data from kafka
 -

 Key: SPARK-6594
 URL: https://issues.apache.org/jira/browse/SPARK-6594
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.1
 Environment: kafka_2.10-0.8.1.1 + Spark-1.2.1
Reporter: q79969786

 I use KafkaUtils to receive data from Kafka in my Spark Streaming application 
 as follows:
 Map<String, Integer> topicorder = new HashMap<String, Integer>();
 topicorder.put("order", Integer.valueOf(readThread));
 JavaPairReceiverInputDStream<String, String> jPRIDSOrder = 
 KafkaUtils.createStream(jssc, zkQuorum, group, topicorder);
 It worked well at first, but after I submitted this application several times, 
 Spark Streaming can't receive data anymore (Kafka itself works well).






[jira] [Commented] (SPARK-6605) Same transformation in DStream leads to different result

2015-03-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386510#comment-14386510
 ] 

Sean Owen commented on SPARK-6605:
--

What other implementation are you referring to? I can't see one that uses 
{{WindowedDStream}}. Are you talking about calling {{window}} and 
{{reduceByKey}} separately? What do you mean by many zeroes -- if there is no 
data, what could be the key?

 Same transformation in DStream leads to different result
 

 Key: SPARK-6605
 URL: https://issues.apache.org/jira/browse/SPARK-6605
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: SaintBacchus
 Fix For: 1.4.0


 The transformation *reduceByKeyAndWindow* has two implementations: one uses 
 *WindowedDStream* and the other uses *ReducedWindowedDStream*.
 The results are normally the same, except when an empty window occurs.
 Taking word count as an example, if no data arrives for a period longer than 
 the window duration, the first *reduceByKeyAndWindow* produces no elements, 
 but the second produces many elements with zero values.






[jira] [Assigned] (SPARK-4226) SparkSQL - Add support for subqueries in predicates

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4226:
---

Assignee: Apache Spark

 SparkSQL - Add support for subqueries in predicates
 ---

 Key: SPARK-4226
 URL: https://issues.apache.org/jira/browse/SPARK-4226
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
 Environment: Spark 1.2 snapshot
Reporter: Terry Siu
Assignee: Apache Spark

 I have a test table defined in Hive as follows:
 CREATE TABLE sparkbug (
   id INT,
   event STRING
 ) STORED AS PARQUET;
 and inserted some sample data with ids 1, 2, 3.
 In a Spark shell, I then create a HiveContext and execute the following HQL to 
 test out subquery predicates:
 val hc = new HiveContext(sc)
 hc.hql("select customerid from sparkbug where customerid in (select 
 customerid from sparkbug where customerid in (2,3))")
 I get the following error:
 java.lang.RuntimeException: Unsupported language features in query: select 
 customerid from sparkbug where customerid in (select customerid from sparkbug 
 where customerid in (2,3))
 TOK_QUERY
   TOK_FROM
 TOK_TABREF
   TOK_TABNAME
 sparkbug
   TOK_INSERT
 TOK_DESTINATION
   TOK_DIR
 TOK_TMP_FILE
 TOK_SELECT
   TOK_SELEXPR
 TOK_TABLE_OR_COL
   customerid
 TOK_WHERE
   TOK_SUBQUERY_EXPR
 TOK_SUBQUERY_OP
   in
 TOK_QUERY
   TOK_FROM
 TOK_TABREF
   TOK_TABNAME
 sparkbug
   TOK_INSERT
 TOK_DESTINATION
   TOK_DIR
 TOK_TMP_FILE
 TOK_SELECT
   TOK_SELEXPR
 TOK_TABLE_OR_COL
   customerid
 TOK_WHERE
   TOK_FUNCTION
 in
 TOK_TABLE_OR_COL
   customerid
 2
 3
 TOK_TABLE_OR_COL
   customerid
 scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
 TOK_SUBQUERY_EXPR :
 TOK_SUBQUERY_EXPR
   TOK_SUBQUERY_OP
 in
   TOK_QUERY
 TOK_FROM
   TOK_TABREF
 TOK_TABNAME
   sparkbug
 TOK_INSERT
   TOK_DESTINATION
 TOK_DIR
   TOK_TMP_FILE
   TOK_SELECT
 TOK_SELEXPR
   TOK_TABLE_OR_COL
 customerid
   TOK_WHERE
 TOK_FUNCTION
   in
   TOK_TABLE_OR_COL
 customerid
   2
   3
   TOK_TABLE_OR_COL
 customerid
 org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)
 at scala.sys.package$.error(package.scala:27)
 at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)
 at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
 at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
 at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
 This thread, 
 http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html,
 also brings up the lack of subquery support in Spark SQL. It would be nice to 
 have subquery predicate support in a near-future release (1.3, maybe?).
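
As an illustration of a workaround until IN (subquery) parses, the same predicate can usually be expressed with a LEFT SEMI JOIN on a derived table, which HiveQL in Spark SQL does accept (the query shape follows the example above and is only a sketch):

{code}
val rewritten = hc.hql(
  """SELECT s.customerid
    |FROM sparkbug s
    |LEFT SEMI JOIN
    |  (SELECT customerid FROM sparkbug WHERE customerid IN (2, 3)) t
    |ON (s.customerid = t.customerid)""".stripMargin)
rewritten.collect().foreach(println)
{code}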






[jira] [Assigned] (SPARK-6608) Make DataFrame.rdd a lazy val

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6608:
---

Assignee: (was: Apache Spark)

 Make DataFrame.rdd a lazy val
 -

 Key: SPARK-6608
 URL: https://issues.apache.org/jira/browse/SPARK-6608
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Priority: Minor

 Before 1.3.0, {{SchemaRDD.id}} worked as a unique identifier of each 
 {{SchemaRDD}}. In 1.3.0, unlike {{SchemaRDD}}, {{DataFrame}} is no longer an 
 RDD, and {{DataFrame.rdd}} is actually a function which always returns a new 
 RDD instance. Making {{DataFrame.rdd}} a {{lazy val}} should bring the unique 
 identifier back.
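
A minimal sketch of the change being proposed (names and structure are illustrative, not the actual DataFrame source):

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

class DataFrameLike(buildRdd: () => RDD[Row]) {
  // Current shape: a def, so every call builds a new RDD with a new id.
  def rddAsDef: RDD[Row] = buildRdd()

  // Proposed shape: a lazy val, so the RDD (and its id) is created once and can
  // again serve as a stable identifier, as SchemaRDD.id did before 1.3.0.
  lazy val rdd: RDD[Row] = buildRdd()
}
{code}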






[jira] [Commented] (SPARK-6608) Make DataFrame.rdd a lazy val

2015-03-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386635#comment-14386635
 ] 

Apache Spark commented on SPARK-6608:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/5265

 Make DataFrame.rdd a lazy val
 -

 Key: SPARK-6608
 URL: https://issues.apache.org/jira/browse/SPARK-6608
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Priority: Minor

 Before 1.3.0, {{SchemaRDD.id}} worked as a unique identifier of each 
 {{SchemaRDD}}. In 1.3.0, unlike {{SchemaRDD}}, {{DataFrame}} is no longer an 
 RDD, and {{DataFrame.rdd}} is actually a function which always returns a new 
 RDD instance. Making {{DataFrame.rdd}} a {{lazy val}} should bring the unique 
 identifier back.






[jira] [Assigned] (SPARK-6608) Make DataFrame.rdd a lazy val

2015-03-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-6608:
-

Assignee: Cheng Lian

 Make DataFrame.rdd a lazy val
 -

 Key: SPARK-6608
 URL: https://issues.apache.org/jira/browse/SPARK-6608
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor

 Before 1.3.0, {{SchemaRDD.id}} worked as a unique identifier of each 
 {{SchemaRDD}}. In 1.3.0, unlike {{SchemaRDD}}, {{DataFrame}} is no longer an 
 RDD, and {{DataFrame.rdd}} is actually a function which always returns a new 
 RDD instance. Making {{DataFrame.rdd}} a {{lazy val}} should bring the unique 
 identifier back.






[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-03-30 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386636#comment-14386636
 ] 

Kai Sasaki commented on SPARK-4036:
---

[~mengxr] I wrote a design doc based on your advice. Thank you.

 Add Conditional Random Fields (CRF) algorithm to Spark MLlib
 

 Key: SPARK-4036
 URL: https://issues.apache.org/jira/browse/SPARK-4036
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Guoqiang Li
Assignee: Kai Sasaki

 Conditional random fields (CRFs) are a class of statistical modelling methods 
 often applied in pattern recognition and machine learning, where they are 
 used for structured prediction. 
 The paper: 
 http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf






[jira] [Assigned] (SPARK-6528) IDF transformer

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6528:
---

Assignee: Apache Spark

 IDF transformer
 ---

 Key: SPARK-6528
 URL: https://issues.apache.org/jira/browse/SPARK-6528
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xusen Yin
Assignee: Apache Spark








[jira] [Assigned] (SPARK-6528) IDF transformer

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6528:
---

Assignee: (was: Apache Spark)

 IDF transformer
 ---

 Key: SPARK-6528
 URL: https://issues.apache.org/jira/browse/SPARK-6528
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xusen Yin








[jira] [Commented] (SPARK-6528) IDF transformer

2015-03-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386664#comment-14386664
 ] 

Apache Spark commented on SPARK-6528:
-

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/5266

 IDF transformer
 ---

 Key: SPARK-6528
 URL: https://issues.apache.org/jira/browse/SPARK-6528
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xusen Yin








[jira] [Updated] (SPARK-6607) Aggregation attribute name including special chars '(' and ')' should be replaced before generating Parquet schema

2015-03-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6607:
--
Assignee: Liang-Chi Hsieh

 Aggregation attribute name including special chars '(' and ')' should be 
 replaced before generating Parquet schema
 --

 Key: SPARK-6607
 URL: https://issues.apache.org/jira/browse/SPARK-6607
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh

 '(' and ')' are special characters used in the Parquet schema for type 
 annotation. When we run an aggregation query, we obtain attribute names such 
 as MAX(a).
 If we directly store the generated DataFrame as a Parquet file, it causes a 
 failure when reading and parsing the stored schema string.
 Several methods could be adopted to solve this. This PR uses the simplest one: 
 just replace the attribute names before generating the Parquet schema based on 
 these attributes.
 Another possible method might be modifying all aggregation expression names 
 from func(column) to func[column].
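
A minimal sketch of the two name-rewriting options mentioned above (the exact character set and replacement scheme are illustrative):

{code}
// Option 1: replace the offending characters before writing the schema.
def sanitizeForParquet(attributeName: String): String =
  attributeName.replaceAll("[()]", "_")

// Option 2: rename aggregation expressions from func(column) to func[column].
def bracketStyle(attributeName: String): String =
  attributeName.replace('(', '[').replace(')', ']')

// e.g. "MAX(a)" becomes "MAX_a_" or "MAX[a]"
{code}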






[jira] [Created] (SPARK-6609) explicit checkpoint does not work

2015-03-30 Thread lisendong (JIRA)
lisendong created SPARK-6609:


 Summary: explicit checkpoint does not work
 Key: SPARK-6609
 URL: https://issues.apache.org/jira/browse/SPARK-6609
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.3.0
Reporter: lisendong
Priority: Critical
 Fix For: 1.3.0


https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala


From the code, I found that the explicit-feedback ALS does not call checkpoint().
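
A hedged sketch of the kind of periodic checkpointing being asked for in the explicit-feedback path (the loop shape, names, and interval are illustrative, not the actual ALS.scala code):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def iterateWithCheckpoint[T](sc: SparkContext, initial: RDD[T], maxIter: Int,
                             checkpointInterval: Int = 10)(step: RDD[T] => RDD[T]): RDD[T] = {
  var current = initial
  for (iter <- 1 to maxIter) {
    current = step(current)
    // Truncate the growing lineage every few iterations, provided a
    // checkpoint directory has been set on the SparkContext.
    if (sc.getCheckpointDir.isDefined && iter % checkpointInterval == 0) {
      current.checkpoint()
      current.count() // materialize so the checkpoint actually happens
    }
  }
  current
}
{code}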








[jira] [Created] (SPARK-6610) explicit checkpoint does not work

2015-03-30 Thread lisendong (JIRA)
lisendong created SPARK-6610:


 Summary: explicit checkpoint does not work
 Key: SPARK-6610
 URL: https://issues.apache.org/jira/browse/SPARK-6610
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.3.0
Reporter: lisendong
Priority: Critical
 Fix For: 1.3.0


https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala


From the code, I found that the explicit-feedback ALS does not call checkpoint().








[jira] [Updated] (SPARK-6610) MLlib explicit ALS checkpoint does not work

2015-03-30 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-6610:
---
Target Version/s: 1.3.1  (was: 1.3.0)
   Fix Version/s: (was: 1.3.0)

 MLlib explicit ALS checkpoint does not work
 

 Key: SPARK-6610
 URL: https://issues.apache.org/jira/browse/SPARK-6610
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.3.0
Reporter: lisendong
Priority: Critical
  Labels: spark
   Original Estimate: 24h
  Remaining Estimate: 24h

 https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
 From the code, I found that the explicit-feedback ALS does not call checkpoint().






[jira] [Updated] (SPARK-6607) Aggregation attribute name including special chars '(' and ')' should be replaced before generating Parquet schema

2015-03-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6607:
--
 Target Version/s: 1.4.0
Affects Version/s: 1.1.1
   1.2.1
   1.3.0

 Aggregation attribute name including special chars '(' and ')' should be 
 replaced before generating Parquet schema
 --

 Key: SPARK-6607
 URL: https://issues.apache.org/jira/browse/SPARK-6607
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.1, 1.2.1, 1.3.0
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh

 '(' and ')' are special characters used in the Parquet schema for type 
 annotation. When we run an aggregation query, we obtain attribute names such 
 as MAX(a).
 If we directly store the generated DataFrame as a Parquet file, it causes a 
 failure when reading and parsing the stored schema string.
 Several methods could be adopted to solve this. This PR uses the simplest one: 
 just replace the attribute names before generating the Parquet schema based on 
 these attributes.
 Another possible method might be modifying all aggregation expression names 
 from func(column) to func[column].






[jira] [Commented] (SPARK-3530) Pipeline and Parameters

2015-03-30 Thread Jao Rabary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386747#comment-14386747
 ] 

Jao Rabary commented on SPARK-3530:
---

Yes, the scenario is to instantiate a pre-trained caffe network. The problem 
with the broadcast is that I use a JNI binding of caffe and spark isn't able to 
serialize the object.



 Pipeline and Parameters
 ---

 Key: SPARK-3530
 URL: https://issues.apache.org/jira/browse/SPARK-3530
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical
 Fix For: 1.2.0


 This part of the design doc is for pipelines and parameters. I put the design 
 doc at
 https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
 I will copy the proposed interfaces to this JIRA later. Some sample code can 
 be viewed at: https://github.com/mengxr/spark-ml/
 Please help review the design and post your comments here. Thanks!






[jira] [Commented] (SPARK-6517) Implement the Algorithm of Hierarchical Clustering

2015-03-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386707#comment-14386707
 ] 

Apache Spark commented on SPARK-6517:
-

User 'yu-iskw' has created a pull request for this issue:
https://github.com/apache/spark/pull/5267

 Implement the Algorithm of Hierarchical Clustering
 --

 Key: SPARK-6517
 URL: https://issues.apache.org/jira/browse/SPARK-6517
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Yu Ishikawa








[jira] [Assigned] (SPARK-6517) Implement the Algorithm of Hierarchical Clustering

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6517:
---

Assignee: Apache Spark

 Implement the Algorithm of Hierarchical Clustering
 --

 Key: SPARK-6517
 URL: https://issues.apache.org/jira/browse/SPARK-6517
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Yu Ishikawa
Assignee: Apache Spark








[jira] [Assigned] (SPARK-6517) Implement the Algorithm of Hierarchical Clustering

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6517:
---

Assignee: (was: Apache Spark)

 Implement the Algorithm of Hierarchical Clustering
 --

 Key: SPARK-6517
 URL: https://issues.apache.org/jira/browse/SPARK-6517
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Yu Ishikawa








[jira] [Updated] (SPARK-6610) MLlib explicit ALS checkpoint does not work

2015-03-30 Thread lisendong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lisendong updated SPARK-6610:
-
Summary: MLlib explicit ALS checkpoint does not work  (was: explicit 
checkpoint does not work)

 MLlib explicit ALS checkpoint does not work
 

 Key: SPARK-6610
 URL: https://issues.apache.org/jira/browse/SPARK-6610
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.3.0
Reporter: lisendong
Priority: Critical
  Labels: spark
 Fix For: 1.3.0

   Original Estimate: 24h
  Remaining Estimate: 24h

 https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
 From the code, I found that the explicit-feedback ALS does not call checkpoint().






[jira] [Commented] (SPARK-6517) Implement the Algorithm of Hierarchical Clustering

2015-03-30 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386756#comment-14386756
 ] 

Yu Ishikawa commented on SPARK-6517:


Hi [~freeman-lab] and [~rnowling],
Would you review the PR when you have time? The basic idea has not changed. The 
main difference between the new version and the old one is the parallel 
processing of dividing clusters. The old version divides the leaf nodes of a 
cluster tree separately, whereas the new version divides leaf clusters 
simultaneously and then builds a cluster tree. As a result of this modification 
it is 1000x faster than the old one.
Thanks

 Implement the Algorithm of Hierarchical Clustering
 --

 Key: SPARK-6517
 URL: https://issues.apache.org/jira/browse/SPARK-6517
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Yu Ishikawa








[jira] [Resolved] (SPARK-6610) MLlib explicit ALS checkpoint does not work

2015-03-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6610.
--
  Resolution: Duplicate
Target Version/s:   (was: 1.3.1)

 MLlib explicit ALS checkpoint does not work
 

 Key: SPARK-6610
 URL: https://issues.apache.org/jira/browse/SPARK-6610
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.3.0
Reporter: lisendong
Priority: Critical
  Labels: spark
   Original Estimate: 24h
  Remaining Estimate: 24h

 https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
 From the code, I found that the explicit-feedback ALS does not call checkpoint().






[jira] [Commented] (SPARK-3276) Provide an API to specify whether the old files need to be ignored in file input text DStream

2015-03-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386775#comment-14386775
 ] 

Emre Sevinç commented on SPARK-3276:


Any plans to make the private val {{FileInputDStream.MIN_REMEMBER_DURATION}} 
configurable via some API?

It seems to be hard-coded as 1 minute in 
https://github.com/apache/spark/blob/branch-1.2/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L325,
 and this leads to files older than 1 minute not being processed.
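
For reference, here is a minimal sketch (Scala, with a hypothetical input 
directory) of the existing knob: {{fileStream}} already takes a 
{{newFilesOnly}} flag, but even with {{newFilesOnly = false}} files older than 
{{MIN_REMEMBER_DURATION}} are still skipped, which is exactly the limitation 
above.

{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setAppName("old-files-sketch").setMaster("local[2]"), Seconds(10))

// Ask for pre-existing files as well; files older than MIN_REMEMBER_DURATION
// (hard-coded to 1 minute at the moment) are still dropped by findNewFiles.
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "hdfs:///tmp/input",      // hypothetical input directory
  (path: Path) => true,     // accept every file name
  newFilesOnly = false
).map(_._2.toString)

lines.print()
ssc.start()
ssc.awaitTermination()
{code}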

 Provide a API to specify whether the old files need to be ignored in file 
 input text DStream
 

 Key: SPARK-3276
 URL: https://issues.apache.org/jira/browse/SPARK-3276
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Jack Hu
Priority: Minor

 Currently, there is only one API, textFileStream in StreamingContext, to 
 specify a text file dstream, and it always ignores old files. Sometimes the 
 old files are still useful.
 We need an API to let the user choose whether old files should be ignored or 
 not.
 The API currently in StreamingContext:
 def textFileStream(directory: String): DStream[String] = {
 fileStream[LongWritable, Text, 
 TextInputFormat](directory).map(_._2.toString)
   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time

2015-03-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386774#comment-14386774
 ] 

Emre Sevinç commented on SPARK-6061:


Any plans to make the private val {{FileInputDStream.MIN_REMEMBER_DURATION}} 
configurable via some API?

It seems to be hard-coded as 1 minute in 
https://github.com/apache/spark/blob/branch-1.2/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L325,
 and this leads to files older than 1 minute not being processed.

 File source dstream can not include the old file which timestamp is before 
 the system time
 --

 Key: SPARK-6061
 URL: https://issues.apache.org/jira/browse/SPARK-6061
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.1
Reporter: Jack Hu
  Labels: FileSourceDStream, OlderFiles, Streaming
   Original Estimate: 1m
  Remaining Estimate: 1m

 The file source dstream (StreamingContext.fileStream) has a property named 
 newFilesOnly to include old files. It worked fine with 1.1.0 but is broken in 
 1.2.1: the older files are always ignored, no matter what value is set.
 Here is simple code to reproduce it:
 https://gist.github.com/jhu-chang/1ee5b0788c7479414eeb
 The reason is that the modTimeIgnoreThreshold in 
 FileInputDStream::findNewFiles is set to a time close to the system time (the 
 Spark Streaming clock time), so files older than this time are ignored. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6609) explicit checkpoint does not work

2015-03-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6609:
-
Priority: Minor  (was: Critical)
Target Version/s:   (was: 1.3.0)
   Fix Version/s: (was: 1.3.0)
  Issue Type: Improvement  (was: Bug)

Please don't set Fix Version when the issue is not resolved. This is a 
follow-on to SPARK-5955. It is not a critical bug, since most use cases work 
fine without this.

 explicit checkpoint does not work
 -

 Key: SPARK-6609
 URL: https://issues.apache.org/jira/browse/SPARK-6609
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.3.0
Reporter: lisendong
Priority: Minor
   Original Estimate: 24h
  Remaining Estimate: 24h

 https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
 from the code, I found the explicit feedback ALS does not do checkpoint()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6609) explicit checkpoint does not work

2015-03-30 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386781#comment-14386781
 ] 

Guoqiang Li commented on SPARK-6609:


Should we merge [PR 5076|https://github.com/apache/spark/pull/5076] into 
branch-1.3?

 explicit checkpoint does not work
 -

 Key: SPARK-6609
 URL: https://issues.apache.org/jira/browse/SPARK-6609
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.3.0
Reporter: lisendong
Priority: Minor
   Original Estimate: 24h
  Remaining Estimate: 24h

 https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
 from the code, I found the explicit feedback ALS does not do checkpoint()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6568) spark-shell.cmd --jars option does not accept the jar that has space in its path

2015-03-30 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386552#comment-14386552
 ] 

Steve Loughran commented on SPARK-6568:
---

Can you show the full stack trace?

 spark-shell.cmd --jars option does not accept the jar that has space in its 
 path
 

 Key: SPARK-6568
 URL: https://issues.apache.org/jira/browse/SPARK-6568
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Windows
Affects Versions: 1.3.0
 Environment: Windows 8.1
Reporter: Masayoshi TSUZUKI

 spark-shell.cmd's --jars option does not accept a jar that has a space in its 
 path.
 On Windows, the path of a jar sometimes contains spaces.
 {code}
 bin\spark-shell.cmd --jars "C:\Program Files\some\jar1.jar"
 {code}
 This gets:
 {code}
 Exception in thread "main" java.net.URISyntaxException: Illegal character in 
 path at index 10: C:/Program Files/some/jar1.jar
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6282) Strange Python import error when using random() in a lambda function

2015-03-30 Thread Dmytro Bielievtsov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386564#comment-14386564
 ] 

Dmytro Bielievtsov commented on SPARK-6282:
---

I had the same strange issue when computing rdd's via blaze interface 
(ContinuumIO). As strangely as it appeared, It strangely disappeared after I 
removed 'six*' files from '/usr/lib/python-2.7/dist-packages' and then 
reinstalled 'six' with easy_install-2.7

 Strange Python import error when using random() in a lambda function
 

 Key: SPARK-6282
 URL: https://issues.apache.org/jira/browse/SPARK-6282
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Kubuntu 14.04, Python 2.7.6
Reporter: Pavel Laskov
Priority: Minor

 Consider the exemplary Python code below:
from random import random
from pyspark.context import SparkContext
from xval_mllib import read_csv_file_as_list
 if __name__ == "__main__":
 sc = SparkContext(appName="Random() bug test")
 data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv'))
 #data = sc.parallelize([1, 2, 3, 4, 5], 2)
 d = data.map(lambda x: (random(), x))
 print d.first()
 Data is read from a large CSV file. Running this code results in a Python 
 import error:
 ImportError: No module named _winreg
 If I use 'import random' and 'random.random()' in the lambda function no 
 error occurs. Also no error occurs, for both kinds of import statements, for 
 a small artificial data set like the one shown in a commented line.  
 The full error trace, the source code of csv reading code (function 
 'read_csv_file_as_list' is my own) as well as a sample dataset (the original 
 dataset is about 8M large) can be provided. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2348) In Windows having a enviorinment variable named 'classpath' gives error

2015-03-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2348.
--
Resolution: Not a Problem

 In Windows having a enviorinment variable named 'classpath' gives error
 ---

 Key: SPARK-2348
 URL: https://issues.apache.org/jira/browse/SPARK-2348
 Project: Spark
  Issue Type: Bug
  Components: Windows
Affects Versions: 1.0.0
 Environment: Windows 7 Enterprise
Reporter: Chirag Todarka
Assignee: Chirag Todarka

 Operating System: Windows 7 Enterprise
 If an environment variable named 'classpath' is set, then starting 
 'spark-shell' gives the error below:
 mydir\spark\bin>spark-shell
 Failed to initialize compiler: object scala.runtime in compiler mirror not 
 found
 .
 ** Note that as of 2.8 scala does not assume use of the java classpath.
 ** For the old behavior pass -usejavacp to scala, or if using a Settings
 ** object programatically, settings.usejavacp.value = true.
 14/07/02 14:22:06 WARN SparkILoop$SparkILoopInterpreter: Warning: compiler 
 acces
 sed before init set up.  Assuming no postInit code.
 Failed to initialize compiler: object scala.runtime in compiler mirror not 
 found
 .
 ** Note that as of 2.8 scala does not assume use of the java classpath.
 ** For the old behavior pass -usejavacp to scala, or if using a Settings
 ** object programatically, settings.usejavacp.value = true.
 Exception in thread "main" java.lang.AssertionError: assertion failed: null
 at scala.Predef$.assert(Predef.scala:179)
 at 
 org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.sca
 la:202)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(Spar
 kILoop.scala:929)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.
 scala:884)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.
 scala:884)
 at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClass
 Loader.scala:135)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
 at org.apache.spark.repl.Main$.main(Main.scala:31)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
 at java.lang.reflect.Method.invoke(Unknown Source)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6605) Same transformation in DStream leads to different result

2015-03-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386568#comment-14386568
 ] 

Sean Owen commented on SPARK-6605:
--

Thanks, that's very useful. I think the behavior is expected, but it's not 
obvious. I assume you are printing the RDD from a window with no data.

Both are giving the same answer in that both show no count, or 0, for every 
key. The second example just has an explicit 0 in two cases instead of all 
implicit 0. The more expected answer is the first one -- no results. The first 
version gets that exactly since it re-counts the whole window which has no data.

The second one is the result of the optimization offered by invFunc. It 
correctly finds the count is 0 in the current window for these two keys, but it 
has no notion that a count of 0 is the same as no value at all. You and I know 
that, and you could simply apply a filter() to remove these redundant entries 
if desired.

I'm not sure it's fixable in general without the user being able to supply 
something like a {{(V,V) => Option[V]}} as the {{invFunc}} instead. But it's not 
really getting the wrong answer either.
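
For instance, a hedged sketch of that filter approach (window and slide 
durations are illustrative; {{words}} is the {{DStream[(String, Int)]}} from 
the wordcount example): the {{filterFunc}} parameter of the {{invFunc}} variant 
can drop keys whose count has fallen to 0, so the incremental result lines up 
with the re-counted one.

{code}
import org.apache.spark.streaming.Seconds

val counts = words.reduceByKeyAndWindow(
  _ + _,                     // add counts entering the window
  _ - _,                     // subtract counts leaving the window
  Seconds(30),               // window duration (illustrative)
  Seconds(10),               // slide duration (illustrative)
  numPartitions = 2,
  filterFunc = { case (_, count) => count != 0 }  // drop the explicit zeros
)
{code}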

 Same transformation in DStream leads to different result
 

 Key: SPARK-6605
 URL: https://issues.apache.org/jira/browse/SPARK-6605
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: SaintBacchus
 Fix For: 1.4.0


 The transformation *reduceByKeyAndWindow* has two implementations: one use 
 the *WindowDstream* and the other use *ReducedWindowedDStream*.
 But the result always is the same, except when an empty windows occurs.
 As a wordcount example, if a period of time (larger than window time) has no 
 data coming, the first *reduceByKeyAndWindow*  has no elem inside but the 
 second has many elem with the zero value inside.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6605) Same transformation in DStream leads to different result

2015-03-30 Thread SaintBacchus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386610#comment-14386610
 ] 

SaintBacchus commented on SPARK-6605:
-

Yeah, [~srowen], it's not a wrong answer, just a little different from what we 
expect. It's caused by the two different implementations.
But I'm not sure whether we should fix it to behave like the first case, or let 
users deal with the empty results using *filter*.
If we want to fix it, changing the {{invFunc}} to {{(V,V) => Option\[V\]}} is a 
good idea, or adding a {{filterFunc}} would also be OK for simplicity.

 Same transformation in DStream leads to different result
 

 Key: SPARK-6605
 URL: https://issues.apache.org/jira/browse/SPARK-6605
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: SaintBacchus
 Fix For: 1.4.0


 The transformation *reduceByKeyAndWindow* has two implementations: one use 
 the *WindowDstream* and the other use *ReducedWindowedDStream*.
 But the result always is the same, except when an empty windows occurs.
 As a wordcount example, if a period of time (larger than window time) has no 
 data coming, the first *reduceByKeyAndWindow*  has no elem inside but the 
 second has many elem with the zero value inside.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6603) SQLContext.registerFunction - SQLContext.udf.register

2015-03-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6603:
-
Component/s: SQL

 SQLContext.registerFunction - SQLContext.udf.register
 --

 Key: SPARK-6603
 URL: https://issues.apache.org/jira/browse/SPARK-6603
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Davies Liu

 We didn't change the Python implementation to use that. Maybe the best 
 strategy is to deprecate SQLContext.registerFunction, and just add 
 SQLContext.udf.register.
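
For reference, a minimal Scala-side sketch of the {{udf.register}} entry point 
(the Python change tracked here would mirror it); the app name, data and UDF 
are made up:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(
  new SparkConf().setAppName("udf-register-sketch").setMaster("local[2]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Preferred registration point:
sqlContext.udf.register("strLen", (s: String) => s.length)

val words = sc.parallelize(Seq(Tuple1("spark"), Tuple1("sql"))).toDF("word")
words.registerTempTable("words")
sqlContext.sql("SELECT word, strLen(word) FROM words").show()
{code}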



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5750) Document that ordering of elements in shuffled partitions is not deterministic across runs

2015-03-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5750:
-
Priority: Minor  (was: Major)

 Document that ordering of elements in shuffled partitions is not 
 deterministic across runs
 --

 Key: SPARK-5750
 URL: https://issues.apache.org/jira/browse/SPARK-5750
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Josh Rosen
Assignee: Ilya Ganelin
Priority: Minor
 Fix For: 1.3.1, 1.4.0


 The ordering of elements in shuffled partitions is not deterministic across 
 runs.  For instance, consider the following example:
 {code}
 val largeFiles = sc.textFile(...)
 val airlines = largeFiles.repartition(2000).cache()
 println(airlines.first)
 {code}
 If this code is run twice, then each run will output a different result.  
 There is non-determinism in the shuffle read code that accounts for this:
 Spark's shuffle read path processes blocks as soon as they are fetched  Spark 
 uses 
 [ShuffleBlockFetcherIterator|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala]
  to fetch shuffle data from mappers.  In this code, requests for multiple 
 blocks from the same host are batched together, so nondeterminism in where 
 tasks are run means that the set of requests can vary across runs.  In 
 addition, there's an [explicit 
 call|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L256]
  to randomize the order of the batched fetch requests.  As a result, shuffle 
 operations cannot be guaranteed to produce the same ordering of the elements 
 in their partitions.
 Therefore, Spark should update its docs to clarify that the ordering of 
 elements in shuffle RDDs' partitions is non-deterministic.  Note, however, 
 that the _set_ of elements in each partition will be deterministic: if we 
 used {{mapPartitions}} to sort each partition, then the {{first()}} call 
 above would produce a deterministic result.
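
A short sketch of that last point (the input path is hypothetical): sorting 
within each partition pins down {{first()}} even though the fetch order varies 
across runs.

{code}
val largeFiles = sc.textFile("/data/airlines/*.csv")   // hypothetical input
val airlines = largeFiles.repartition(2000).cache()

// Non-deterministic across runs: depends on the order shuffle blocks arrive.
println(airlines.first())

// Deterministic: the set of lines per partition is stable, so sorting each
// partition before taking an element removes the run-to-run variation.
val sortedWithinPartitions = airlines.mapPartitions(
  iter => iter.toArray.sorted.iterator,
  preservesPartitioning = true)
println(sortedWithinPartitions.first())
{code}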



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5836) Highlight in Spark documentation that by default Spark does not delete its temporary files

2015-03-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5836.
--
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.1
 Assignee: Ilya Ganelin

 Highlight in Spark documentation that by default Spark does not delete its 
 temporary files
 --

 Key: SPARK-5836
 URL: https://issues.apache.org/jira/browse/SPARK-5836
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Tomasz Dudziak
Assignee: Ilya Ganelin
 Fix For: 1.3.1, 1.4.0


 We recently learnt the hard way (in a prod system) that Spark by default does 
 not delete its temporary files until it is stopped. Within a relatively short 
 time span of heavy Spark use the disk of our prod machine filled up 
 completely because of multiple shuffle files written to it. We think there 
 should be better documentation around the fact that after a job is finished 
 it leaves a lot of rubbish behind so that this does not come as a surprise.
 Probably a good place to highlight that fact would be the documentation of 
 {{spark.local.dir}} property, which controls where Spark temporary files are 
 written. 
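
For example, a minimal configuration sketch (the path is hypothetical) that 
points {{spark.local.dir}} at a volume with enough headroom, since shuffle and 
temporary files accumulate there until the application stops:

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("local-dir-sketch")
  .setMaster("local[2]")
  .set("spark.local.dir", "/mnt/bigdisk/spark-tmp")   // hypothetical scratch volume

val sc = new SparkContext(conf)
{code}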



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6607) Aggregation attribute name including special chars '(' and ')' should be replaced before generating Parquet schema

2015-03-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386560#comment-14386560
 ] 

Apache Spark commented on SPARK-6607:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/5263

 Aggregation attribute name including special chars '(' and ')' should be 
 replaced before generating Parquet schema
 --

 Key: SPARK-6607
 URL: https://issues.apache.org/jira/browse/SPARK-6607
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh

 '(' and ')' are special characters used in Parquet schema for type 
 annotation. When we run an aggregation query, we will obtain attribute name 
 such as MAX(a).
 If we directly store the generated DataFrame as a Parquet file, this causes a 
 failure when reading and parsing the stored schema string.
 Several methods can be adopted to solve this. This PR uses the simplest one: 
 just replace the attribute names before generating the Parquet schema based on 
 these attributes.
 Another possible method might be modifying all aggregation expression names 
 from func(column) to func[column].
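
A hedged user-side sketch of the symptom and an interim workaround (the data 
and output path are made up): rename the generated aggregate column, whose name 
contains '(' and ')', before writing Parquet.

{code}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.max

val sqlContext = new SQLContext(sc)   // assumes an existing SparkContext sc
import sqlContext.implicits._

val df = sc.parallelize(Seq(("k1", 1.0), ("k1", 2.0), ("k2", 3.0))).toDF("key", "a")

val aggregated = df.groupBy("key").agg(max("a"))            // column named "MAX(a)" (per the description)
val safe = aggregated.withColumnRenamed("MAX(a)", "max_a")  // no special characters
safe.saveAsParquetFile("/tmp/agg.parquet")                  // hypothetical output path
{code}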



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6607) Aggregation attribute name including special chars '(' and ')' should be replaced before generating Parquet schema

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6607:
---

Assignee: (was: Apache Spark)

 Aggregation attribute name including special chars '(' and ')' should be 
 replaced before generating Parquet schema
 --

 Key: SPARK-6607
 URL: https://issues.apache.org/jira/browse/SPARK-6607
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh

 '(' and ')' are special characters used in Parquet schema for type 
 annotation. When we run an aggregation query, we will obtain attribute name 
 such as MAX(a).
 If we directly store the generated DataFrame as a Parquet file, this causes a 
 failure when reading and parsing the stored schema string.
 Several methods can be adopted to solve this. This PR uses the simplest one: 
 just replace the attribute names before generating the Parquet schema based on 
 these attributes.
 Another possible method might be modifying all aggregation expression names 
 from func(column) to func[column].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6607) Aggregation attribute name including special chars '(' and ')' should be replaced before generating Parquet schema

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6607:
---

Assignee: Apache Spark

 Aggregation attribute name including special chars '(' and ')' should be 
 replaced before generating Parquet schema
 --

 Key: SPARK-6607
 URL: https://issues.apache.org/jira/browse/SPARK-6607
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh
Assignee: Apache Spark

 '(' and ')' are special characters used in Parquet schema for type 
 annotation. When we run an aggregation query, we will obtain attribute name 
 such as MAX(a).
 If we directly store the generated DataFrame as a Parquet file, this causes a 
 failure when reading and parsing the stored schema string.
 Several methods can be adopted to solve this. This PR uses the simplest one: 
 just replace the attribute names before generating the Parquet schema based on 
 these attributes.
 Another possible method might be modifying all aggregation expression names 
 from func(column) to func[column].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-799) Windows versions of the deploy scripts

2015-03-30 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386571#comment-14386571
 ] 

Steve Loughran commented on SPARK-799:
--

Providing Python versions of the launcher scripts is probably a better approach 
to supporting Windows than .cmd or .ps1 files:

# {{cmd}} is a painfully dated shell language, as you note.
# Powershell is better, but not widely known in the java/scala dev space.
# Neither .ps1 nor .cmd files can be tested except on Windows.

Python is cross-platform-ish enough that it can be tested on Unix systems too, 
and is more likely to be maintained. It's not seamlessly cross-platform; for 
example, propagating stdout/stderr from spawned Java processes takes extra 
care. It also provides the option of becoming the Unix entry point (the bash 
script simply invoking it), so that maintenance effort is shared and testing 
becomes even more implicit.

 Windows versions of the deploy scripts
 --

 Key: SPARK-799
 URL: https://issues.apache.org/jira/browse/SPARK-799
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Windows
Reporter: Matei Zaharia
  Labels: Starter

 Although the Spark daemons run fine on Windows with run.cmd, the deploy 
 scripts (bin/start-all.sh and such) don't do so unless you have Cygwin. It 
 would be nice to make .cmd versions of those.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6282) Strange Python import error when using random() in a lambda function

2015-03-30 Thread Dmytro Bielievtsov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386564#comment-14386564
 ] 

Dmytro Bielievtsov edited comment on SPARK-6282 at 3/30/15 11:18 AM:
-

I had the same strange issue when computing rdd's via blaze interface 
(ContinuumIO). As strangely as it appeared, It disappeared after I removed 
'six*' files from '/usr/lib/python-2.7/dist-packages' and then reinstalled 
'six' with easy_install-2.7


was (Author: dmytro):
I had the same strange issue when computing rdd's via blaze interface 
(ContinuumIO). As strangely as it appeared, It strangely disappeared after I 
removed 'six*' files from '/usr/lib/python-2.7/dist-packages' and then 
reinstalled 'six' with easy_install-2.7

 Strange Python import error when using random() in a lambda function
 

 Key: SPARK-6282
 URL: https://issues.apache.org/jira/browse/SPARK-6282
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Kubuntu 14.04, Python 2.7.6
Reporter: Pavel Laskov
Priority: Minor

 Consider the exemplary Python code below:
from random import random
from pyspark.context import SparkContext
from xval_mllib import read_csv_file_as_list
 if __name__ == "__main__":
 sc = SparkContext(appName="Random() bug test")
 data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv'))
 #data = sc.parallelize([1, 2, 3, 4, 5], 2)
 d = data.map(lambda x: (random(), x))
 print d.first()
 Data is read from a large CSV file. Running this code results in a Python 
 import error:
 ImportError: No module named _winreg
 If I use 'import random' and 'random.random()' in the lambda function no 
 error occurs. Also no error occurs, for both kinds of import statements, for 
 a small artificial data set like the one shown in a commented line.  
 The full error trace, the source code of csv reading code (function 
 'read_csv_file_as_list' is my own) as well as a sample dataset (the original 
 dataset is about 8M large) can be provided. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop

2015-03-30 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386585#comment-14386585
 ] 

Steve Loughran commented on SPARK-2356:
---

It's coming from {{UserGroupInformation.setConfiguration(conf)}}; UGI uses 
Hadoop's {{StringUtils}} to do something, which then initializes a static 
variable:

{code}
public static final Pattern ENV_VAR_PATTERN = Shell.WINDOWS ?
WIN_ENV_VAR_PATTERN : SHELL_ENV_VAR_PATTERN;
{code}

And Hadoop's util {{Shell}} does some work when it is initialized, which 
depends on winutils.exe being on the path.

Convoluted, but there you go. HADOOP-11293 proposes factoring the 
{{Shell.Windows}} code out into something standalone... if that can be pushed 
into Hadoop 2.8, then this problem will go away from then on.
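
As a stop-gap for local Windows runs (not a fix for the underlying coupling), a 
hedged sketch: Hadoop's {{Shell}} resolves winutils.exe from {{HADOOP_HOME}} or 
the {{hadoop.home.dir}} system property, so pointing one of them at a directory 
that contains {{bin\winutils.exe}} avoids the exception. The path below is 
hypothetical.

{code}
// Must run before the first SparkContext / UGI call triggers Shell's initialization.
System.setProperty("hadoop.home.dir", "C:\\hadoop")  // expects C:\hadoop\bin\winutils.exe

val sc = new org.apache.spark.SparkContext("local[2]", "winutils-workaround")
{code}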

 Exception: Could not locate executable null\bin\winutils.exe in the Hadoop 
 ---

 Key: SPARK-2356
 URL: https://issues.apache.org/jira/browse/SPARK-2356
 Project: Spark
  Issue Type: Bug
  Components: Windows
Affects Versions: 1.0.0
Reporter: Kostiantyn Kudriavtsev
Priority: Critical

 I'm trying to run some transformation on Spark, it works fine on cluster 
 (YARN, linux machines). However, when I'm trying to run it on local machine 
 (Windows 7) under unit test, I got errors (I don't use Hadoop, I'm read file 
 from local filesystem):
 {code}
 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the 
 hadoop binary path
 java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
 Hadoop binaries.
   at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
   at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
   at org.apache.hadoop.util.Shell.clinit(Shell.java:326)
   at org.apache.hadoop.util.StringUtils.clinit(StringUtils.java:76)
   at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
   at org.apache.hadoop.security.Groups.init(Groups.java:77)
   at 
 org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
   at 
 org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
   at 
 org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.init(SparkHadoopUtil.scala:36)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:109)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala)
   at org.apache.spark.SparkContext.init(SparkContext.scala:228)
   at org.apache.spark.SparkContext.init(SparkContext.scala:97)
 {code}
 It's happened because Hadoop config is initialized each time when spark 
 context is created regardless is hadoop required or not.
 I propose to add some special flag to indicate if hadoop config is required 
 (or start this configuration manually)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6598) Python API for IDFModel

2015-03-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386619#comment-14386619
 ] 

Apache Spark commented on SPARK-6598:
-

User 'Lewuathe' has created a pull request for this issue:
https://github.com/apache/spark/pull/5264

 Python API for IDFModel
 ---

 Key: SPARK-6598
 URL: https://issues.apache.org/jira/browse/SPARK-6598
 Project: Spark
  Issue Type: Task
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Kai Sasaki
Priority: Minor
  Labels: MLLib,, Python

 This is the sub-task of 
 [SPARK-6254|https://issues.apache.org/jira/browse/SPARK-6254].
 Wrap IDFModel {{idf}} member function for pyspark.
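
For reference, a small Scala-side sketch of the member being exposed (the data 
is illustrative):

{code}
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.Vectors

val docs = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 2.0),
  Vectors.dense(0.0, 1.0, 3.0)))

val model = new IDF().fit(docs)   // IDFModel
println(model.idf)                // the idf vector this task would expose to PySpark
{code}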



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6598) Python API for IDFModel

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6598:
---

Assignee: (was: Apache Spark)

 Python API for IDFModel
 ---

 Key: SPARK-6598
 URL: https://issues.apache.org/jira/browse/SPARK-6598
 Project: Spark
  Issue Type: Task
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Kai Sasaki
Priority: Minor
  Labels: MLLib,, Python

 This is the sub-task of 
 [SPARK-6254|https://issues.apache.org/jira/browse/SPARK-6254].
 Wrap IDFModel {{idf}} member function for pyspark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6598) Python API for IDFModel

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6598:
---

Assignee: Apache Spark

 Python API for IDFModel
 ---

 Key: SPARK-6598
 URL: https://issues.apache.org/jira/browse/SPARK-6598
 Project: Spark
  Issue Type: Task
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Kai Sasaki
Assignee: Apache Spark
Priority: Minor
  Labels: MLLib,, Python

 This is the sub-task of 
 [SPARK-6254|https://issues.apache.org/jira/browse/SPARK-6254].
 Wrap IDFModel {{idf}} member function for pyspark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6189) Pandas to DataFrame conversion should check field names for periods

2015-03-30 Thread Kalle Jepsen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386625#comment-14386625
 ] 

Kalle Jepsen commented on SPARK-6189:
-

I do not really understand why the column names have to be accessible directly 
as attributes anyway. What advantage does this yield over indexing? It 
basically restricts us to the ASCII character set for column names, doesn't it? 
Data in the wild may have all kinds of weird field names, including special 
characters, umlauts, accents and whatnot. Automatic renaming isn't very nice 
either, for the very reason already pointed out by mgdadv. Also, we cannot 
simply replace all illegal characters with underscores: the fields {{'ä.ö'}} 
and {{'ä.ü'}} would both be renamed to {{'___'}}. Besides, leading underscores 
have a somewhat special meaning in Python, potentially resulting in further 
confusion.

I think {{df\['a.b'\]}} should definitely work, even if the columns contain 
non-ASCII characters, and a warning should be issued when creating the 
DataFrame, informing the user that direct column access via attribute name will 
not work with the given column names.

 Pandas to DataFrame conversion should check field names for periods
 ---

 Key: SPARK-6189
 URL: https://issues.apache.org/jira/browse/SPARK-6189
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor

 Issue I ran into:  I imported an R dataset in CSV format into a Pandas 
 DataFrame and then use toDF() to convert that into a Spark DataFrame.  The R 
 dataset had a column with a period in it (column GNP.deflator in the 
 longley dataset).  When I tried to select it using the Spark DataFrame DSL, 
 I could not because the DSL thought the period was selecting a field within 
 GNP.
 Also, since GNP is another field's name, it gives an error which could be 
 obscure to users, complaining:
 {code}
 org.apache.spark.sql.AnalysisException: GetField is not valid on fields of 
 type DoubleType;
 {code}
 We should either handle periods in column names or check during loading and 
 warn/fail gracefully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6605) Same transformation in DStream leads to different result

2015-03-30 Thread SaintBacchus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386540#comment-14386540
 ] 

SaintBacchus commented on SPARK-6605:
-

Hi [~srowen], my test code is this :
{code:title=test.scala|borderStyle=solid}
val words = ssc.socketTextStream(sIp, sPort).flatMap(_.split(" ")).map(x => (x, 1))
val resultWindow3 = words.reduceByKeyAndWindow((a: Int, b: Int) => (a + b), 
Seconds(winDur), Seconds(slideDur))
val resultWindow4 = words.reduceByKeyAndWindow(_ + _, _ - _, 
Seconds(winDur), Seconds(slideDur))
{code}
This *resultWindow3* is implemented by
{code:title=PairDStreamFunctions.scala|borderStyle=solid}
val cleanedReduceFunc = ssc.sc.clean(reduceFunc)
self.reduceByKey(cleanedReduceFunc, partitioner)
.window(windowDuration, slideDuration)
.reduceByKey(cleanedReduceFunc, partitioner)
{code}
And *resultWindow4* is implemented by
{code:title=PairDStreamFunctions.scala|borderStyle=solid}
val cleanedReduceFunc = ssc.sc.clean(reduceFunc)
val cleanedInvReduceFunc = ssc.sc.clean(invReduceFunc)
val cleanedFilterFunc = if (filterFunc != null) 
Some(ssc.sc.clean(filterFunc)) else None
new ReducedWindowedDStream[K, V](
  self, cleanedReduceFunc, cleanedInvReduceFunc, cleanedFilterFunc,
  windowDuration, slideDuration, partitioner
)
{code}
The result of this test code is:
{quote}
= resultWindow3 is:
= resultWindow4 is:
(hello,0)
(world,0)
{quote}
*resultWindow3* is empty but *resultWindow4* has two elements whose keys were 
received before.


 Same transformation in DStream leads to different result
 

 Key: SPARK-6605
 URL: https://issues.apache.org/jira/browse/SPARK-6605
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: SaintBacchus
 Fix For: 1.4.0


 The transformation *reduceByKeyAndWindow* has two implementations: one use 
 the *WindowDstream* and the other use *ReducedWindowedDStream*.
 But the result always is the same, except when an empty windows occurs.
 As a wordcount example, if a period of time (larger than window time) has no 
 data coming, the first *reduceByKeyAndWindow*  has no elem inside but the 
 second has many elem with the zero value inside.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3441) Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle

2015-03-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386541#comment-14386541
 ] 

Sean Owen commented on SPARK-3441:
--

This is mentioned in the change for https://github.com/apache/spark/pull/5074, 
but I think the work here is to explain the rationale and the partitioner 
details more deeply in the scaladoc.
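
For whoever picks this up, a small runnable sketch of the pattern in question 
(the data is illustrative):

{code}
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("a", 3), ("c", 4)))
val part = new HashPartitioner(2)

// Keys are partitioned by `part` and arrive key-sorted within each partition,
// much like values reaching a Hadoop reducer.
val perPartition = pairs
  .repartitionAndSortWithinPartitions(part)
  .mapPartitions(iter => iter.map { case (k, v) => s"$k -> $v" })

perPartition.collect().foreach(println)
{code}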

 Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style 
 shuffle
 ---

 Key: SPARK-3441
 URL: https://issues.apache.org/jira/browse/SPARK-3441
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Patrick Wendell
Assignee: Sandy Ryza

 I think it would be good to say something like this in the doc for 
 repartitionAndSortWithinPartitions and add also maybe in the doc for groupBy:
 {code}
 This can be used to enact a Hadoop Style shuffle along with a call to 
 mapPartitions, e.g.:
rdd.repartitionAndSortWithinPartitions(part).mapPartitions(...)
 {code}
 It might also be nice to add a version that doesn't take a partitioner and/or 
 to mention this in the groupBy javadoc. I guess it depends a bit whether we 
 consider this to be an API we want people to use more widely or whether we 
 just consider it a narrow stable API mostly for Hive-on-Spark. If we want 
 people to consider this API when porting workloads from Hadoop, then it might 
 be worth documenting better.
 What do you think [~rxin] and [~matei]?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5836) Highlight in Spark documentation that by default Spark does not delete its temporary files

2015-03-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5836:
-
Priority: Minor  (was: Major)

 Highlight in Spark documentation that by default Spark does not delete its 
 temporary files
 --

 Key: SPARK-5836
 URL: https://issues.apache.org/jira/browse/SPARK-5836
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Tomasz Dudziak
Assignee: Ilya Ganelin
Priority: Minor
 Fix For: 1.3.1, 1.4.0


 We recently learnt the hard way (in a prod system) that Spark by default does 
 not delete its temporary files until it is stopped. Within a relatively short 
 time span of heavy Spark use the disk of our prod machine filled up 
 completely because of multiple shuffle files written to it. We think there 
 should be better documentation around the fact that after a job is finished 
 it leaves a lot of rubbish behind so that this does not come as a surprise.
 Probably a good place to highlight that fact would be the documentation of 
 {{spark.local.dir}} property, which controls where Spark temporary files are 
 written. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6282) Strange Python import error when using random() in a lambda function

2015-03-30 Thread Dmytro Bielievtsov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386564#comment-14386564
 ] 

Dmytro Bielievtsov edited comment on SPARK-6282 at 3/30/15 11:18 AM:
-

I had the same strange issue when computing rdd's via blaze interface 
(ContinuumIO). As strangely as it appeared, It disappeared after I removed 
'six*' files from '/usr/lib/python2.7/dist-packages' and then reinstalled 'six' 
with easy_install-2.7


was (Author: dmytro):
I had the same strange issue when computing rdd's via blaze interface 
(ContinuumIO). As strangely as it appeared, It disappeared after I removed 
'six*' files from '/usr/lib/python-2.7/dist-packages' and then reinstalled 
'six' with easy_install-2.7

 Strange Python import error when using random() in a lambda function
 

 Key: SPARK-6282
 URL: https://issues.apache.org/jira/browse/SPARK-6282
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Kubuntu 14.04, Python 2.7.6
Reporter: Pavel Laskov
Priority: Minor

 Consider the exemplary Python code below:
from random import random
from pyspark.context import SparkContext
from xval_mllib import read_csv_file_as_list
 if __name__ == "__main__":
 sc = SparkContext(appName="Random() bug test")
 data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv'))
 #data = sc.parallelize([1, 2, 3, 4, 5], 2)
 d = data.map(lambda x: (random(), x))
 print d.first()
 Data is read from a large CSV file. Running this code results in a Python 
 import error:
 ImportError: No module named _winreg
 If I use 'import random' and 'random.random()' in the lambda function no 
 error occurs. Also no error occurs, for both kinds of import statements, for 
 a small artificial data set like the one shown in a commented line.  
 The full error trace, the source code of csv reading code (function 
 'read_csv_file_as_list' is my own) as well as a sample dataset (the original 
 dataset is about 8M large) can be provided. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4226) SparkSQL - Add support for subqueries in predicates

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4226:
---

Assignee: (was: Apache Spark)

 SparkSQL - Add support for subqueries in predicates
 ---

 Key: SPARK-4226
 URL: https://issues.apache.org/jira/browse/SPARK-4226
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
 Environment: Spark 1.2 snapshot
Reporter: Terry Siu

 I have a test table defined in Hive as follows:
 CREATE TABLE sparkbug (
   id INT,
   event STRING
 ) STORED AS PARQUET;
 and insert some sample data with ids 1, 2, 3.
 In a Spark shell, I then create a HiveContext and then execute the following 
 HQL to test out subquery predicates:
 val hc = new HiveContext(sc)
 hc.hql("select customerid from sparkbug where customerid in (select 
 customerid from sparkbug where customerid in (2,3))")
 I get the following error:
 java.lang.RuntimeException: Unsupported language features in query: select 
 customerid from sparkbug where customerid in (select customerid from sparkbug 
 where customerid in (2,3))
 TOK_QUERY
   TOK_FROM
 TOK_TABREF
   TOK_TABNAME
 sparkbug
   TOK_INSERT
 TOK_DESTINATION
   TOK_DIR
 TOK_TMP_FILE
 TOK_SELECT
   TOK_SELEXPR
 TOK_TABLE_OR_COL
   customerid
 TOK_WHERE
   TOK_SUBQUERY_EXPR
 TOK_SUBQUERY_OP
   in
 TOK_QUERY
   TOK_FROM
 TOK_TABREF
   TOK_TABNAME
 sparkbug
   TOK_INSERT
 TOK_DESTINATION
   TOK_DIR
 TOK_TMP_FILE
 TOK_SELECT
   TOK_SELEXPR
 TOK_TABLE_OR_COL
   customerid
 TOK_WHERE
   TOK_FUNCTION
 in
 TOK_TABLE_OR_COL
   customerid
 2
 3
 TOK_TABLE_OR_COL
   customerid
 scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
 TOK_SUBQUERY_EXPR :
 TOK_SUBQUERY_EXPR
   TOK_SUBQUERY_OP
 in
   TOK_QUERY
 TOK_FROM
   TOK_TABREF
 TOK_TABNAME
   sparkbug
 TOK_INSERT
   TOK_DESTINATION
 TOK_DIR
   TOK_TMP_FILE
   TOK_SELECT
 TOK_SELEXPR
   TOK_TABLE_OR_COL
 customerid
   TOK_WHERE
 TOK_FUNCTION
   in
   TOK_TABLE_OR_COL
 customerid
   2
   3
   TOK_TABLE_OR_COL
 customerid
  +
  
 org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)
 
 at scala.sys.package$.error(package.scala:27)
 at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)
 at 
 org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
 at 
 org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
 at 
 scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
 This thread
 http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html
 also brings up the lack of subquery support in SparkSQL. It would be nice to 
 have subquery predicate support in a near-future release (1.3, maybe?).
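
Until subquery predicates are supported, a hedged workaround sketch is to 
rewrite the IN (subquery) as a LEFT SEMI JOIN, which HiveQL does accept (column 
names follow the DDL above; {{sc}} is an existing SparkContext):

{code}
val hc = new org.apache.spark.sql.hive.HiveContext(sc)

val filtered = hc.sql(
  """SELECT s.id, s.event
    |FROM sparkbug s
    |LEFT SEMI JOIN (SELECT id FROM sparkbug WHERE id IN (2, 3)) t
    |  ON (s.id = t.id)""".stripMargin)

filtered.collect().foreach(println)
{code}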



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6608) Make DataFrame.rdd a lazy val

2015-03-30 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-6608:
-

 Summary: Make DataFrame.rdd a lazy val
 Key: SPARK-6608
 URL: https://issues.apache.org/jira/browse/SPARK-6608
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Priority: Minor


Before 1.3.0, {{SchemaRDD.id}} works as a unique identifier of each 
{{SchemaRDD}}. In 1.3.0, unlike {{SchemaRDD}}, {{DataFrame}} is no longer an 
RDD, and {{DataFrame.rdd}} is actually a function which always returns a new RDD 
instance. Making {{DataFrame.rdd}} a {{lazy val}} should bring the unique 
identifier back.
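
A tiny illustrative sketch of the difference (hypothetical names, not the 
actual Spark source):

{code}
class DataFrameSketch(build: () => org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]) {
  def rddPerCall = build()   // a fresh RDD (and a fresh id) on every access
  lazy val rdd = build()     // built once; every caller sees the same RDD and id
}
{code}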



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5456) Decimal Type comparison issue

2015-03-30 Thread Karthik Gorthi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386628#comment-14386628
 ] 

Karthik Gorthi commented on SPARK-5456:
---

One workaround we followed is to convert all Decimal-typed columns to integer 
(truncated).
But this is not possible when the application needs to connect to third-party 
databases, where the datatype is obviously not under our control.

So this is a serious bug, IMHO, which prohibits using Spark in such cases, 
unless there is a workaround?

 Decimal Type comparison issue
 -

 Key: SPARK-5456
 URL: https://issues.apache.org/jira/browse/SPARK-5456
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0
Reporter: Kuldeep

 Not quite able to figure this out but here is a junit test to reproduce this, 
 in JavaAPISuite.java
 {code:title=DecimalBug.java}
   @Test
   public void decimalQueryTest() {
 List<Row> decimalTable = new ArrayList<Row>();
 decimalTable.add(RowFactory.create(new BigDecimal(1), new BigDecimal(2)));
 decimalTable.add(RowFactory.create(new BigDecimal(3), new BigDecimal(4)));
 JavaRDD<Row> rows = sc.parallelize(decimalTable);
 List<StructField> fields = new ArrayList<StructField>(7);
 fields.add(DataTypes.createStructField("a", DataTypes.createDecimalType(), true));
 fields.add(DataTypes.createStructField("b", DataTypes.createDecimalType(), true));
 sqlContext.applySchema(rows.rdd(), 
 DataTypes.createStructType(fields)).registerTempTable("foo");
 Assert.assertEquals(sqlContext.sql("select * from foo where a > 0").collectAsList(), 
 decimalTable);
   }
 {code}
 Fails with
 java.lang.ClassCastException: java.math.BigDecimal cannot be cast to 
 org.apache.spark.sql.types.Decimal



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6596) fix the instruction on building scaladoc

2015-03-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6596.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5253
[https://github.com/apache/spark/pull/5253]

 fix the instruction on building scaladoc 
 -

 Key: SPARK-6596
 URL: https://issues.apache.org/jira/browse/SPARK-6596
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Nan Zhu
 Fix For: 1.4.0


 In README.md under docs/ directory, it says that 
 You can build just the Spark scaladoc by running build/sbt doc from the 
 SPARK_PROJECT_ROOT directory.
 I guess the right approach is build/sbt unidoc



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6596) fix the instruction on building scaladoc

2015-03-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6596:
-
Priority: Trivial  (was: Major)
Assignee: Nan Zhu

 fix the instruction on building scaladoc 
 -

 Key: SPARK-6596
 URL: https://issues.apache.org/jira/browse/SPARK-6596
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Nan Zhu
Assignee: Nan Zhu
Priority: Trivial
 Fix For: 1.4.0


 In README.md under docs/ directory, it says that 
 You can build just the Spark scaladoc by running build/sbt doc from the 
 SPARK_PROJECT_ROOT directory.
 I guess the right approach is build/sbt unidoc



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6595) DataFrame self joins with MetastoreRelations fail

2015-03-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6595.
---
Resolution: Fixed

Fixed by https://github.com/apache/spark/pull/5251

 DataFrame self joins with MetastoreRelations fail
 -

 Key: SPARK-6595
 URL: https://issues.apache.org/jira/browse/SPARK-6595
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6609) explicit checkpoint does not work

2015-03-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6609.
--
Resolution: Invalid

[~lisendong] Actually, you were not even pointing at the 1.3 branch; SPARK-5955 
did indeed already resolve this. In the future, you probably want to look at 
master, or at least at the branch you say is affected.

 explicit checkpoint does not work
 -

 Key: SPARK-6609
 URL: https://issues.apache.org/jira/browse/SPARK-6609
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.3.0
Reporter: lisendong
Priority: Minor
   Original Estimate: 24h
  Remaining Estimate: 24h

 https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
 from the code, I found the explicit feedback ALS does not do checkpoint()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6602) Replace direct use of Akka with Spark RPC interface

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6602:
---

Assignee: Shixiong Zhu  (was: Apache Spark)

 Replace direct use of Akka with Spark RPC interface
 ---

 Key: SPARK-6602
 URL: https://issues.apache.org/jira/browse/SPARK-6602
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Shixiong Zhu





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6602) Replace direct use of Akka with Spark RPC interface

2015-03-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386875#comment-14386875
 ] 

Apache Spark commented on SPARK-6602:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/5268

 Replace direct use of Akka with Spark RPC interface
 ---

 Key: SPARK-6602
 URL: https://issues.apache.org/jira/browse/SPARK-6602
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Shixiong Zhu





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6602) Replace direct use of Akka with Spark RPC interface

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6602:
---

Assignee: Apache Spark  (was: Shixiong Zhu)

 Replace direct use of Akka with Spark RPC interface
 ---

 Key: SPARK-6602
 URL: https://issues.apache.org/jira/browse/SPARK-6602
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Apache Spark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4123) Show dependency changes in pull requests

2015-03-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386888#comment-14386888
 ] 

Apache Spark commented on SPARK-4123:
-

User 'brennonyork' has created a pull request for this issue:
https://github.com/apache/spark/pull/5269

 Show dependency changes in pull requests
 

 Key: SPARK-4123
 URL: https://issues.apache.org/jira/browse/SPARK-4123
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: Brennon York
Priority: Critical
 Fix For: 1.4.0


 We should inspect the classpath of Spark's assembly jar for every pull 
 request. This only takes a few seconds in Maven, and it will help surface 
 dependency changes relative to the master branch. Ideally we'd post any 
 dependency changes in the pull request message.
 {code}
 $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v "INFO" | tr ":" "\n" | awk -F/ '{print $NF}' | sort > my-classpath
 $ git checkout apache/master
 $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v "INFO" | tr ":" "\n" | awk -F/ '{print $NF}' | sort > master-classpath
 $ diff my-classpath master-classpath
 < chill-java-0.3.6.jar
 < chill_2.10-0.3.6.jar
 ---
 > chill-java-0.5.0.jar
 > chill_2.10-0.5.0.jar
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5990) Model import/export for IsotonicRegression

2015-03-30 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386936#comment-14386936
 ] 

Yanbo Liang commented on SPARK-5990:


[~josephkb] Could you assign this to me?

 Model import/export for IsotonicRegression
 --

 Key: SPARK-5990
 URL: https://issues.apache.org/jira/browse/SPARK-5990
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Add save/load for IsotonicRegressionModel
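
To make the request concrete, here is a rough sketch of what the API would presumably look like if it follows the existing mllib Saveable/Loader convention; the trained `model`, the SparkContext `sc`, and the path are all assumed.

{code}
import org.apache.spark.mllib.regression.IsotonicRegressionModel

// Assuming `model` is a trained IsotonicRegressionModel and `sc` is the SparkContext,
// the usual Saveable/Loader pattern would be:
model.save(sc, "/models/isotonic-v1")                               // hypothetical path
val restored = IsotonicRegressionModel.load(sc, "/models/isotonic-v1")
{code}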



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4226) SparkSQL - Add support for subqueries in predicates

2015-03-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-4226:
--
Description: 
I have a test table defined in Hive as follows:
{code:sql}
CREATE TABLE sparkbug (
  id INT,
  event STRING
) STORED AS PARQUET;
{code}
and insert some sample data with ids 1, 2, 3.

In a Spark shell, I then create a HiveContext and then execute the following 
HQL to test out subquery predicates:
{code}
val hc = new HiveContext(sc)
hc.hql("select customerid from sparkbug where customerid in (select customerid from sparkbug where customerid in (2,3))")
{code}
I get the following error:
{noformat}
java.lang.RuntimeException: Unsupported language features in query: select 
customerid from sparkbug where customerid in (select customerid from sparkbug 
where customerid in (2,3))
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
sparkbug
  TOK_INSERT
TOK_DESTINATION
  TOK_DIR
TOK_TMP_FILE
TOK_SELECT
  TOK_SELEXPR
TOK_TABLE_OR_COL
  customerid
TOK_WHERE
  TOK_SUBQUERY_EXPR
TOK_SUBQUERY_OP
  in
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
sparkbug
  TOK_INSERT
TOK_DESTINATION
  TOK_DIR
TOK_TMP_FILE
TOK_SELECT
  TOK_SELEXPR
TOK_TABLE_OR_COL
  customerid
TOK_WHERE
  TOK_FUNCTION
in
TOK_TABLE_OR_COL
  customerid
2
3
TOK_TABLE_OR_COL
  customerid

scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
TOK_SUBQUERY_EXPR :
TOK_SUBQUERY_EXPR
  TOK_SUBQUERY_OP
in
  TOK_QUERY
TOK_FROM
  TOK_TABREF
TOK_TABNAME
  sparkbug
TOK_INSERT
  TOK_DESTINATION
TOK_DIR
  TOK_TMP_FILE
  TOK_SELECT
TOK_SELEXPR
  TOK_TABLE_OR_COL
customerid
  TOK_WHERE
TOK_FUNCTION
  in
  TOK_TABLE_OR_COL
customerid
  2
  3
  TOK_TABLE_OR_COL
customerid
 +
 
org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)

at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
{noformat}
[This 
thread|http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html]
 also brings up the lack of subquery support in Spark SQL. It would be nice to have 
subquery predicate support in a near-future release (1.3, maybe?).
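
Until subquery predicates are supported, one possible workaround (a hedged sketch only, reusing the reporter's table and column names from the example above) is to express the IN (subquery) filter as a LEFT SEMI JOIN, which the HiveQL parser does handle:

{code}
val hc = new org.apache.spark.sql.hive.HiveContext(sc)

// Same intent as: ... WHERE customerid IN (SELECT customerid FROM sparkbug WHERE customerid IN (2, 3))
hc.sql("""
  SELECT a.customerid
  FROM sparkbug a
  LEFT SEMI JOIN (
    SELECT customerid FROM sparkbug WHERE customerid IN (2, 3)
  ) b
  ON a.customerid = b.customerid
""").show()
{code}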

  was:
I have a test table defined in Hive as follows:

CREATE TABLE sparkbug (
  id INT,
  event STRING
) STORED AS PARQUET;

and insert some sample data with ids 1, 2, 3.

In a Spark shell, I then create a HiveContext and then execute the following 
HQL to test out subquery predicates:

val hc = new HiveContext(sc)
hc.hql("select customerid from sparkbug where customerid in (select customerid from sparkbug where customerid in (2,3))")

I get the following error:

java.lang.RuntimeException: Unsupported language features in query: select 
customerid from sparkbug where customerid in (select customerid from sparkbug 
where customerid in (2,3))
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
sparkbug
  TOK_INSERT
TOK_DESTINATION
  TOK_DIR
TOK_TMP_FILE
TOK_SELECT
  TOK_SELEXPR
TOK_TABLE_OR_COL
  customerid
TOK_WHERE
  TOK_SUBQUERY_EXPR
TOK_SUBQUERY_OP
  in
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
sparkbug
  TOK_INSERT
TOK_DESTINATION
  TOK_DIR
TOK_TMP_FILE
TOK_SELECT
  TOK_SELEXPR
TOK_TABLE_OR_COL
  customerid
TOK_WHERE
  TOK_FUNCTION
in
TOK_TABLE_OR_COL
  customerid
2
3
TOK_TABLE_OR_COL
  customerid

scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
TOK_SUBQUERY_EXPR :
TOK_SUBQUERY_EXPR
  TOK_SUBQUERY_OP
in
  TOK_QUERY
TOK_FROM
  TOK_TABREF
TOK_TABNAME
  sparkbug
TOK_INSERT
  TOK_DESTINATION
TOK_DIR
  TOK_TMP_FILE
  TOK_SELECT
TOK_SELEXPR
  TOK_TABLE_OR_COL
customerid
  TOK_WHERE
TOK_FUNCTION
  in
  TOK_TABLE_OR_COL

[jira] [Assigned] (SPARK-2883) Spark Support for ORCFile format

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2883:
---

Assignee: (was: Apache Spark)

 Spark Support for ORCFile format
 

 Key: SPARK-2883
 URL: https://issues.apache.org/jira/browse/SPARK-2883
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, SQL
Reporter: Zhan Zhang
Priority: Blocker
 Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 
 pm jobtracker.png, orc.diff


 Verify the support of OrcInputFormat in Spark, fix any issues that exist, and add 
 documentation of its usage.
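
For anyone verifying this today, ORC data can usually be read through Hive's OrcInputFormat directly. The following spark-shell sketch is an assumption-laden illustration, not the proposed integration: it requires hive-exec on the classpath, and the path is hypothetical.

{code}
import org.apache.hadoop.hive.ql.io.orc.{OrcInputFormat, OrcStruct}
import org.apache.hadoop.io.NullWritable

// Read an ORC file as (NullWritable, OrcStruct) pairs via the old mapred InputFormat.
val orc = sc.hadoopFile(
  "/data/events.orc",  // hypothetical path
  classOf[OrcInputFormat],
  classOf[NullWritable],
  classOf[OrcStruct])

// OrcStruct exposes no typed accessors here, so this only eyeballs the rows.
orc.map { case (_, row) => row.toString }.take(5).foreach(println)
{code}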



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-2883) Spark Support for ORCFile format

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2883:
---

Assignee: Apache Spark

 Spark Support for ORCFile format
 

 Key: SPARK-2883
 URL: https://issues.apache.org/jira/browse/SPARK-2883
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, SQL
Reporter: Zhan Zhang
Assignee: Apache Spark
Priority: Blocker
 Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 
 pm jobtracker.png, orc.diff


 Verify the support of OrcInputFormat in Spark, fix any issues that exist, and add 
 documentation of its usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6258) Python MLlib API missing items: Clustering

2015-03-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387115#comment-14387115
 ] 

Joseph K. Bradley commented on SPARK-6258:
--

[~hrishikesh], glad to hear you're interested!  I'd recommend picking off one 
of these tasks.  I just created another JIRA for part of this task which should 
be a good one to start with: [SPARK-6612]  Does that sound good?  Also, please 
check out this guide; we try to follow these guidelines closely: 
[https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark]

If you have implementation questions, we can discuss them on github after you 
send a PR.

Thanks!

 Python MLlib API missing items: Clustering
 --

 Key: SPARK-6258
 URL: https://issues.apache.org/jira/browse/SPARK-6258
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between the Python and Scala 
 documentation; some parts of the Python API are less well-documented than 
 their Scala counterparts.  Some items may be listed in the umbrella JIRA 
 linked to this task.
 KMeans
 * setEpsilon
 * setInitializationSteps
 KMeansModel
 * computeCost
 * k
 GaussianMixture
 * setInitialModel
 GaussianMixtureModel
 * k
 Completely missing items which should be fixed in separate JIRAs (which have 
 been created and linked to the umbrella JIRA)
 * LDA
 * PowerIterationClustering
 * StreamingKMeans



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6403) Launch master as spot instance on EC2

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6403:
---

Assignee: (was: Apache Spark)

 Launch master as spot instance on EC2
 -

 Key: SPARK-6403
 URL: https://issues.apache.org/jira/browse/SPARK-6403
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Affects Versions: 1.2.1
Reporter: Adam Vogel
Priority: Minor

 Currently the spark_ec2.py script only supports requesting slaves as spot 
 instances. Launching the master as a spot instance offers potential cost 
 savings, at the risk of losing the Spark cluster without warning. Unless users 
 include logic to relaunch lost slaves, all slaves are usually lost 
 simultaneously anyway. Thus, for jobs that do not require resilience to losing 
 spot instances, being able to launch the master as a spot instance saves money.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6611) Add support for INTEGER as synonym of INT to DDLParser

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6611:
---

Assignee: Apache Spark

 Add support for INTEGER as synonym of INT to DDLParser
 --

 Key: SPARK-6611
 URL: https://issues.apache.org/jira/browse/SPARK-6611
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Santiago M. Mola
Assignee: Apache Spark
Priority: Minor

 Add support for INTEGER as synonym of INT to DDLParser.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6611) Add support for INTEGER as synonym of INT to DDLParser

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6611:
---

Assignee: (was: Apache Spark)

 Add support for INTEGER as synonym of INT to DDLParser
 --

 Key: SPARK-6611
 URL: https://issues.apache.org/jira/browse/SPARK-6611
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Santiago M. Mola
Priority: Minor

 Add support for INTEGER as synonym of INT to DDLParser.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6611) Add support for INTEGER as synonym of INT to DDLParser

2015-03-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386945#comment-14386945
 ] 

Apache Spark commented on SPARK-6611:
-

User 'smola' has created a pull request for this issue:
https://github.com/apache/spark/pull/5271

 Add support for INTEGER as synonym of INT to DDLParser
 --

 Key: SPARK-6611
 URL: https://issues.apache.org/jira/browse/SPARK-6611
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Santiago M. Mola
Priority: Minor

 Add support for INTEGER as synonym of INT to DDLParser.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6403) Launch master as spot instance on EC2

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6403:
---

Assignee: Apache Spark

 Launch master as spot instance on EC2
 -

 Key: SPARK-6403
 URL: https://issues.apache.org/jira/browse/SPARK-6403
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Affects Versions: 1.2.1
Reporter: Adam Vogel
Assignee: Apache Spark
Priority: Minor

 Currently the spark_ec2.py script only supports requesting slaves as spot 
 instances. Launching the master as a spot instance offers potential cost 
 savings, at the risk of losing the Spark cluster without warning. Unless users 
 include logic to relaunch lost slaves, all slaves are usually lost 
 simultaneously anyway. Thus, for jobs that do not require resilience to losing 
 spot instances, being able to launch the master as a spot instance saves money.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6239) Spark MLlib fpm#FPGrowth minSupport should use long instead

2015-03-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386933#comment-14386933
 ] 

Sean Owen commented on SPARK-6239:
--

It's just a little extra API overhead for little gain, IMHO. I'm open to other 
committers' opinions, but it just didn't seem worth changing. I would imagine a 
relative value is usually more useful.
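
For anyone who, like the reporter, wants an absolute threshold such as minCount = 2: it can be derived from the dataset size before calling run. A minimal spark-shell sketch, where `transactions` is an assumed RDD[Array[String]] of itemsets:

{code}
import org.apache.spark.mllib.fpm.FPGrowth

// `transactions` is an assumed RDD[Array[String]] of itemsets.
val numTransactions = transactions.count()

// minCount = ceil(minSupport * count); targeting an absolute count of 2, we aim slightly
// below 2.0 / numTransactions so floating-point rounding cannot push ceil() up to 3.
val minSupport = 1.5 / numTransactions

val model = new FPGrowth()
  .setMinSupport(minSupport)
  .setNumPartitions(10)
  .run(transactions)
{code}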

 Spark MLlib fpm#FPGrowth minSupport should use long instead
 ---

 Key: SPARK-6239
 URL: https://issues.apache.org/jira/browse/SPARK-6239
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Littlestar
Priority: Minor

 Spark MLlib fpm#FPGrowth minSupport should use a long (absolute count) instead
 ==
 val minCount = math.ceil(minSupport * count).toLong
 because:
 1. [count] the number of transactions in the dataset is not known before it is read.
 2. [minSupport] is a double and is subject to precision issues.
 From mahout#FPGrowthDriver.java:
 addOption("minSupport", "s", "(Optional) The minimum number of times a 
 co-occurrence must be present." + " Default Value: 3", "3");
 I just want to set minCount=2 for a test.
 Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-03-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387085#comment-14387085
 ] 

Joseph K. Bradley commented on SPARK-5564:
--

[~debasish83] I mainly used a Wikipedia dataset.  Here's an S3 bucket 
(requester pays) that [~sparks] created: 
[s3://files.sparks.requester.pays/enwiki_category_text/], which holds a big 
Wikipedia dataset.  I'm not sure if it's the same one I used, but it should be 
similar qualitatively.  Mine had ~1.1 billion tokens, with about 1 million 
documents and 1 million terms (vocab size).

As far as scaling, the EM code scaled linearly with the number of topics K.  
Communication was the bottleneck for sizable datasets, and it scales linearly 
with K.  The largest K I've run with on that dataset was K=100; that was using 
16 r3.2xlarge workers.

 Support sparse LDA solutions
 

 Key: SPARK-5564
 URL: https://issues.apache.org/jira/browse/SPARK-5564
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) currently requires that the priors’ 
 concentration parameters be > 1.0.  It should support values > 0.0, which 
 should encourage sparser topics (phi) and document-topic distributions 
 (theta).
 For EM, this will require adding a projection to the M-step, as in: Vorontsov 
 and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive 
 Regularization for Stochastic Matrix Factorization. 2014.
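
For reference, the parameters in question on the current EM-based implementation, in a minimal spark-shell sketch; `corpus` is an assumed RDD[(Long, Vector)] of (docId, termCounts), and the values shown respect the current > 1.0 restriction that this ticket proposes to relax:

{code}
import org.apache.spark.mllib.clustering.LDA

// `corpus` is an assumed RDD[(Long, Vector)] of (docId, termCountVector).
val lda = new LDA()
  .setK(20)
  .setDocConcentration(1.1)    // alpha: currently must be > 1.0
  .setTopicConcentration(1.1)  // beta/eta: currently must be > 1.0
  .setMaxIterations(50)

val model = lda.run(corpus)
{code}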



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6611) Add support for INTEGER as synonym of INT to DDLParser

2015-03-30 Thread Santiago M. Mola (JIRA)
Santiago M. Mola created SPARK-6611:
---

 Summary: Add support for INTEGER as synonym of INT to DDLParser
 Key: SPARK-6611
 URL: https://issues.apache.org/jira/browse/SPARK-6611
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Santiago M. Mola
Priority: Minor


Add support for INTEGER as synonym of INT to DDLParser.
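
To make the requested behavior concrete, here is a hypothetical DDL statement that the DDLParser should accept once INTEGER is treated like INT; the JSON data source and path are made up for illustration.

{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Today this fails to parse because of the INTEGER column type; with the change
// it should behave exactly like declaring the column as INT.
sqlContext.sql("""
  CREATE TEMPORARY TABLE people (id INTEGER, name STRING)
  USING org.apache.spark.sql.json
  OPTIONS (path '/tmp/people.json')
""")
{code}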



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6603) SQLContext.registerFunction - SQLContext.udf.register

2015-03-30 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387105#comment-14387105
 ] 

Reynold Xin commented on SPARK-6603:


How about not deprecating registerFunction and having both, so that Scala users 
migrating to Python don't get as confused?
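
For comparison, the Scala SQLContext already routes UDF registration through sqlContext.udf.register. A minimal spark-shell sketch (the UDF and the sample table are made up):

{code}
import sqlContext.implicits._

// Scala already goes through sqlContext.udf.register.
sqlContext.udf.register("strLen", (s: String) => s.length)

val words = sc.parallelize(Seq(("spark", 1), ("sql", 2))).toDF("word", "id")
words.registerTempTable("words")

sqlContext.sql("SELECT word, strLen(word) FROM words").show()
{code}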


 SQLContext.registerFunction - SQLContext.udf.register
 --

 Key: SPARK-6603
 URL: https://issues.apache.org/jira/browse/SPARK-6603
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Davies Liu

 We didn't change the Python implementation to use that. Maybe the best 
 strategy is to deprecate SQLContext.registerFunction, and just add 
 SQLContext.udf.register.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6612) Python KMeans parity

2015-03-30 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6612:


 Summary: Python KMeans parity
 Key: SPARK-6612
 URL: https://issues.apache.org/jira/browse/SPARK-6612
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor


This is a subtask of [SPARK-6258] for the Python API of KMeans.  These items 
are missing (the corresponding Scala calls are sketched after the list):

KMeans
* setEpsilon
* setInitializationSteps

KMeansModel
* computeCost
* k
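
For reference, the Scala calls that the Python wrappers would need to mirror, in a rough spark-shell sketch; `data` is an assumed RDD[Vector] of feature vectors:

{code}
import org.apache.spark.mllib.clustering.KMeans

// `data` is an assumed RDD[Vector] of feature vectors.
val kmeans = new KMeans()
  .setK(5)
  .setEpsilon(1e-4)            // missing from the Python API
  .setInitializationSteps(5)   // missing from the Python API

val model = kmeans.run(data)

model.k                        // missing from the Python model
model.computeCost(data)        // missing from the Python model
{code}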



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5990) Model import/export for IsotonicRegression

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5990:
---

Assignee: Apache Spark

 Model import/export for IsotonicRegression
 --

 Key: SPARK-5990
 URL: https://issues.apache.org/jira/browse/SPARK-5990
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Apache Spark

 Add save/load for IsotonicRegressionModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5990) Model import/export for IsotonicRegression

2015-03-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5990:
---

Assignee: (was: Apache Spark)

 Model import/export for IsotonicRegression
 --

 Key: SPARK-5990
 URL: https://issues.apache.org/jira/browse/SPARK-5990
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Add save/load for IsotonicRegressionModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


