[jira] [Created] (SPARK-6603) SQLContext.registerFunction -> SQLContext.udf.register
Reynold Xin created SPARK-6603: -- Summary: SQLContext.registerFunction -> SQLContext.udf.register Key: SPARK-6603 URL: https://issues.apache.org/jira/browse/SPARK-6603 Project: Spark Issue Type: Improvement Reporter: Reynold Xin Assignee: Davies Liu We didn't change the Python implementation to use the new udf.register API. Maybe the best strategy is to deprecate SQLContext.registerFunction and just add SQLContext.udf.register. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
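For context, a minimal Scala sketch of the udf.register style that the ticket proposes the Python API should mirror (the Python side currently exposes registerFunction instead); this is an illustration of the API shape, not the patch itself, and the table contents are made up:
{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)        // sc: the SparkContext from spark-shell
import sqlContext.implicits._

// Register a UDF under the udf namespace and use it from SQL.
sqlContext.udf.register("strLen", (s: String) => s.length)

val words = sc.parallelize(Seq(("spark", 1), ("streaming", 2))).toDF("word", "n")
words.registerTempTable("words")
sqlContext.sql("SELECT word, strLen(word) FROM words").show()
{code}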
[jira] [Created] (SPARK-6604) Specify IP of Python server socket
Weizhong created SPARK-6604: --- Summary: Specify IP of Python server socket Key: SPARK-6604 URL: https://issues.apache.org/jira/browse/SPARK-6604 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Weizhong Priority: Minor The driver currently starts a server socket bound to a wildcard IP; binding to the loopback address 127.0.0.1 is more reasonable, since the socket is only used by the local Python process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
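A minimal JVM-side sketch of the difference between the two bindings; the snippet is illustrative only and is not the actual PySpark driver code:
{code}
import java.net.{InetAddress, ServerSocket}

// Wildcard bind: reachable from any network interface on the host.
val wildcardSocket = new ServerSocket(0)

// Loopback bind: only local processes (such as the Python worker talking to
// the driver) can connect, which is all PySpark needs here.
val loopbackSocket = new ServerSocket(0, 1, InetAddress.getByName("127.0.0.1"))

println(s"wildcard port = ${wildcardSocket.getLocalPort}, loopback port = ${loopbackSocket.getLocalPort}")
{code}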
[jira] [Commented] (SPARK-6605) Same transformation in DStream leads to different result
[ https://issues.apache.org/jira/browse/SPARK-6605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386324#comment-14386324 ] SaintBacchus commented on SPARK-6605: - Hi [~tdas], can you have a look at this problem? Is this acceptable for streaming users? Same transformation in DStream leads to different result Key: SPARK-6605 URL: https://issues.apache.org/jira/browse/SPARK-6605 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: SaintBacchus Fix For: 1.4.0 The transformation *reduceByKeyAndWindow* has two implementations: one uses *WindowedDStream* and the other uses *ReducedWindowedDStream*. The result is always the same, except when an empty window occurs. As a word-count example, if a period of time (larger than the window duration) has no data coming in, the first *reduceByKeyAndWindow* has no elements inside, but the second has many elements with a zero value inside. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
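For reference, a sketch of the two call forms being compared: the overload without an inverse-reduce function goes through the plain windowed path, while the overload with an inverse function uses *ReducedWindowedDStream*. This assumes a pair DStream of word counts built elsewhere:
{code}
import org.apache.spark.streaming.Seconds

// words: DStream[String] created elsewhere (e.g. from socketTextStream)
val pairs = words.map(w => (w, 1))

// Form 1: window + reduce, no inverse function
val counts1 = pairs.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

// Form 2: incremental, with an inverse function (ReducedWindowedDStream);
// this is the variant reported to keep keys with zero values after empty windows
val counts2 = pairs.reduceByKeyAndWindow(_ + _, _ - _, Seconds(30), Seconds(10))
{code}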
[jira] [Assigned] (SPARK-6604) Specify IP of Python server socket
[ https://issues.apache.org/jira/browse/SPARK-6604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6604: --- Assignee: (was: Apache Spark) Specify IP of Python server socket -- Key: SPARK-6604 URL: https://issues.apache.org/jira/browse/SPARK-6604 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Weizhong Priority: Minor The driver currently starts a server socket bound to a wildcard IP; binding to the loopback address 127.0.0.1 is more reasonable, since the socket is only used by the local Python process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5895) Add VectorSlicer
[ https://issues.apache.org/jira/browse/SPARK-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386296#comment-14386296 ] Xusen Yin commented on SPARK-5895: -- Is it possible to select by type or value? Like in Pandas: data = raw_data.select_dtypes(include=[np.float64, np.int64]) Say, I want to select all computable types such as int, double, float, or I want to select all columns that include NaN. Add VectorSlicer Key: SPARK-5895 URL: https://issues.apache.org/jira/browse/SPARK-5895 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng `VectorSlicer` takes a vector column and outputs a vector column with a subset of features. {code} val vs = new VectorSlicer() .setInputCol("user") .setSelectedFeatures("age", "salary") .setOutputCol("usefulUserFeatures") {code} We should allow specifying selected features by indices and by names. It should preserve the output names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6594) Spark Streaming can't receive data from kafka
[ https://issues.apache.org/jira/browse/SPARK-6594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386313#comment-14386313 ] Saisai Shao commented on SPARK-6594: Hi [~q79969786], would you please paste more of the warning logs? From your code it is hard to tell why Spark Streaming cannot receive data. Also, would you please let us know which version of Spark you're using? My guess is that this is probably a misuse of Spark Streaming and Kafka rather than a bug; lots of people use Kafka + Spark Streaming, so a genuine bug would have been reported quickly. Spark Streaming can't receive data from kafka - Key: SPARK-6594 URL: https://issues.apache.org/jira/browse/SPARK-6594 Project: Spark Issue Type: Bug Affects Versions: 1.2.1 Environment: kafka_2.10-0.8.1.1 + Spark-1.2.1 Reporter: q79969786 I use KafkaUtils to receive data from Kafka in my Spark Streaming application as follows: Map<String, Integer> topicorder = new HashMap<String, Integer>(); topicorder.put("order", Integer.valueOf(readThread)); JavaPairReceiverInputDStream<String, String> jPRIDSOrder = KafkaUtils.createStream(jssc, zkQuorum, group, topicorder); It worked well at first, but after I submitted this application several times, Spark Streaming can't receive data anymore (Kafka works well). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway
[ https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6600: --- Assignee: (was: Apache Spark) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway -- Key: SPARK-6600 URL: https://issues.apache.org/jira/browse/SPARK-6600 Project: Spark Issue Type: New Feature Components: EC2 Reporter: Florian Verhein Use case: User has set up the hadoop hdfs nfs gateway service on their spark_ec2.py launched cluster, and wants to mount that on their local machine. Requires the following ports to be opened on incoming rule set for MASTER for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works) Note that this issue *does not* cover the implementation of a hdfs nfs gateway module in the spark-ec2 project. See linked issue. Reference: https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway
[ https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6600: --- Assignee: Apache Spark Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway -- Key: SPARK-6600 URL: https://issues.apache.org/jira/browse/SPARK-6600 Project: Spark Issue Type: New Feature Components: EC2 Reporter: Florian Verhein Assignee: Apache Spark Use case: User has set up the hadoop hdfs nfs gateway service on their spark_ec2.py launched cluster, and wants to mount that on their local machine. Requires the following ports to be opened on incoming rule set for MASTER for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works) Note that this issue *does not* cover the implementation of a hdfs nfs gateway module in the spark-ec2 project. See linked issue. Reference: https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway
[ https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386267#comment-14386267 ] Apache Spark commented on SPARK-6600: - User 'florianverhein' has created a pull request for this issue: https://github.com/apache/spark/pull/5257 Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway -- Key: SPARK-6600 URL: https://issues.apache.org/jira/browse/SPARK-6600 Project: Spark Issue Type: New Feature Components: EC2 Reporter: Florian Verhein Use case: User has set up the hadoop hdfs nfs gateway service on their spark_ec2.py launched cluster, and wants to mount that on their local machine. Requires the following ports to be opened on incoming rule set for MASTER for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works) Note that this issue *does not* cover the implementation of a hdfs nfs gateway module in the spark-ec2 project. See linked issue. Reference: https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6604) Specify IP of Python server socket
[ https://issues.apache.org/jira/browse/SPARK-6604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386268#comment-14386268 ] Apache Spark commented on SPARK-6604: - User 'Sephiroth-Lin' has created a pull request for this issue: https://github.com/apache/spark/pull/5256 Specify IP of Python server socket -- Key: SPARK-6604 URL: https://issues.apache.org/jira/browse/SPARK-6604 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Weizhong Priority: Minor The driver currently starts a server socket bound to a wildcard IP; binding to the loopback address 127.0.0.1 is more reasonable, since the socket is only used by the local Python process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6604) Specify IP of Python server socket
[ https://issues.apache.org/jira/browse/SPARK-6604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6604: --- Assignee: Apache Spark Specify IP of Python server socket -- Key: SPARK-6604 URL: https://issues.apache.org/jira/browse/SPARK-6604 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Weizhong Assignee: Apache Spark Priority: Minor The driver currently starts a server socket bound to a wildcard IP; binding to the loopback address 127.0.0.1 is more reasonable, since the socket is only used by the local Python process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6605) Same transformation in DStream leads to different result
SaintBacchus created SPARK-6605: --- Summary: Same transformation in DStream leads to different result Key: SPARK-6605 URL: https://issues.apache.org/jira/browse/SPARK-6605 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: SaintBacchus Fix For: 1.4.0 The transformation *reduceByKeyAndWindow* has two implementations: one uses *WindowedDStream* and the other uses *ReducedWindowedDStream*. The result is always the same, except when an empty window occurs. As a word-count example, if a period of time (larger than the window duration) has no data coming in, the first *reduceByKeyAndWindow* has no elements inside, but the second has many elements with a zero value inside. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5895) Add VectorSlicer
[ https://issues.apache.org/jira/browse/SPARK-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386365#comment-14386365 ] Xusen Yin edited comment on SPARK-5895 at 3/30/15 8:14 AM: --- I have another concern here. We cannot reveal each column name of a `Vector`. Given the selected features "age" and "salary", how do we select these two columns from a vector? One solution is to give it a list of column names, say `setColumnNames(List[String])`. But a more natural way to solve it is to add the list of column names in the `VectorAssembler`. was (Author: yinxusen): I have another concern here. We cannot reveal each column name of a `Vector`. Given the selected features "age" and "salary", how do we select these two columns from a vector? One solution is to give it a list of column names, say `setColumnNames(List[String])`. But a more natural way to solve it is to add the list of column names in `VectorAssembler`. Add VectorSlicer Key: SPARK-5895 URL: https://issues.apache.org/jira/browse/SPARK-5895 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng `VectorSlicer` takes a vector column and outputs a vector column with a subset of features. {code} val vs = new VectorSlicer() .setInputCol("user") .setSelectedFeatures("age", "salary") .setOutputCol("usefulUserFeatures") {code} We should allow specifying selected features by indices and by names. It should preserve the output names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5894) Add PolynomialMapper
[ https://issues.apache.org/jira/browse/SPARK-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386346#comment-14386346 ] Xusen Yin commented on SPARK-5894: -- [~mengxr] [~josephkb] Do you have time to check it? Thanks! Add PolynomialMapper Key: SPARK-5894 URL: https://issues.apache.org/jira/browse/SPARK-5894 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng `PolynomialMapper` takes a vector column and outputs a vector column with polynomial feature mapping. {code} val poly = new PolynomialMapper() .setInputCol("features") .setDegree(2) .setOutputCols("polyFeatures") {code} It should handle the output feature names properly. Maybe we can make a better name for it instead of calling it `PolynomialMapper`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6606) Accumulator deserialized twice because the NarrowCoGroupSplitDep contains rdd object.
[ https://issues.apache.org/jira/browse/SPARK-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386355#comment-14386355 ] Apache Spark commented on SPARK-6606: - User 'suyanNone' has created a pull request for this issue: https://github.com/apache/spark/pull/5259 Accumulator deserialized twice because the NarrowCoGroupSplitDep contains rdd object. - Key: SPARK-6606 URL: https://issues.apache.org/jira/browse/SPARK-6606 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0, 1.3.0 Reporter: SuYan 1. Using code like the example below, the accumulator is found to be deserialized twice. first: {code} task = ser.deserialize[Task[Any]](taskBytes, Thread.currentThread.getContextClassLoader) {code} second: {code} val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])]( ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader) {code} The first deserialization is not what is expected, because a ResultTask or ShuffleMapTask will have a partition object. In class {code} CoGroupedRDD[K](@transient var rdds: Seq[RDD[_ <: Product2[K, _]]], part: Partitioner) {code}, the CoGroupPartition may contain a CoGroupSplitDep: {code} NarrowCoGroupSplitDep( rdd: RDD[_], splitIndex: Int, var split: Partition ) extends CoGroupSplitDep { {code} That *NarrowCoGroupSplitDep* pulls in the rdd object, which results in the first deserialization. example: {code} val acc1 = sc.accumulator(0, "test1") val acc2 = sc.accumulator(0, "test2") val rdd1 = sc.parallelize((1 to 10).toSeq, 3) val rdd2 = sc.parallelize((1 to 10).toSeq, 3) val combine1 = rdd1.map { case a => (a, 1) }.combineByKey(a => { acc1 += 1; a }, (a: Int, b: Int) => { a + b }, (a: Int, b: Int) => { a + b }, new HashPartitioner(3), mapSideCombine = false) val combine2 = rdd2.map { case a => (a, 1) }.combineByKey( a => { acc2 += 1; a }, (a: Int, b: Int) => { a + b }, (a: Int, b: Int) => { a + b }, new HashPartitioner(3), mapSideCombine = false) combine1.cogroup(combine2, new HashPartitioner(3)).count() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6606) Accumulator deserialized twice because the NarrowCoGroupSplitDep contains rdd object.
[ https://issues.apache.org/jira/browse/SPARK-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6606: --- Assignee: Apache Spark Accumulator deserialized twice because the NarrowCoGroupSplitDep contains rdd object. - Key: SPARK-6606 URL: https://issues.apache.org/jira/browse/SPARK-6606 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0, 1.3.0 Reporter: SuYan Assignee: Apache Spark 1. Using code like the example below, the accumulator is found to be deserialized twice. first: {code} task = ser.deserialize[Task[Any]](taskBytes, Thread.currentThread.getContextClassLoader) {code} second: {code} val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])]( ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader) {code} The first deserialization is not what is expected, because a ResultTask or ShuffleMapTask will have a partition object. In class {code} CoGroupedRDD[K](@transient var rdds: Seq[RDD[_ <: Product2[K, _]]], part: Partitioner) {code}, the CoGroupPartition may contain a CoGroupSplitDep: {code} NarrowCoGroupSplitDep( rdd: RDD[_], splitIndex: Int, var split: Partition ) extends CoGroupSplitDep { {code} That *NarrowCoGroupSplitDep* pulls in the rdd object, which results in the first deserialization. example: {code} val acc1 = sc.accumulator(0, "test1") val acc2 = sc.accumulator(0, "test2") val rdd1 = sc.parallelize((1 to 10).toSeq, 3) val rdd2 = sc.parallelize((1 to 10).toSeq, 3) val combine1 = rdd1.map { case a => (a, 1) }.combineByKey(a => { acc1 += 1; a }, (a: Int, b: Int) => { a + b }, (a: Int, b: Int) => { a + b }, new HashPartitioner(3), mapSideCombine = false) val combine2 = rdd2.map { case a => (a, 1) }.combineByKey( a => { acc2 += 1; a }, (a: Int, b: Int) => { a + b }, (a: Int, b: Int) => { a + b }, new HashPartitioner(3), mapSideCombine = false) combine1.cogroup(combine2, new HashPartitioner(3)).count() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6594) Spark Streaming can't receive data from kafka
[ https://issues.apache.org/jira/browse/SPARK-6594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-6594. Resolution: Not a Problem Spark Streaming can't receive data from kafka - Key: SPARK-6594 URL: https://issues.apache.org/jira/browse/SPARK-6594 Project: Spark Issue Type: Bug Affects Versions: 1.2.1 Environment: kafka_2.10-0.8.1.1 + Spark-1.2.1 Reporter: q79969786 I use KafkaUtils to receive data from Kafka in my Spark Streaming application as follows: Map<String, Integer> topicorder = new HashMap<String, Integer>(); topicorder.put("order", Integer.valueOf(readThread)); JavaPairReceiverInputDStream<String, String> jPRIDSOrder = KafkaUtils.createStream(jssc, zkQuorum, group, topicorder); It worked well at first, but after I submitted this application several times, Spark Streaming can't receive data anymore (Kafka works well). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6528) IDF transformer
[ https://issues.apache.org/jira/browse/SPARK-6528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386460#comment-14386460 ] Xusen Yin commented on SPARK-6528: -- [~josephkb] Please assign it to me. IDF transformer --- Key: SPARK-6528 URL: https://issues.apache.org/jira/browse/SPARK-6528 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xusen Yin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6401) Unable to load an old API input format in Spark streaming
[ https://issues.apache.org/jira/browse/SPARK-6401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386491#comment-14386491 ] Thomas F. commented on SPARK-6401: -- How do we proceed with this issue? Since DStream already has saveAsHadoopFiles (with the historical OutputFormat) and saveAsNewAPIHadoopFiles (with NewOutputFormat), do we rename StreamingContext.fileStream() to StreamingContext.newAPIHadoopFileStream (with NewInputFormat) and then add hadoopFileStream (with InputFormat), to be completely aligned with Spark Core for Hadoop input/output? A sketch of the proposed shapes is given below. Brgds. Unable to load an old API input format in Spark streaming Key: SPARK-6401 URL: https://issues.apache.org/jira/browse/SPARK-6401 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Rémy DUBOIS Priority: Minor The fileStream method of the JavaStreamingContext class does not allow using an old API InputFormat. This feature exists in Spark batch but not in streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
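To make the proposal concrete, here is a sketch of the two method shapes being suggested; newAPIHadoopFileStream and hadoopFileStream are the commenter's proposed names, not existing methods, and the signatures are modeled loosely on SparkContext.hadoopFile / newAPIHadoopFile:
{code}
import org.apache.spark.streaming.dstream.DStream

// Sketch of the proposed StreamingContext additions; bodies intentionally left out.
def hadoopFileStream[K, V, F <: org.apache.hadoop.mapred.InputFormat[K, V]](
    directory: String): DStream[(K, V)] = ???        // old (mapred) API

def newAPIHadoopFileStream[K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K, V]](
    directory: String): DStream[(K, V)] = ???        // new (mapreduce) API, i.e. today's fileStream
{code}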
[jira] [Created] (SPARK-6606) Accumulator deserialized twice because the NarrowCoGroupSplitDep contains rdd object.
SuYan created SPARK-6606: Summary: Accumulator deserialized twice because the NarrowCoGroupSplitDep contains rdd object. Key: SPARK-6606 URL: https://issues.apache.org/jira/browse/SPARK-6606 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0, 1.2.0 Reporter: SuYan 1. Using code like the example below, the accumulator is found to be deserialized twice. first: {code} task = ser.deserialize[Task[Any]](taskBytes, Thread.currentThread.getContextClassLoader) {code} second: {code} val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])]( ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader) {code} The first deserialization is not what is expected, because a ResultTask or ShuffleMapTask will have a partition object. In class {code} CoGroupedRDD[K](@transient var rdds: Seq[RDD[_ <: Product2[K, _]]], part: Partitioner) {code}, the CoGroupPartition may contain a CoGroupSplitDep: {code} NarrowCoGroupSplitDep( rdd: RDD[_], splitIndex: Int, var split: Partition ) extends CoGroupSplitDep { {code} That *NarrowCoGroupSplitDep* pulls in the rdd object, which results in the first deserialization. example: {code} val acc1 = sc.accumulator(0, "test1") val acc2 = sc.accumulator(0, "test2") val rdd1 = sc.parallelize((1 to 10).toSeq, 3) val rdd2 = sc.parallelize((1 to 10).toSeq, 3) val combine1 = rdd1.map { case a => (a, 1) }.combineByKey(a => { acc1 += 1; a }, (a: Int, b: Int) => { a + b }, (a: Int, b: Int) => { a + b }, new HashPartitioner(3), mapSideCombine = false) val combine2 = rdd2.map { case a => (a, 1) }.combineByKey( a => { acc2 += 1; a }, (a: Int, b: Int) => { a + b }, (a: Int, b: Int) => { a + b }, new HashPartitioner(3), mapSideCombine = false) combine1.cogroup(combine2, new HashPartitioner(3)).count() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6258) Python MLlib API missing items: Clustering
[ https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386394#comment-14386394 ] Hrishikesh commented on SPARK-6258: --- Hi [~josephkb] I am a newbie to spark and I would like to contribute. Could you assign this ticket to me? Python MLlib API missing items: Clustering -- Key: SPARK-6258 URL: https://issues.apache.org/jira/browse/SPARK-6258 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task. KMeans * setEpsilon * setInitializationSteps KMeansModel * computeCost * k GaussianMixture * setInitialModel GaussianMixtureModel * k Completely missing items which should be fixed in separate JIRAs (which have been created and linked to the umbrella JIRA) * LDA * PowerIterationClustering * StreamingKMeans -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
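For orientation, a sketch of the Scala-side calls that the list above refers to and that the Python wrapper would need to mirror; this is only a sketch against the Scala MLlib API of the time, so exact setter names and availability should be checked before relying on it:
{code}
import org.apache.spark.mllib.clustering.{GaussianMixture, KMeans}
import org.apache.spark.mllib.linalg.Vectors

// Toy data; sc is the SparkContext from spark-shell.
val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.5, 0.5), Vectors.dense(1.0, 1.0),
  Vectors.dense(9.0, 8.0), Vectors.dense(8.5, 8.0), Vectors.dense(9.5, 9.0)))

// KMeans items listed as missing from the Python API
val kmeansModel = new KMeans()
  .setK(2)
  .setEpsilon(1e-4)
  .setInitializationSteps(5)
  .run(data)
val cost = kmeansModel.computeCost(data)   // KMeansModel.computeCost
val k = kmeansModel.k                      // KMeansModel.k

// GaussianMixture.setInitialModel and GaussianMixtureModel.k are the other listed gaps
val warmStart = new GaussianMixture().setK(2).run(data)
val gmmModel = new GaussianMixture().setK(2).setInitialModel(warmStart).run(data)
val gmmK = gmmModel.k
{code}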
[jira] [Commented] (SPARK-6401) Unable to load an old API input format in Spark streaming
[ https://issues.apache.org/jira/browse/SPARK-6401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386499#comment-14386499 ] Sean Owen commented on SPARK-6401: -- Since it's technically an API change to streaming I'd look for a nod from [~tdas] before proceeding. Unable to load an old API input format in Spark streaming Key: SPARK-6401 URL: https://issues.apache.org/jira/browse/SPARK-6401 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Rémy DUBOIS Priority: Minor The fileStream method of the JavaStreamingContext class does not allow using an old API InputFormat. This feature exists in Spark batch but not in streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5895) Add VectorSlicer
[ https://issues.apache.org/jira/browse/SPARK-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386365#comment-14386365 ] Xusen Yin commented on SPARK-5895: -- I have another concern here. We cannot reveal each column name of a `Vector`. Given the selected features "age" and "salary", how do we select these two columns from a vector? One solution is to give it a list of column names, say `setColumnNames(List[String])`. But a more natural way to solve it is to add the list of column names in `VectorAssembler`. Add VectorSlicer Key: SPARK-5895 URL: https://issues.apache.org/jira/browse/SPARK-5895 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng `VectorSlicer` takes a vector column and outputs a vector column with a subset of features. {code} val vs = new VectorSlicer() .setInputCol("user") .setSelectedFeatures("age", "salary") .setOutputCol("usefulUserFeatures") {code} We should allow specifying selected features by indices and by names. It should preserve the output names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6594) Spark Streaming can't receive data from kafka
[ https://issues.apache.org/jira/browse/SPARK-6594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386380#comment-14386380 ] q79969786 edited comment on SPARK-6594 at 3/30/15 8:25 AM: --- Thanks, it worked after restarting Kafka. was (Author: q79969786): Thanks, it worked after rebooting Kafka. Spark Streaming can't receive data from kafka - Key: SPARK-6594 URL: https://issues.apache.org/jira/browse/SPARK-6594 Project: Spark Issue Type: Bug Affects Versions: 1.2.1 Environment: kafka_2.10-0.8.1.1 + Spark-1.2.1 Reporter: q79969786 I use KafkaUtils to receive data from Kafka in my Spark Streaming application as follows: Map<String, Integer> topicorder = new HashMap<String, Integer>(); topicorder.put("order", Integer.valueOf(readThread)); JavaPairReceiverInputDStream<String, String> jPRIDSOrder = KafkaUtils.createStream(jssc, zkQuorum, group, topicorder); It worked well at first, but after I submitted this application several times, Spark Streaming can't receive data anymore (Kafka works well). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6594) Spark Streaming can't receive data from kafka
[ https://issues.apache.org/jira/browse/SPARK-6594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386380#comment-14386380 ] q79969786 commented on SPARK-6594: -- Thanks, it worked after rebooting Kafka. Spark Streaming can't receive data from kafka - Key: SPARK-6594 URL: https://issues.apache.org/jira/browse/SPARK-6594 Project: Spark Issue Type: Bug Affects Versions: 1.2.1 Environment: kafka_2.10-0.8.1.1 + Spark-1.2.1 Reporter: q79969786 I use KafkaUtils to receive data from Kafka in my Spark Streaming application as follows: Map<String, Integer> topicorder = new HashMap<String, Integer>(); topicorder.put("order", Integer.valueOf(readThread)); JavaPairReceiverInputDStream<String, String> jPRIDSOrder = KafkaUtils.createStream(jssc, zkQuorum, group, topicorder); It worked well at first, but after I submitted this application several times, Spark Streaming can't receive data anymore (Kafka works well). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6605) Same transformation in DStream leads to different result
[ https://issues.apache.org/jira/browse/SPARK-6605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386510#comment-14386510 ] Sean Owen commented on SPARK-6605: -- What other implementation are you referring to? I can't see one that uses {{WindowedDStream}}. Are you talking about calling {{window}} and {{reduceByKey}} separately? What do you mean by many zeroes -- if there is no data, what could be the key? Same transformation in DStream leads to different result Key: SPARK-6605 URL: https://issues.apache.org/jira/browse/SPARK-6605 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: SaintBacchus Fix For: 1.4.0 The transformation *reduceByKeyAndWindow* has two implementations: one uses *WindowedDStream* and the other uses *ReducedWindowedDStream*. The result is always the same, except when an empty window occurs. As a word-count example, if a period of time (larger than the window duration) has no data coming in, the first *reduceByKeyAndWindow* has no elements inside, but the second has many elements with a zero value inside. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4226) SparkSQL - Add support for subqueries in predicates
[ https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4226: --- Assignee: Apache Spark SparkSQL - Add support for subqueries in predicates --- Key: SPARK-4226 URL: https://issues.apache.org/jira/browse/SPARK-4226 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Environment: Spark 1.2 snapshot Reporter: Terry Siu Assignee: Apache Spark I have a test table defined in Hive as follows: CREATE TABLE sparkbug ( id INT, event STRING ) STORED AS PARQUET; and inserted some sample data with ids 1, 2, 3. In a Spark shell, I then create a HiveContext and execute the following HQL to test out subquery predicates: val hc = new HiveContext(sc) hc.hql("select customerid from sparkbug where customerid in (select customerid from sparkbug where customerid in (2,3))") I get the following error: java.lang.RuntimeException: Unsupported language features in query: select customerid from sparkbug where customerid in (select customerid from sparkbug where customerid in (2,3)) TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME sparkbug TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_TABLE_OR_COL customerid TOK_WHERE TOK_SUBQUERY_EXPR TOK_SUBQUERY_OP in TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME sparkbug TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_TABLE_OR_COL customerid TOK_WHERE TOK_FUNCTION in TOK_TABLE_OR_COL customerid 2 3 TOK_TABLE_OR_COL customerid scala.NotImplementedError: No parse rules for ASTNode type: 817, text: TOK_SUBQUERY_EXPR : TOK_SUBQUERY_EXPR TOK_SUBQUERY_OP in TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME sparkbug TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_TABLE_OR_COL customerid TOK_WHERE TOK_FUNCTION in TOK_TABLE_OR_COL customerid 2 3 TOK_TABLE_OR_COL customerid + org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098) at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) This thread http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html also brings up the lack of subquery support in SparkSQL. It would be nice to have subquery predicate support in a near-future release (1.3, maybe?). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6608) Make DataFrame.rdd a lazy val
[ https://issues.apache.org/jira/browse/SPARK-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6608: --- Assignee: (was: Apache Spark) Make DataFrame.rdd a lazy val - Key: SPARK-6608 URL: https://issues.apache.org/jira/browse/SPARK-6608 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Priority: Minor Before 1.3.0, {{SchemaRDD.id}} worked as a unique identifier of each {{SchemaRDD}}. In 1.3.0, unlike {{SchemaRDD}}, {{DataFrame}} is no longer an RDD, and {{DataFrame.rdd}} is actually a function which always returns a new RDD instance. Making {{DataFrame.rdd}} a {{lazy val}} should bring the unique identifier back. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
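A small, user-facing sketch of what the change addresses, assuming a 1.3-style DataFrame (the table name is illustrative):
{code}
val df = sqlContext.table("some_table")     // any DataFrame

// Today rdd is a def, so each call builds a fresh RDD with a new id:
val id1 = df.rdd.id
val id2 = df.rdd.id                         // currently id1 != id2

// The proposal is to make rdd a lazy val inside DataFrame, e.g. (shape only):
//   lazy val rdd: RDD[Row] = ...build the RDD once from the executed plan...
// after which df.rdd returns the same instance and df.rdd.id is stable again.
{code}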
[jira] [Commented] (SPARK-6608) Make DataFrame.rdd a lazy val
[ https://issues.apache.org/jira/browse/SPARK-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386635#comment-14386635 ] Apache Spark commented on SPARK-6608: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/5265 Make DataFrame.rdd a lazy val - Key: SPARK-6608 URL: https://issues.apache.org/jira/browse/SPARK-6608 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Priority: Minor Before 1.3.0, {{SchemaRDD.id}} worked as a unique identifier of each {{SchemaRDD}}. In 1.3.0, unlike {{SchemaRDD}}, {{DataFrame}} is no longer an RDD, and {{DataFrame.rdd}} is actually a function which always returns a new RDD instance. Making {{DataFrame.rdd}} a {{lazy val}} should bring the unique identifier back. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6608) Make DataFrame.rdd a lazy val
[ https://issues.apache.org/jira/browse/SPARK-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reassigned SPARK-6608: - Assignee: Cheng Lian Make DataFrame.rdd a lazy val - Key: SPARK-6608 URL: https://issues.apache.org/jira/browse/SPARK-6608 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor Before 1.3.0, {{SchemaRDD.id}} worked as a unique identifier of each {{SchemaRDD}}. In 1.3.0, unlike {{SchemaRDD}}, {{DataFrame}} is no longer an RDD, and {{DataFrame.rdd}} is actually a function which always returns a new RDD instance. Making {{DataFrame.rdd}} a {{lazy val}} should bring the unique identifier back. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386636#comment-14386636 ] Kai Sasaki commented on SPARK-4036: --- [~mengxr] I wrote a design doc based on your advice. Thank you. Add Conditional Random Fields (CRF) algorithm to Spark MLlib Key: SPARK-4036 URL: https://issues.apache.org/jira/browse/SPARK-4036 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Kai Sasaki Conditional random fields (CRFs) are a class of statistical modelling methods often applied in pattern recognition and machine learning, where they are used for structured prediction. The paper: http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6528) IDF transformer
[ https://issues.apache.org/jira/browse/SPARK-6528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6528: --- Assignee: Apache Spark IDF transformer --- Key: SPARK-6528 URL: https://issues.apache.org/jira/browse/SPARK-6528 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xusen Yin Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6528) IDF transformer
[ https://issues.apache.org/jira/browse/SPARK-6528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6528: --- Assignee: (was: Apache Spark) IDF transformer --- Key: SPARK-6528 URL: https://issues.apache.org/jira/browse/SPARK-6528 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xusen Yin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6528) IDF transformer
[ https://issues.apache.org/jira/browse/SPARK-6528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386664#comment-14386664 ] Apache Spark commented on SPARK-6528: - User 'yinxusen' has created a pull request for this issue: https://github.com/apache/spark/pull/5266 IDF transformer --- Key: SPARK-6528 URL: https://issues.apache.org/jira/browse/SPARK-6528 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xusen Yin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6607) Aggregation attribute name including special chars '(' and ')' should be replaced before generating Parquet schema
[ https://issues.apache.org/jira/browse/SPARK-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6607: -- Assignee: Liang-Chi Hsieh Aggregation attribute name including special chars '(' and ')' should be replaced before generating Parquet schema -- Key: SPARK-6607 URL: https://issues.apache.org/jira/browse/SPARK-6607 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh '(' and ')' are special characters used in the Parquet schema for type annotation. When we run an aggregation query, we obtain an attribute name such as MAX(a). If we directly store the generated DataFrame as a Parquet file, it causes a failure when reading and parsing the stored schema string. Several methods can be adopted to solve this. This PR uses the simplest one: just replace attribute names before generating the Parquet schema based on these attributes. Another possible method might be modifying all aggregation expression names from func(column) to func[column]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
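A small sketch of how the problem shows up and of the usual user-side workaround of aliasing the aggregate column, assuming a 1.3-style DataFrame API (table name and paths are illustrative):
{code}
import org.apache.spark.sql.functions.max

val df = sqlContext.table("records")

// The aggregate column is named MAX(a); the parentheses then end up in the
// Parquet schema string and break parsing when the file is read back.
df.groupBy("key").agg(max("a")).saveAsParquetFile("/tmp/records_bad")

// Workaround until the fix: alias the aggregate so the name has no parentheses.
df.groupBy("key").agg(max("a").as("max_a")).saveAsParquetFile("/tmp/records_ok")
{code}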
[jira] [Created] (SPARK-6609) explicit checkpoint does not work
lisendong created SPARK-6609: Summary: explicit checkpoint does not work Key: SPARK-6609 URL: https://issues.apache.org/jira/browse/SPARK-6609 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.3.0 Reporter: lisendong Priority: Critical Fix For: 1.3.0 https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala From the code, I found that the explicit-feedback ALS does not call checkpoint(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
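For background on why checkpointing matters here: in iterative ALS the factor RDDs are rebuilt every iteration, so their lineage grows, and periodic checkpointing truncates it. A generic sketch of that pattern (not the ALS source itself; the update step below is a stand-in):
{code}
import org.apache.spark.rdd.RDD

sc.setCheckpointDir("/tmp/als-checkpoints")              // illustrative path

// Stand-in for one ALS factor-update step.
def oneStep(factors: RDD[(Int, Array[Double])]): RDD[(Int, Array[Double])] =
  factors.mapValues(v => v.map(_ * 0.9))

var factors: RDD[(Int, Array[Double])] =
  sc.parallelize((0 until 100).map(i => (i, Array.fill(10)(1.0))))

for (iter <- 1 to 30) {
  factors = oneStep(factors)
  if (iter % 10 == 0) {                                   // every N iterations
    factors.checkpoint()                                  // truncate the growing lineage
    factors.count()                                       // force materialization
  }
}
{code}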
[jira] [Created] (SPARK-6610) explicit checkpoint does not work
lisendong created SPARK-6610: Summary: explicit checkpoint does not work Key: SPARK-6610 URL: https://issues.apache.org/jira/browse/SPARK-6610 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.3.0 Reporter: lisendong Priority: Critical Fix For: 1.3.0 https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala From the code, I found that the explicit-feedback ALS does not call checkpoint(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6610) MLlib explicit ALS checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-6610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-6610: --- Target Version/s: 1.3.1 (was: 1.3.0) Fix Version/s: (was: 1.3.0) MLlib explicit ALS checkpoint does not work Key: SPARK-6610 URL: https://issues.apache.org/jira/browse/SPARK-6610 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.3.0 Reporter: lisendong Priority: Critical Labels: spark Original Estimate: 24h Remaining Estimate: 24h https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala From the code, I found that the explicit-feedback ALS does not call checkpoint(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6607) Aggregation attribute name including special chars '(' and ')' should be replaced before generating Parquet schema
[ https://issues.apache.org/jira/browse/SPARK-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6607: -- Target Version/s: 1.4.0 Affects Version/s: 1.1.1 1.2.1 1.3.0 Aggregation attribute name including special chars '(' and ')' should be replaced before generating Parquet schema -- Key: SPARK-6607 URL: https://issues.apache.org/jira/browse/SPARK-6607 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.1, 1.2.1, 1.3.0 Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh '(' and ')' are special characters used in the Parquet schema for type annotation. When we run an aggregation query, we obtain an attribute name such as MAX(a). If we directly store the generated DataFrame as a Parquet file, it causes a failure when reading and parsing the stored schema string. Several methods can be adopted to solve this. This PR uses the simplest one: just replace attribute names before generating the Parquet schema based on these attributes. Another possible method might be modifying all aggregation expression names from func(column) to func[column]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3530) Pipeline and Parameters
[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386747#comment-14386747 ] Jao Rabary commented on SPARK-3530: --- Yes, the scenario is to instantiate a pre-trained Caffe network. The problem with the broadcast is that I use a JNI binding of Caffe and Spark isn't able to serialize the object. Pipeline and Parameters --- Key: SPARK-3530 URL: https://issues.apache.org/jira/browse/SPARK-3530 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.2.0 This part of the design doc is for pipelines and parameters. I put the design doc at https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing I will copy the proposed interfaces to this JIRA later. Some sample code can be viewed at: https://github.com/mengxr/spark-ml/ Please help review the design and post your comments here. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
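A common workaround when a JNI-backed object cannot be serialized or broadcast is to construct it lazily once per executor and reference it from inside the tasks. A sketch under that assumption; the CaffeNet class and model path are hypothetical stand-ins for the actual binding:
{code}
// Hypothetical JNI-backed wrapper; not a real API.
class CaffeNet(modelPath: String) {
  def forward(input: Array[Float]): Array[Float] = input   // placeholder
}

object NetHolder {
  // Created at most once per executor JVM; never shipped with the task closure.
  lazy val net = new CaffeNet("/path/to/pretrained.caffemodel")
}

val inputRdd = sc.parallelize(Seq(Array(1.0f, 2.0f), Array(3.0f, 4.0f)))
val features = inputRdd.mapPartitions { rows =>
  val net = NetHolder.net            // resolves to the per-executor instance
  rows.map(net.forward)
}
features.count()
{code}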
[jira] [Commented] (SPARK-6517) Implement the Algorithm of Hierarchical Clustering
[ https://issues.apache.org/jira/browse/SPARK-6517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386707#comment-14386707 ] Apache Spark commented on SPARK-6517: - User 'yu-iskw' has created a pull request for this issue: https://github.com/apache/spark/pull/5267 Implement the Algorithm of Hierarchical Clustering -- Key: SPARK-6517 URL: https://issues.apache.org/jira/browse/SPARK-6517 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Yu Ishikawa -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6517) Implement the Algorithm of Hierarchical Clustering
[ https://issues.apache.org/jira/browse/SPARK-6517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6517: --- Assignee: Apache Spark Implement the Algorithm of Hierarchical Clustering -- Key: SPARK-6517 URL: https://issues.apache.org/jira/browse/SPARK-6517 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Yu Ishikawa Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6517) Implement the Algorithm of Hierarchical Clustering
[ https://issues.apache.org/jira/browse/SPARK-6517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6517: --- Assignee: (was: Apache Spark) Implement the Algorithm of Hierarchical Clustering -- Key: SPARK-6517 URL: https://issues.apache.org/jira/browse/SPARK-6517 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Yu Ishikawa -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6610) MLlib explicit ALS checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-6610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lisendong updated SPARK-6610: - Summary: MLlib explicit ALS checkpoint does not work (was: explicit checkpoint does not work) MLlib explicit ALS checkpoint does not work Key: SPARK-6610 URL: https://issues.apache.org/jira/browse/SPARK-6610 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.3.0 Reporter: lisendong Priority: Critical Labels: spark Fix For: 1.3.0 Original Estimate: 24h Remaining Estimate: 24h https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala From the code, I found that the explicit-feedback ALS does not call checkpoint(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6517) Implement the Algorithm of Hierarchical Clustering
[ https://issues.apache.org/jira/browse/SPARK-6517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386756#comment-14386756 ] Yu Ishikawa commented on SPARK-6517: Hi [~freeman-lab] and [~rnowling], would you review the PR when you have time? The basic idea has not changed. The main difference between the new one and the old one is the parallel processing of dividing clusters. The old version divides the leaf nodes of a cluster tree separately. However, the new version divides leaf clusters simultaneously and then builds a cluster tree. As a result of the modification, it is 1000x faster than the old one. Thanks Implement the Algorithm of Hierarchical Clustering -- Key: SPARK-6517 URL: https://issues.apache.org/jira/browse/SPARK-6517 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Yu Ishikawa -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6610) MLlib explicit ALS checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-6610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6610. -- Resolution: Duplicate Target Version/s: (was: 1.3.1) MLlib explicit ALS checkpoint does not work Key: SPARK-6610 URL: https://issues.apache.org/jira/browse/SPARK-6610 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.3.0 Reporter: lisendong Priority: Critical Labels: spark Original Estimate: 24h Remaining Estimate: 24h https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala From the code, I found that the explicit-feedback ALS does not call checkpoint(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3276) Provide an API to specify whether the old files need to be ignored in file input text DStream
[ https://issues.apache.org/jira/browse/SPARK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386775#comment-14386775 ] Emre Sevinç commented on SPARK-3276: Any plans to make the private val {{FileInputDStream.MIN_REMEMBER_DURATION}} configurable via some API? It seems to be hard-coded as 1 minute in https://github.com/apache/spark/blob/branch-1.2/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L325, and this leads to files older than 1 minute not being processed. Provide an API to specify whether the old files need to be ignored in file input text DStream Key: SPARK-3276 URL: https://issues.apache.org/jira/browse/SPARK-3276 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.2.0 Reporter: Jack Hu Priority: Minor Currently, there is only one API, textFileStream in StreamingContext, for creating a text file DStream, and it always ignores old files. Sometimes the old files are still useful. We need an API to let the user choose whether the old files should be ignored or not. The API currently in StreamingContext: def textFileStream(directory: String): DStream[String] = { fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
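A sketch of what making that constant configurable might look like inside FileInputDStream; the configuration key below is hypothetical, not an existing Spark property:
{code}
import org.apache.spark.streaming.Minutes

// Sketch only: read the threshold from SparkConf instead of hard-coding it.
// "spark.streaming.fileStream.minRememberDuration" is a made-up key.
private val MIN_REMEMBER_DURATION = Minutes(
  ssc.sparkContext.getConf.getLong("spark.streaming.fileStream.minRememberDuration", 1L))
{code}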
[jira] [Commented] (SPARK-6061) File source DStream cannot include old files whose timestamp is before the system time
[ https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386774#comment-14386774 ] Emre Sevinç commented on SPARK-6061: Any plans to make the private val {{FileInputDStream.MIN_REMEMBER_DURATION}} configurable via some API? It seems to be hard-coded as 1 minute in https://github.com/apache/spark/blob/branch-1.2/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L325, and this leads to files older than 1 minute not being processed. File source DStream cannot include old files whose timestamp is before the system time -- Key: SPARK-6061 URL: https://issues.apache.org/jira/browse/SPARK-6061 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Reporter: Jack Hu Labels: FileSourceDStream, OlderFiles, Streaming Original Estimate: 1m Remaining Estimate: 1m The file source DStream (StreamingContext.fileStream) has a property named newFilesOnly to include old files. It worked fine in 1.1.0 but is broken in 1.2.1: the older files are always ignored, no matter what value is set. Here is simple reproduction code: https://gist.github.com/jhu-chang/1ee5b0788c7479414eeb The reason is that the modTimeIgnoreThreshold in FileInputDStream::findNewFiles is set to a time close to the system time (Spark Streaming Clock time), so files older than this time are ignored. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6609) explicit checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-6609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6609: - Priority: Minor (was: Critical) Target Version/s: (was: 1.3.0) Fix Version/s: (was: 1.3.0) Issue Type: Improvement (was: Bug) Please don't set Fix Version as it's not resolved. This is a follow-on to SPARK-5955. It is not a critical bug since it works fine for most use cases without this. explicit checkpoint does not work - Key: SPARK-6609 URL: https://issues.apache.org/jira/browse/SPARK-6609 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.3.0 Reporter: lisendong Priority: Minor Original Estimate: 24h Remaining Estimate: 24h https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala From the code, I found that the explicit-feedback ALS does not call checkpoint(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6609) explicit checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-6609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386781#comment-14386781 ] Guoqiang Li commented on SPARK-6609: Should we merge [PR 5076|https://github.com/apache/spark/pull/5076] into branch-1.3? explicit checkpoint does not work - Key: SPARK-6609 URL: https://issues.apache.org/jira/browse/SPARK-6609 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.3.0 Reporter: lisendong Priority: Minor Original Estimate: 24h Remaining Estimate: 24h https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala from the code, I found the explicit feedback ALS does not do checkpoint() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6568) spark-shell.cmd --jars option does not accept the jar that has space in its path
[ https://issues.apache.org/jira/browse/SPARK-6568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386552#comment-14386552 ] Steve Loughran commented on SPARK-6568: --- Can you show the full stack trace? spark-shell.cmd --jars option does not accept the jar that has space in its path Key: SPARK-6568 URL: https://issues.apache.org/jira/browse/SPARK-6568 Project: Spark Issue Type: Bug Components: Spark Core, Windows Affects Versions: 1.3.0 Environment: Windows 8.1 Reporter: Masayoshi TSUZUKI spark-shell.cmd --jars option does not accept the jar that has space in its path. The path of a jar sometimes contains a space on Windows. {code} bin\spark-shell.cmd --jars C:\Program Files\some\jar1.jar {code} This results in: {code} Exception in thread "main" java.net.URISyntaxException: Illegal character in path at index 10: C:/Program Files/some/jar1.jar {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
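A small sketch of the underlying failure (not the spark-shell launcher code itself): a raw Windows path containing a space is not a legal {{java.net.URI}}, while converting through {{java.io.File}} escapes it.
{code}
import java.io.File
import java.net.URI

val raw = "C:/Program Files/some/jar1.jar"
// new URI(raw) would throw URISyntaxException: Illegal character in path at index 10
val escaped = new File(raw).toURI  // on Windows: file:/C:/Program%20Files/some/jar1.jar
println(escaped)
{code}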
[jira] [Commented] (SPARK-6282) Strange Python import error when using random() in a lambda function
[ https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386564#comment-14386564 ] Dmytro Bielievtsov commented on SPARK-6282: --- I had the same strange issue when computing rdd's via blaze interface (ContinuumIO). As strangely as it appeared, It strangely disappeared after I removed 'six*' files from '/usr/lib/python-2.7/dist-packages' and then reinstalled 'six' with easy_install-2.7 Strange Python import error when using random() in a lambda function Key: SPARK-6282 URL: https://issues.apache.org/jira/browse/SPARK-6282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: Kubuntu 14.04, Python 2.7.6 Reporter: Pavel Laskov Priority: Minor Consider the exemplary Python code below:

from random import random
from pyspark.context import SparkContext
from xval_mllib import read_csv_file_as_list

if __name__ == "__main__":
    sc = SparkContext(appName="Random() bug test")
    data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv'))
    # data = sc.parallelize([1, 2, 3, 4, 5], 2)
    d = data.map(lambda x: (random(), x))
    print d.first()

Data is read from a large CSV file. Running this code results in a Python import error: ImportError: No module named _winreg. If I use 'import random' and 'random.random()' in the lambda function no error occurs. Also no error occurs, for both kinds of import statements, for a small artificial data set like the one shown in the commented line. The full error trace, the source code of the CSV reading code (the function 'read_csv_file_as_list' is my own) as well as a sample dataset (the original dataset is about 8M large) can be provided. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2348) In Windows, having an environment variable named 'classpath' gives an error
[ https://issues.apache.org/jira/browse/SPARK-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2348. -- Resolution: Not a Problem In Windows, having an environment variable named 'classpath' gives an error --- Key: SPARK-2348 URL: https://issues.apache.org/jira/browse/SPARK-2348 Project: Spark Issue Type: Bug Components: Windows Affects Versions: 1.0.0 Environment: Windows 7 Enterprise Reporter: Chirag Todarka Assignee: Chirag Todarka Operating System: Windows 7 Enterprise If there is an environment variable named 'classpath', then starting 'spark-shell' gives the error below:

mydir\spark\bin>spark-shell
Failed to initialize compiler: object scala.runtime in compiler mirror not found.
** Note that as of 2.8 scala does not assume use of the java classpath.
** For the old behavior pass -usejavacp to scala, or if using a Settings
** object programatically, settings.usejavacp.value = true.
14/07/02 14:22:06 WARN SparkILoop$SparkILoopInterpreter: Warning: compiler accessed before init set up. Assuming no postInit code.
Failed to initialize compiler: object scala.runtime in compiler mirror not found.
** Note that as of 2.8 scala does not assume use of the java classpath.
** For the old behavior pass -usejavacp to scala, or if using a Settings
** object programatically, settings.usejavacp.value = true.
Exception in thread "main" java.lang.AssertionError: assertion failed: null
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.scala:202)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:929)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6605) Same transformation in DStream leads to different result
[ https://issues.apache.org/jira/browse/SPARK-6605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386568#comment-14386568 ] Sean Owen commented on SPARK-6605: -- Thanks, that's very useful. I think the behavior is expected, but it's not obvious. I assume you are printing the RDD from a window with no data. Both are giving the same answer in that both show no count, or 0, for every key. The second example just has an explicit 0 in two cases instead of an implicit 0 everywhere. The more expected answer is the first one -- no results. The first version gets that exactly since it re-counts the whole window, which has no data. The second one is the result of the optimization offered by invFunc. It correctly finds the count is 0 in the current window for these two keys, but it has no notion that a count of 0 is the same as no value at all. You and I know that, and you could simply apply a filter() to remove these redundant entries if desired. I'm not sure it's fixable in general without the user being able to supply a {{(V,V) => Option[V]}} or something like it as the {{invFunc}}. But it's not really getting the wrong answer either. Same transformation in DStream leads to different result Key: SPARK-6605 URL: https://issues.apache.org/jira/browse/SPARK-6605 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: SaintBacchus Fix For: 1.4.0 The transformation *reduceByKeyAndWindow* has two implementations: one uses *WindowedDStream* and the other uses *ReducedWindowedDStream*. The result is always the same, except when an empty window occurs. As a word-count example, if a period of time (larger than the window time) has no data coming in, the first *reduceByKeyAndWindow* has no elements inside, but the second has many elements with the value zero inside. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
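A sketch of the filter workaround mentioned above (the window lengths and the {{words}} DStream are stand-ins for the reporter's example, not code from Spark itself): keep the incremental {{reduceByKeyAndWindow}} but drop the explicit zero counts it emits for keys with no data in the current window.
{code}
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// Note: the invFunc variant also requires ssc.checkpoint(...) to be set at runtime.
def nonZeroCounts(words: DStream[(String, Int)]): DStream[(String, Int)] =
  words
    .reduceByKeyAndWindow(_ + _, _ - _, Seconds(30), Seconds(10))
    .filter { case (_, count) => count != 0 }  // remove the redundant (key, 0) entries
{code}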
[jira] [Commented] (SPARK-6605) Same transformation in DStream leads to different result
[ https://issues.apache.org/jira/browse/SPARK-6605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386610#comment-14386610 ] SaintBacchus commented on SPARK-6605: - Yeah, [~srowen], it's not a wrong answer but just a little different from what we expect. It's caused by the two different implementations. But I'm not sure whether we should fix it to behave like the first case or let users deal with the empty results using *filter*. If we want to fix it, setting the {{invFunc}} to {{(V,V) => Option\[V\]}} is a good idea, or adding a filter function is also OK for simplicity. Same transformation in DStream leads to different result Key: SPARK-6605 URL: https://issues.apache.org/jira/browse/SPARK-6605 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: SaintBacchus Fix For: 1.4.0 The transformation *reduceByKeyAndWindow* has two implementations: one uses *WindowedDStream* and the other uses *ReducedWindowedDStream*. The result is always the same, except when an empty window occurs. As a word-count example, if a period of time (larger than the window time) has no data coming in, the first *reduceByKeyAndWindow* has no elements inside, but the second has many elements with the value zero inside. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6603) SQLContext.registerFunction -> SQLContext.udf.register
[ https://issues.apache.org/jira/browse/SPARK-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6603: - Component/s: SQL SQLContext.registerFunction -> SQLContext.udf.register -- Key: SPARK-6603 URL: https://issues.apache.org/jira/browse/SPARK-6603 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Davies Liu We didn't change the Python implementation to use that. Maybe the best strategy is to deprecate SQLContext.registerFunction, and just add SQLContext.udf.register. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5750) Document that ordering of elements in shuffled partitions is not deterministic across runs
[ https://issues.apache.org/jira/browse/SPARK-5750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5750: - Priority: Minor (was: Major) Document that ordering of elements in shuffled partitions is not deterministic across runs -- Key: SPARK-5750 URL: https://issues.apache.org/jira/browse/SPARK-5750 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Josh Rosen Assignee: Ilya Ganelin Priority: Minor Fix For: 1.3.1, 1.4.0 The ordering of elements in shuffled partitions is not deterministic across runs. For instance, consider the following example: {code} val largeFiles = sc.textFile(...) val airlines = largeFiles.repartition(2000).cache() println(airlines.first) {code} If this code is run twice, then each run will output a different result. There is non-determinism in the shuffle read code that accounts for this: Spark's shuffle read path processes blocks as soon as they are fetched, and Spark uses [ShuffleBlockFetcherIterator|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala] to fetch shuffle data from mappers. In this code, requests for multiple blocks from the same host are batched together, so nondeterminism in where tasks are run means that the set of requests can vary across runs. In addition, there's an [explicit call|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L256] to randomize the order of the batched fetch requests. As a result, shuffle operations cannot be guaranteed to produce the same ordering of the elements in their partitions. Therefore, Spark should update its docs to clarify that the ordering of elements in shuffle RDDs' partitions is non-deterministic. Note, however, that the _set_ of elements in each partition will be deterministic: if we used {{mapPartitions}} to sort each partition, then the {{first()}} call above would produce a deterministic result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
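A sketch of the last point in the description ({{sc}} is assumed to be an existing SparkContext and the file path is made up): since the set of elements per partition is deterministic, sorting inside each partition restores a deterministic {{first()}}.
{code}
val airlines = sc.textFile("/data/airlines.csv").repartition(2000).cache()

// Sort each shuffled partition before reading from it; the ordering inside a
// partition then no longer depends on fetch order, so first() is stable.
val firstLine = airlines
  .mapPartitions(iter => iter.toArray.sorted.iterator, preservesPartitioning = true)
  .first()
println(firstLine)
{code}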
[jira] [Resolved] (SPARK-5836) Highlight in Spark documentation that by default Spark does not delete its temporary files
[ https://issues.apache.org/jira/browse/SPARK-5836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5836. -- Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Assignee: Ilya Ganelin Highlight in Spark documentation that by default Spark does not delete its temporary files -- Key: SPARK-5836 URL: https://issues.apache.org/jira/browse/SPARK-5836 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Tomasz Dudziak Assignee: Ilya Ganelin Fix For: 1.3.1, 1.4.0 We recently learnt the hard way (in a prod system) that Spark by default does not delete its temporary files until it is stopped. Within a relatively short time span of heavy Spark use, the disk of our prod machine filled up completely because of multiple shuffle files written to it. We think there should be better documentation around the fact that after a job is finished it leaves a lot of rubbish behind, so that this does not come as a surprise. Probably a good place to highlight that fact would be the documentation of the {{spark.local.dir}} property, which controls where Spark temporary files are written. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6607) Aggregation attribute name including special chars '(' and ')' should be replaced before generating Parquet schema
[ https://issues.apache.org/jira/browse/SPARK-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386560#comment-14386560 ] Apache Spark commented on SPARK-6607: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/5263 Aggregation attribute name including special chars '(' and ')' should be replaced before generating Parquet schema -- Key: SPARK-6607 URL: https://issues.apache.org/jira/browse/SPARK-6607 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh '(' and ')' are special characters used in Parquet schema for type annotation. When we run an aggregation query, we will obtain attribute name such as MAX(a). If we directly store the generated DataFrame as Parquet file, it causes failure when reading and parsing the stored schema string. Several methods can be adopted to solve this. This pr uses a simplest one to just replace attribute names before generating Parquet schema based on these attributes. Another possible method might be modifying all aggregation expression names from func(column) to func[column]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
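An illustrative workaround for the problem above (not the fix in the pull request; the column names, path, and the assumption of an existing {{DataFrame}} are all made up): alias the aggregate so the column written to Parquet contains no parentheses.
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.max

def saveMaxByKey(df: DataFrame, path: String): Unit = {
  // Without the alias the column would be named "MAX(value)", which later breaks
  // parsing of the schema string stored in the Parquet file.
  df.groupBy("key")
    .agg(max("value").as("max_value"))
    .saveAsParquetFile(path)
}
{code}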
[jira] [Assigned] (SPARK-6607) Aggregation attribute name including special chars '(' and ')' should be replaced before generating Parquet schema
[ https://issues.apache.org/jira/browse/SPARK-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6607: --- Assignee: (was: Apache Spark) Aggregation attribute name including special chars '(' and ')' should be replaced before generating Parquet schema -- Key: SPARK-6607 URL: https://issues.apache.org/jira/browse/SPARK-6607 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh '(' and ')' are special characters used in Parquet schema for type annotation. When we run an aggregation query, we will obtain attribute name such as MAX(a). If we directly store the generated DataFrame as Parquet file, it causes failure when reading and parsing the stored schema string. Several methods can be adopted to solve this. This pr uses a simplest one to just replace attribute names before generating Parquet schema based on these attributes. Another possible method might be modifying all aggregation expression names from func(column) to func[column]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6607) Aggregation attribute name including special chars '(' and ')' should be replaced before generating Parquet schema
[ https://issues.apache.org/jira/browse/SPARK-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6607: --- Assignee: Apache Spark Aggregation attribute name including special chars '(' and ')' should be replaced before generating Parquet schema -- Key: SPARK-6607 URL: https://issues.apache.org/jira/browse/SPARK-6607 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh Assignee: Apache Spark '(' and ')' are special characters used in Parquet schema for type annotation. When we run an aggregation query, we will obtain attribute name such as MAX(a). If we directly store the generated DataFrame as Parquet file, it causes failure when reading and parsing the stored schema string. Several methods can be adopted to solve this. This pr uses a simplest one to just replace attribute names before generating Parquet schema based on these attributes. Another possible method might be modifying all aggregation expression names from func(column) to func[column]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-799) Windows versions of the deploy scripts
[ https://issues.apache.org/jira/browse/SPARK-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386571#comment-14386571 ] Steve Loughran commented on SPARK-799: -- Providing Python versions of the launcher scripts is probably a better approach to supporting Windows than .cmd or .ps1 files.
# {{cmd}} is a painfully dated shell language, as you note.
# PowerShell is better, but not widely known in the java/scala dev space.
# Neither .ps1 nor .cmd files can be tested except on Windows.
Python is cross-platform-ish enough that it can be tested on Unix systems too, and is more likely to be maintained. It's not seamlessly cross-platform; propagating stdout/stderr from spawned Java processes, for example, takes some care. It also provides the option of becoming the Unix entry point (the bash script simply invoking it), so that maintenance effort is shared and testing becomes even more implicit. Windows versions of the deploy scripts -- Key: SPARK-799 URL: https://issues.apache.org/jira/browse/SPARK-799 Project: Spark Issue Type: Bug Components: Deploy, Windows Reporter: Matei Zaharia Labels: Starter Although the Spark daemons run fine on Windows with run.cmd, the deploy scripts (bin/start-all.sh and such) don't do so unless you have Cygwin. It would be nice to make .cmd versions of those. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6282) Strange Python import error when using random() in a lambda function
[ https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386564#comment-14386564 ] Dmytro Bielievtsov edited comment on SPARK-6282 at 3/30/15 11:18 AM: - I had the same strange issue when computing rdd's via blaze interface (ContinuumIO). As strangely as it appeared, It disappeared after I removed 'six*' files from '/usr/lib/python-2.7/dist-packages' and then reinstalled 'six' with easy_install-2.7 was (Author: dmytro): I had the same strange issue when computing rdd's via blaze interface (ContinuumIO). As strangely as it appeared, It strangely disappeared after I removed 'six*' files from '/usr/lib/python-2.7/dist-packages' and then reinstalled 'six' with easy_install-2.7 Strange Python import error when using random() in a lambda function Key: SPARK-6282 URL: https://issues.apache.org/jira/browse/SPARK-6282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: Kubuntu 14.04, Python 2.7.6 Reporter: Pavel Laskov Priority: Minor Consider the exemplary Python code below: from random import random from pyspark.context import SparkContext from xval_mllib import read_csv_file_as_list if __name__ == __main__: sc = SparkContext(appName=Random() bug test) data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv')) #data = sc.parallelize([1, 2, 3, 4, 5], 2) d = data.map(lambda x: (random(), x)) print d.first() Data is read from a large CSV file. Running this code results in a Python import error: ImportError: No module named _winreg If I use 'import random' and 'random.random()' in the lambda function no error occurs. Also no error occurs, for both kinds of import statements, for a small artificial data set like the one shown in a commented line. The full error trace, the source code of csv reading code (function 'read_csv_file_as_list' is my own) as well as a sample dataset (the original dataset is about 8M large) can be provided. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop
[ https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386585#comment-14386585 ] Steve Loughran commented on SPARK-2356: --- It's coming from {{UserGroupInformation.setConfiguration(conf)}}; UGI is using Hadoop's {{StringUtils}} to do something, which then inits a static variable {code} public static final Pattern ENV_VAR_PATTERN = Shell.WINDOWS ? WIN_ENV_VAR_PATTERN : SHELL_ENV_VAR_PATTERN; {code} And Hadoop's util {{Shell}} does some stuff in its constructor which depends on winutils.exe being on the path. Convoluted, but there you go. HADOOP-11293 proposes factoring out the {{Shell.WINDOWS}} code into something standalone... if that can be pushed into Hadoop 2.8 then this problem will go away from then on. Exception: Could not locate executable null\bin\winutils.exe in the Hadoop --- Key: SPARK-2356 URL: https://issues.apache.org/jira/browse/SPARK-2356 Project: Spark Issue Type: Bug Components: Windows Affects Versions: 1.0.0 Reporter: Kostiantyn Kudriavtsev Priority: Critical I'm trying to run some transformations on Spark; they work fine on a cluster (YARN, Linux machines). However, when I try to run them on a local machine (Windows 7) under a unit test, I get errors (I don't use Hadoop; I read files from the local filesystem):
{code}
14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:97)
{code}
This happens because the Hadoop config is initialized each time a Spark context is created, regardless of whether Hadoop is required or not. I propose adding a special flag to indicate whether the Hadoop config is required (or starting this configuration manually). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
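A commonly used local-test workaround for the symptom above, sketched here as an illustration only (the path is an example, and this merely satisfies the lookup rather than fixing the coupling Steve describes): point Hadoop's {{Shell}} at a directory containing {{bin\winutils.exe}} before the first SparkContext is created.
{code}
// Expects C:\hadoop\bin\winutils.exe to exist; must run before Hadoop's Shell class loads.
System.setProperty("hadoop.home.dir", "C:\\hadoop")

val sc = new org.apache.spark.SparkContext("local[2]", "winutils-workaround")
{code}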
[jira] [Commented] (SPARK-6598) Python API for IDFModel
[ https://issues.apache.org/jira/browse/SPARK-6598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386619#comment-14386619 ] Apache Spark commented on SPARK-6598: - User 'Lewuathe' has created a pull request for this issue: https://github.com/apache/spark/pull/5264 Python API for IDFModel --- Key: SPARK-6598 URL: https://issues.apache.org/jira/browse/SPARK-6598 Project: Spark Issue Type: Task Components: MLlib Affects Versions: 1.4.0 Reporter: Kai Sasaki Priority: Minor Labels: MLLib,, Python This is the sub-task of [SPARK-6254|https://issues.apache.org/jira/browse/SPARK-6254]. Wrap IDFModel {{idf}} member function for pyspark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6598) Python API for IDFModel
[ https://issues.apache.org/jira/browse/SPARK-6598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6598: --- Assignee: (was: Apache Spark) Python API for IDFModel --- Key: SPARK-6598 URL: https://issues.apache.org/jira/browse/SPARK-6598 Project: Spark Issue Type: Task Components: MLlib Affects Versions: 1.4.0 Reporter: Kai Sasaki Priority: Minor Labels: MLLib,, Python This is the sub-task of [SPARK-6254|https://issues.apache.org/jira/browse/SPARK-6254]. Wrap IDFModel {{idf}} member function for pyspark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6598) Python API for IDFModel
[ https://issues.apache.org/jira/browse/SPARK-6598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6598: --- Assignee: Apache Spark Python API for IDFModel --- Key: SPARK-6598 URL: https://issues.apache.org/jira/browse/SPARK-6598 Project: Spark Issue Type: Task Components: MLlib Affects Versions: 1.4.0 Reporter: Kai Sasaki Assignee: Apache Spark Priority: Minor Labels: MLLib,, Python This is the sub-task of [SPARK-6254|https://issues.apache.org/jira/browse/SPARK-6254]. Wrap IDFModel {{idf}} member function for pyspark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6189) Pandas to DataFrame conversion should check field names for periods
[ https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386625#comment-14386625 ] Kalle Jepsen commented on SPARK-6189: - I do not really understand why the column names have to be accessible directly as attributes anyway. What advantage does this yield over indexing? This basically restricts us to the ASCII character set for column names, doesn't it? Data in the wild may have all kinds of weird field names, including special characters, umlauts, accents and whatnot. Automatic renaming isn't very nice either, for the very reason already pointed out by mgdadv. Also, we cannot simply replace all illegal characters by underscores. The fields {{'ä.ö'}} and {{'ä.ü'}} would both be renamed to {{'___'}}. Besides, leading underscores have a somewhat special meaning in Python, potentially resulting in further confusion. I think {{df\['a.b'\]}} should definitely work, even if the columns contain non-ASCII characters, and a warning should be issued when creating the DataFrame, informing the user that direct column access via attribute name will not work with the given column names. Pandas to DataFrame conversion should check field names for periods --- Key: SPARK-6189 URL: https://issues.apache.org/jira/browse/SPARK-6189 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Issue I ran into: I imported an R dataset in CSV format into a Pandas DataFrame and then used toDF() to convert that into a Spark DataFrame. The R dataset had a column with a period in it (column GNP.deflator in the longley dataset). When I tried to select it using the Spark DataFrame DSL, I could not because the DSL thought the period was selecting a field within GNP. Also, since GNP is another field's name, it gives an error which could be obscure to users, complaining: {code} org.apache.spark.sql.AnalysisException: GetField is not valid on fields of type DoubleType; {code} We should either handle periods in column names or check during loading and warn/fail gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
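A sketch of a defensive rename on the Scala side (a workaround, not the proposed fix; it assumes the {{toDF(colNames: String*)}} overload, and note the collision caveat raised above, since distinct names can map to the same replacement):
{code}
import org.apache.spark.sql.DataFrame

// Replace '.' in column names so the DataFrame DSL no longer parses them as
// nested-field access (e.g. "GNP.deflator" becomes "GNP_deflator").
def dedot(df: DataFrame): DataFrame =
  df.toDF(df.columns.map(_.replace('.', '_')): _*)
{code}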
[jira] [Commented] (SPARK-6605) Same transformation in DStream leads to different result
[ https://issues.apache.org/jira/browse/SPARK-6605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386540#comment-14386540 ] SaintBacchus commented on SPARK-6605: - Hi [~srowen], my test code is this:
{code:title=test.scala|borderStyle=solid}
val words = ssc.socketTextStream(sIp, sPort).flatMap(_.split(" ")).map(x => (x, 1))
val resultWindow3 = words.reduceByKeyAndWindow((a: Int, b: Int) => (a + b), Seconds(winDur), Seconds(slideDur));
val resultWindow4 = words.reduceByKeyAndWindow(_ + _, _ - _, Seconds(winDur), Seconds(slideDur));
{code}
This *resultWindow3* is implemented by
{code:title=PairDStreamFunctions.scala|borderStyle=solid}
val cleanedReduceFunc = ssc.sc.clean(reduceFunc)
self.reduceByKey(cleanedReduceFunc, partitioner)
    .window(windowDuration, slideDuration)
    .reduceByKey(cleanedReduceFunc, partitioner)
{code}
And *resultWindow4* is implemented by
{code:title=PairDStreamFunctions.scala|borderStyle=solid}
val cleanedReduceFunc = ssc.sc.clean(reduceFunc)
val cleanedInvReduceFunc = ssc.sc.clean(invReduceFunc)
val cleanedFilterFunc = if (filterFunc != null) Some(ssc.sc.clean(filterFunc)) else None
new ReducedWindowedDStream[K, V](
  self, cleanedReduceFunc, cleanedInvReduceFunc, cleanedFilterFunc,
  windowDuration, slideDuration, partitioner
)
{code}
The result of this test code is:
{quote}
= resultWindow3 is:
= resultWindow4 is:
(hello,0)
(world,0)
{quote}
*resultWindow3* is empty but *resultWindow4* has two elements whose keys were received before. Same transformation in DStream leads to different result Key: SPARK-6605 URL: https://issues.apache.org/jira/browse/SPARK-6605 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: SaintBacchus Fix For: 1.4.0 The transformation *reduceByKeyAndWindow* has two implementations: one uses *WindowedDStream* and the other uses *ReducedWindowedDStream*. The result is always the same, except when an empty window occurs. As a word-count example, if a period of time (larger than the window time) has no data coming in, the first *reduceByKeyAndWindow* has no elements inside, but the second has many elements with the value zero inside. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3441) Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle
[ https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386541#comment-14386541 ] Sean Owen commented on SPARK-3441: -- This is mentioned in the change for https://github.com/apache/spark/pull/5074 but I think the work here is to explain more deeply the rationale and partitioner details in the scaladoc Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle --- Key: SPARK-3441 URL: https://issues.apache.org/jira/browse/SPARK-3441 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Reporter: Patrick Wendell Assignee: Sandy Ryza I think it would be good to say something like this in the doc for repartitionAndSortWithinPartitions and add also maybe in the doc for groupBy: {code} This can be used to enact a Hadoop Style shuffle along with a call to mapPartitions, e.g.: rdd.repartitionAndSortWithinPartitions(part).mapPartitions(...) {code} It might also be nice to add a version that doesn't take a partitioner and/or to mention this in the groupBy javadoc. I guess it depends a bit whether we consider this to be an API we want people to use more widely or whether we just consider it a narrow stable API mostly for Hive-on-Spark. If we want people to consider this API when porting workloads from Hadoop, then it might be worth documenting better. What do you think [~rxin] and [~matei]? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
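For readers less familiar with the API being documented, a small sketch of the "Hadoop-style shuffle" pattern (made-up data; {{sc}} is an existing SparkContext): partition by key, sort within each partition, then stream over each partition much as a reducer would.
{code}
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("a", 3), ("c", 4)))

val perPartition = pairs
  .repartitionAndSortWithinPartitions(new HashPartitioner(2))
  .mapPartitions { iter =>
    // within a partition, records arrive sorted by key, so equal keys are adjacent
    iter.map { case (k, v) => s"$k=$v" }
  }

perPartition.collect().foreach(println)
{code}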
[jira] [Updated] (SPARK-5836) Highlight in Spark documentation that by default Spark does not delete its temporary files
[ https://issues.apache.org/jira/browse/SPARK-5836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5836: - Priority: Minor (was: Major) Highlight in Spark documentation that by default Spark does not delete its temporary files -- Key: SPARK-5836 URL: https://issues.apache.org/jira/browse/SPARK-5836 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Tomasz Dudziak Assignee: Ilya Ganelin Priority: Minor Fix For: 1.3.1, 1.4.0 We recently learnt the hard way (in a prod system) that Spark by default does not delete its temporary files until it is stopped. Within a relatively short time span of heavy Spark use, the disk of our prod machine filled up completely because of multiple shuffle files written to it. We think there should be better documentation around the fact that after a job is finished it leaves a lot of rubbish behind, so that this does not come as a surprise. Probably a good place to highlight that fact would be the documentation of the {{spark.local.dir}} property, which controls where Spark temporary files are written. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6282) Strange Python import error when using random() in a lambda function
[ https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386564#comment-14386564 ] Dmytro Bielievtsov edited comment on SPARK-6282 at 3/30/15 11:18 AM: - I had the same strange issue when computing rdd's via blaze interface (ContinuumIO). As strangely as it appeared, It disappeared after I removed 'six*' files from '/usr/lib/python2.7/dist-packages' and then reinstalled 'six' with easy_install-2.7 was (Author: dmytro): I had the same strange issue when computing rdd's via blaze interface (ContinuumIO). As strangely as it appeared, It disappeared after I removed 'six*' files from '/usr/lib/python-2.7/dist-packages' and then reinstalled 'six' with easy_install-2.7 Strange Python import error when using random() in a lambda function Key: SPARK-6282 URL: https://issues.apache.org/jira/browse/SPARK-6282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: Kubuntu 14.04, Python 2.7.6 Reporter: Pavel Laskov Priority: Minor Consider the exemplary Python code below: from random import random from pyspark.context import SparkContext from xval_mllib import read_csv_file_as_list if __name__ == __main__: sc = SparkContext(appName=Random() bug test) data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv')) #data = sc.parallelize([1, 2, 3, 4, 5], 2) d = data.map(lambda x: (random(), x)) print d.first() Data is read from a large CSV file. Running this code results in a Python import error: ImportError: No module named _winreg If I use 'import random' and 'random.random()' in the lambda function no error occurs. Also no error occurs, for both kinds of import statements, for a small artificial data set like the one shown in a commented line. The full error trace, the source code of csv reading code (function 'read_csv_file_as_list' is my own) as well as a sample dataset (the original dataset is about 8M large) can be provided. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4226) SparkSQL - Add support for subqueries in predicates
[ https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4226: --- Assignee: (was: Apache Spark) SparkSQL - Add support for subqueries in predicates --- Key: SPARK-4226 URL: https://issues.apache.org/jira/browse/SPARK-4226 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Environment: Spark 1.2 snapshot Reporter: Terry Siu I have a test table defined in Hive as follows: CREATE TABLE sparkbug ( id INT, event STRING ) STORED AS PARQUET; and insert some sample data with ids 1, 2, 3. In a Spark shell, I then create a HiveContext and then execute the following HQL to test out subquery predicates: val hc = HiveContext(hc) hc.hql(select customerid from sparkbug where customerid in (select customerid from sparkbug where customerid in (2,3))) I get the following error: java.lang.RuntimeException: Unsupported language features in query: select customerid from sparkbug where customerid in (select customerid from sparkbug where customerid in (2,3)) TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME sparkbug TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_TABLE_OR_COL customerid TOK_WHERE TOK_SUBQUERY_EXPR TOK_SUBQUERY_OP in TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME sparkbug TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_TABLE_OR_COL customerid TOK_WHERE TOK_FUNCTION in TOK_TABLE_OR_COL customerid 2 3 TOK_TABLE_OR_COL customerid scala.NotImplementedError: No parse rules for ASTNode type: 817, text: TOK_SUBQUERY_EXPR : TOK_SUBQUERY_EXPR TOK_SUBQUERY_OP in TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME sparkbug TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_TABLE_OR_COL customerid TOK_WHERE TOK_FUNCTION in TOK_TABLE_OR_COL customerid 2 3 TOK_TABLE_OR_COL customerid + org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098) at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) This thread http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html also brings up lack of subquery support in SparkSQL. It would be nice to have subquery predicate support in a near, future release (1.3, maybe?). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6608) Make DataFrame.rdd a lazy val
Cheng Lian created SPARK-6608: - Summary: Make DataFrame.rdd a lazy val Key: SPARK-6608 URL: https://issues.apache.org/jira/browse/SPARK-6608 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Priority: Minor Before 1.3.0, {{SchemaRDD.id}} worked as a unique identifier of each {{SchemaRDD}}. In 1.3.0, unlike {{SchemaRDD}}, {{DataFrame}} is no longer an RDD, and {{DataFrame.rdd}} is actually a function which always returns a new RDD instance. Making {{DataFrame.rdd}} a {{lazy val}} should bring the unique identifier back. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
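A tiny illustration of the observation above ({{sqlContext}} is assumed to be an existing SQLContext; the data is made up):
{code}
val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "value")

// Today each call to .rdd builds a fresh RDD, so the id differs between calls;
// with a lazy val this comparison would be true and usable as an identifier.
println(df.rdd.id == df.rdd.id)
{code}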
[jira] [Commented] (SPARK-5456) Decimal Type comparison issue
[ https://issues.apache.org/jira/browse/SPARK-5456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386628#comment-14386628 ] Karthik Gorthi commented on SPARK-5456: --- One workaround we followed is to convert all Decimal datatype columns to integer (truncated). But this is not possible when the application needs to connect to third-party databases where the datatype is obviously not under our control. So this is a serious bug, IMHO, which prohibits using Spark in such cases, unless there is a workaround? Decimal Type comparison issue - Key: SPARK-5456 URL: https://issues.apache.org/jira/browse/SPARK-5456 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Kuldeep Not quite able to figure this out, but here is a JUnit test to reproduce this, in JavaAPISuite.java:
{code:title=DecimalBug.java}
@Test
public void decimalQueryTest() {
    List<Row> decimalTable = new ArrayList<Row>();
    decimalTable.add(RowFactory.create(new BigDecimal(1), new BigDecimal(2)));
    decimalTable.add(RowFactory.create(new BigDecimal(3), new BigDecimal(4)));
    JavaRDD<Row> rows = sc.parallelize(decimalTable);
    List<StructField> fields = new ArrayList<StructField>(7);
    fields.add(DataTypes.createStructField("a", DataTypes.createDecimalType(), true));
    fields.add(DataTypes.createStructField("b", DataTypes.createDecimalType(), true));
    sqlContext.applySchema(rows.rdd(), DataTypes.createStructType(fields)).registerTempTable("foo");
    Assert.assertEquals(sqlContext.sql("select * from foo where a > 0").collectAsList(), decimalTable);
}
{code}
Fails with java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.spark.sql.types.Decimal -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6596) fix the instruction on building scaladoc
[ https://issues.apache.org/jira/browse/SPARK-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6596. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5253 [https://github.com/apache/spark/pull/5253] fix the instruction on building scaladoc - Key: SPARK-6596 URL: https://issues.apache.org/jira/browse/SPARK-6596 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Nan Zhu Fix For: 1.4.0 In README.md under docs/ directory, it says that You can build just the Spark scaladoc by running build/sbt doc from the SPARK_PROJECT_ROOT directory. I guess the right approach is build/sbt unidoc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6596) fix the instruction on building scaladoc
[ https://issues.apache.org/jira/browse/SPARK-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6596: - Priority: Trivial (was: Major) Assignee: Nan Zhu fix the instruction on building scaladoc - Key: SPARK-6596 URL: https://issues.apache.org/jira/browse/SPARK-6596 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Nan Zhu Assignee: Nan Zhu Priority: Trivial Fix For: 1.4.0 In README.md under docs/ directory, it says that You can build just the Spark scaladoc by running build/sbt doc from the SPARK_PROJECT_ROOT directory. I guess the right approach is build/sbt unidoc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6595) DataFrame self joins with MetastoreRelations fail
[ https://issues.apache.org/jira/browse/SPARK-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6595. --- Resolution: Fixed Fixed by https://github.com/apache/spark/pull/5251 DataFrame self joins with MetastoreRelations fail - Key: SPARK-6595 URL: https://issues.apache.org/jira/browse/SPARK-6595 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6609) explicit checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-6609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6609. -- Resolution: Invalid [~lisendong] Actually, you were not even pointing at the 1.3 branch. SPARK-5955 did indeed already resolve this. In the future, you probably want to look at master, or at least at the branch you say is affected. explicit checkpoint does not work - Key: SPARK-6609 URL: https://issues.apache.org/jira/browse/SPARK-6609 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.3.0 Reporter: lisendong Priority: Minor Original Estimate: 24h Remaining Estimate: 24h https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala from the code, I found the explicit feedback ALS does not do checkpoint() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6602) Replace direct use of Akka with Spark RPC interface
[ https://issues.apache.org/jira/browse/SPARK-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6602: --- Assignee: Shixiong Zhu (was: Apache Spark) Replace direct use of Akka with Spark RPC interface --- Key: SPARK-6602 URL: https://issues.apache.org/jira/browse/SPARK-6602 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6602) Replace direct use of Akka with Spark RPC interface
[ https://issues.apache.org/jira/browse/SPARK-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386875#comment-14386875 ] Apache Spark commented on SPARK-6602: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/5268 Replace direct use of Akka with Spark RPC interface --- Key: SPARK-6602 URL: https://issues.apache.org/jira/browse/SPARK-6602 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6602) Replace direct use of Akka with Spark RPC interface
[ https://issues.apache.org/jira/browse/SPARK-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6602: --- Assignee: Apache Spark (was: Shixiong Zhu) Replace direct use of Akka with Spark RPC interface --- Key: SPARK-6602 URL: https://issues.apache.org/jira/browse/SPARK-6602 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4123) Show dependency changes in pull requests
[ https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386888#comment-14386888 ] Apache Spark commented on SPARK-4123: - User 'brennonyork' has created a pull request for this issue: https://github.com/apache/spark/pull/5269 Show dependency changes in pull requests Key: SPARK-4123 URL: https://issues.apache.org/jira/browse/SPARK-4123 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Patrick Wendell Assignee: Brennon York Priority: Critical Fix For: 1.4.0 We should inspect the classpath of Spark's assembly jar for every pull request. This only takes a few seconds in Maven and it will help weed out dependency changes from the master branch. Ideally we'd post any dependency changes in the pull request message.
{code}
$ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v INFO | tr ":" "\n" | awk -F/ '{print $NF}' | sort > my-classpath
$ git checkout apache/master
$ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v INFO | tr ":" "\n" | awk -F/ '{print $NF}' | sort > master-classpath
$ diff my-classpath master-classpath
< chill-java-0.3.6.jar
< chill_2.10-0.3.6.jar
---
> chill-java-0.5.0.jar
> chill_2.10-0.5.0.jar
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5990) Model import/export for IsotonicRegression
[ https://issues.apache.org/jira/browse/SPARK-5990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386936#comment-14386936 ] Yanbo Liang commented on SPARK-5990: [~josephkb] Could you assign this to me? Model import/export for IsotonicRegression -- Key: SPARK-5990 URL: https://issues.apache.org/jira/browse/SPARK-5990 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Add save/load for IsotonicRegressionModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4226) SparkSQL - Add support for subqueries in predicates
[ https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-4226: -- Description: I have a test table defined in Hive as follows: {code:sql} CREATE TABLE sparkbug ( id INT, event STRING ) STORED AS PARQUET; {code} and insert some sample data with ids 1, 2, 3. In a Spark shell, I then create a HiveContext and then execute the following HQL to test out subquery predicates: {code} val hc = HiveContext(hc) hc.hql(select customerid from sparkbug where customerid in (select customerid from sparkbug where customerid in (2,3))) {code} I get the following error: {noformat} java.lang.RuntimeException: Unsupported language features in query: select customerid from sparkbug where customerid in (select customerid from sparkbug where customerid in (2,3)) TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME sparkbug TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_TABLE_OR_COL customerid TOK_WHERE TOK_SUBQUERY_EXPR TOK_SUBQUERY_OP in TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME sparkbug TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_TABLE_OR_COL customerid TOK_WHERE TOK_FUNCTION in TOK_TABLE_OR_COL customerid 2 3 TOK_TABLE_OR_COL customerid scala.NotImplementedError: No parse rules for ASTNode type: 817, text: TOK_SUBQUERY_EXPR : TOK_SUBQUERY_EXPR TOK_SUBQUERY_OP in TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME sparkbug TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_TABLE_OR_COL customerid TOK_WHERE TOK_FUNCTION in TOK_TABLE_OR_COL customerid 2 3 TOK_TABLE_OR_COL customerid + org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098) at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) {noformat} [This thread|http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html] also brings up lack of subquery support in SparkSQL. It would be nice to have subquery predicate support in a near, future release (1.3, maybe?). was: I have a test table defined in Hive as follows: CREATE TABLE sparkbug ( id INT, event STRING ) STORED AS PARQUET; and insert some sample data with ids 1, 2, 3. 
In a Spark shell, I then create a HiveContext and then execute the following HQL to test out subquery predicates: val hc = HiveContext(hc) hc.hql(select customerid from sparkbug where customerid in (select customerid from sparkbug where customerid in (2,3))) I get the following error: java.lang.RuntimeException: Unsupported language features in query: select customerid from sparkbug where customerid in (select customerid from sparkbug where customerid in (2,3)) TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME sparkbug TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_TABLE_OR_COL customerid TOK_WHERE TOK_SUBQUERY_EXPR TOK_SUBQUERY_OP in TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME sparkbug TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_TABLE_OR_COL customerid TOK_WHERE TOK_FUNCTION in TOK_TABLE_OR_COL customerid 2 3 TOK_TABLE_OR_COL customerid scala.NotImplementedError: No parse rules for ASTNode type: 817, text: TOK_SUBQUERY_EXPR : TOK_SUBQUERY_EXPR TOK_SUBQUERY_OP in TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME sparkbug TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_TABLE_OR_COL customerid TOK_WHERE TOK_FUNCTION in TOK_TABLE_OR_COL
[jira] [Assigned] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2883: --- Assignee: (was: Apache Spark) Spark Support for ORCFile format Key: SPARK-2883 URL: https://issues.apache.org/jira/browse/SPARK-2883 Project: Spark Issue Type: Bug Components: Input/Output, SQL Reporter: Zhan Zhang Priority: Blocker Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 pm jobtracker.png, orc.diff Verify the support of OrcInputFormat in spark, fix issues if exists and add documentation of its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2883: --- Assignee: Apache Spark Spark Support for ORCFile format Key: SPARK-2883 URL: https://issues.apache.org/jira/browse/SPARK-2883 Project: Spark Issue Type: Bug Components: Input/Output, SQL Reporter: Zhan Zhang Assignee: Apache Spark Priority: Blocker Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 pm jobtracker.png, orc.diff Verify the support of OrcInputFormat in Spark, fix issues if any exist, and add documentation of its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6258) Python MLlib API missing items: Clustering
[ https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387115#comment-14387115 ] Joseph K. Bradley commented on SPARK-6258: -- [~hrishikesh], glad to hear you're interested! I'd recommend picking off one of these tasks. I just created another JIRA for part of this task which should be a good one to start with: [SPARK-6612] Does that sound good? Also, please check out this guide; we try to follow these guidelines closely: [https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark] If you have implementation questions, we can discuss them on GitHub after you send a PR. Thanks! Python MLlib API missing items: Clustering -- Key: SPARK-6258 URL: https://issues.apache.org/jira/browse/SPARK-6258 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between the Python and Scala documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task.
KMeans
* setEpsilon
* setInitializationSteps
KMeansModel
* computeCost
* k
GaussianMixture
* setInitialModel
GaussianMixtureModel
* k
Completely missing items which should be fixed in separate JIRAs (which have been created and linked to the umbrella JIRA):
* LDA
* PowerIterationClustering
* StreamingKMeans
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6403) Launch master as spot instance on EC2
[ https://issues.apache.org/jira/browse/SPARK-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6403: --- Assignee: (was: Apache Spark) Launch master as spot instance on EC2 - Key: SPARK-6403 URL: https://issues.apache.org/jira/browse/SPARK-6403 Project: Spark Issue Type: New Feature Components: EC2 Affects Versions: 1.2.1 Reporter: Adam Vogel Priority: Minor Currently the spark_ec2.py script only supports requesting slaves as spot instances. Launching the master as a spot instance has potential cost savings, at the risk of losing the Spark cluster without warning. Unless users include logic for relaunching slaves when lost, it is usually the case that all slaves are lost simultaneously. Thus, for jobs which do not require resilience to losing spot instances, being able to launch the master as a spot instance saves money. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6611) Add support for INTEGER as synonym of INT to DDLParser
[ https://issues.apache.org/jira/browse/SPARK-6611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6611: --- Assignee: Apache Spark Add support for INTEGER as synonym of INT to DDLParser -- Key: SPARK-6611 URL: https://issues.apache.org/jira/browse/SPARK-6611 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola Assignee: Apache Spark Priority: Minor Add support for INTEGER as synonym of INT to DDLParser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6611) Add support for INTEGER as synonym of INT to DDLParser
[ https://issues.apache.org/jira/browse/SPARK-6611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6611: --- Assignee: (was: Apache Spark) Add support for INTEGER as synonym of INT to DDLParser -- Key: SPARK-6611 URL: https://issues.apache.org/jira/browse/SPARK-6611 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola Priority: Minor Add support for INTEGER as synonym of INT to DDLParser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6611) Add support for INTEGER as synonym of INT to DDLParser
[ https://issues.apache.org/jira/browse/SPARK-6611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386945#comment-14386945 ] Apache Spark commented on SPARK-6611: - User 'smola' has created a pull request for this issue: https://github.com/apache/spark/pull/5271 Add support for INTEGER as synonym of INT to DDLParser -- Key: SPARK-6611 URL: https://issues.apache.org/jira/browse/SPARK-6611 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola Priority: Minor Add support for INTEGER as synonym of INT to DDLParser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
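For illustration, the DDLParser handles data-source table definitions with an explicit column list, so this is the kind of statement that currently accepts INT but not INTEGER; the table name, columns, and path below are invented:
{code}
// Data-source DDL handled by DDLParser. With this change, INTEGER in the
// column list would be accepted as a synonym for INT.
sqlContext.sql("""
  CREATE TEMPORARY TABLE users (id INTEGER, name STRING)
  USING org.apache.spark.sql.json
  OPTIONS (path 'users.json')
""")
{code}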
[jira] [Assigned] (SPARK-6403) Launch master as spot instance on EC2
[ https://issues.apache.org/jira/browse/SPARK-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6403: --- Assignee: Apache Spark Launch master as spot instance on EC2 - Key: SPARK-6403 URL: https://issues.apache.org/jira/browse/SPARK-6403 Project: Spark Issue Type: New Feature Components: EC2 Affects Versions: 1.2.1 Reporter: Adam Vogel Assignee: Apache Spark Priority: Minor Currently the spark_ec2.py script only supports requesting slaves as spot instances. Launching the master as a spot instance has potential cost savings, at the risk of losing the Spark cluster without warning. Unless users include logic for relaunching slaves when lost, it is usually the case that all slaves are lost simultaneously. Thus, for jobs which do not require resilience to losing spot instances, being able to launch the master as a spot instance saves money. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6239) Spark MLlib fpm#FPGrowth minSupport should use long instead
[ https://issues.apache.org/jira/browse/SPARK-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386933#comment-14386933 ] Sean Owen commented on SPARK-6239: -- Just the little API overhead for even littler gain, IMHO. I'm open to other committers' opinions, but it just didn't seem worth changing. I would imagine a relative value is more often useful. Spark MLlib fpm#FPGrowth minSupport should use long instead --- Key: SPARK-6239 URL: https://issues.apache.org/jira/browse/SPARK-6239 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Littlestar Priority: Minor Spark MLlib fpm#FPGrowth minSupport should use a long (an absolute count) instead; the implementation currently computes val minCount = math.ceil(minSupport * count).toLong, because: 1. the number of records in the dataset (count) is not known before the data is read; 2. minSupport is a double, so its precision is limited. From Mahout's FPGrowthDriver.java: addOption("minSupport", "s", "(Optional) The minimum number of times a co-occurrence must be present. Default Value: 3", "3"); I just want to set minCount=2 for a test. Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
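With the current API, an absolute count can still be expressed by converting it into the relative threshold that FPGrowth expects. A small sketch, assuming transactions is an RDD[Array[String]] of baskets:
{code}
import org.apache.spark.mllib.fpm.FPGrowth

// Translate a desired absolute support count into the relative minSupport
// that FPGrowth expects (a fraction of the total number of transactions).
// Note: minCount = ceil(minSupport * count), so floating-point rounding is
// exactly the precision concern raised in the ticket.
val desiredMinCount = 2L
val minSupport = desiredMinCount.toDouble / transactions.count()

val model = new FPGrowth()
  .setMinSupport(minSupport)
  .setNumPartitions(10)
  .run(transactions)

model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}
{code}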
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387085#comment-14387085 ] Joseph K. Bradley commented on SPARK-5564: -- [~debasish83] I mainly used a Wikipedia dataset. Here's an S3 bucket (requester pays) which [~sparks] created: [s3://files.sparks.requester.pays/enwiki_category_text/], which holds a big Wikipedia dataset. I'm not sure if it's the same one I used, but it should be qualitatively similar. Mine had ~1.1 billion tokens, with about 1 million documents and 1 million terms (vocab size). As far as scaling goes, the EM code scaled linearly with the number of topics K. Communication was the bottleneck for sizable datasets, and it scales linearly with K. The largest K I've run with on that dataset was K=100; that was using 16 r3.2xlarge workers. Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors' concentration parameters be >= 1.0. It should support values > 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
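For context, these are the knobs involved. A sketch of the 1.3 EM-based API, where the sub-1.0 concentration values are exactly what this ticket would allow (today they are rejected); corpus is assumed to be an RDD[(Long, Vector)] of document IDs and term-count vectors:
{code}
import org.apache.spark.mllib.clustering.LDA

// EM-based LDA. Today both concentration parameters must be >= 1.0; this
// JIRA proposes allowing any value > 0.0 (e.g. 0.1 below) to get sparser
// topic (phi) and document-topic (theta) distributions.
val lda = new LDA()
  .setK(100)
  .setMaxIterations(50)
  .setDocConcentration(0.1)    // proposed: < 1.0 for sparser theta
  .setTopicConcentration(0.1)  // proposed: < 1.0 for sparser phi

val ldaModel = lda.run(corpus)
println(s"trained ${ldaModel.k} topics over ${ldaModel.vocabSize} terms")
{code}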
[jira] [Created] (SPARK-6611) Add support for INTEGER as synonym of INT to DDLParser
Santiago M. Mola created SPARK-6611: --- Summary: Add support for INTEGER as synonym of INT to DDLParser Key: SPARK-6611 URL: https://issues.apache.org/jira/browse/SPARK-6611 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola Priority: Minor Add support for INTEGER as synonym of INT to DDLParser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6603) SQLContext.registerFunction - SQLContext.udf.register
[ https://issues.apache.org/jira/browse/SPARK-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387105#comment-14387105 ] Reynold Xin commented on SPARK-6603: How about not deprecating registerFunction and having both? That way, Scala users migrating to Python won't be as confused. SQLContext.registerFunction - SQLContext.udf.register -- Key: SPARK-6603 URL: https://issues.apache.org/jira/browse/SPARK-6603 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Davies Liu We didn't change the Python implementation to use that. Maybe the best strategy is to deprecate SQLContext.registerFunction, and just add SQLContext.udf.register. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
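For reference, the Scala side already exposes the newer form in 1.3; a minimal sketch of registering and using a UDF through sqlContext.udf.register (strLen and the people table are made-up names):
{code}
// Register a UDF under the name "strLen" and call it from SQL.
// sqlContext is the usual SQLContext/HiveContext available in the shell.
sqlContext.udf.register("strLen", (s: String) => s.length)

// Assumes a registered table called "people" with a string column "name".
sqlContext.sql("SELECT name, strLen(name) FROM people").show()
{code}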
[jira] [Created] (SPARK-6612) Python KMeans parity
Joseph K. Bradley created SPARK-6612: Summary: Python KMeans parity Key: SPARK-6612 URL: https://issues.apache.org/jira/browse/SPARK-6612 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor This is a subtask of [SPARK-6258] for the Python API of KMeans. These items are missing:
KMeans
* setEpsilon
* setInitializationSteps
KMeansModel
* computeCost
* k
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
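These are the existing Scala MLlib calls that the Python wrappers would need to mirror; a quick sketch, assuming data is an RDD[Vector] of features:
{code}
import org.apache.spark.mllib.clustering.KMeans

// Scala API the Python side would expose: setEpsilon and
// setInitializationSteps on KMeans, computeCost and k on KMeansModel.
val kmeans = new KMeans()
  .setK(3)
  .setMaxIterations(20)
  .setEpsilon(1e-4)            // convergence tolerance, missing in Python
  .setInitializationSteps(5)   // k-means|| init steps, missing in Python

val model = kmeans.run(data)
println(s"k = ${model.k}, cost = ${model.computeCost(data)}")
{code}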
[jira] [Assigned] (SPARK-5990) Model import/export for IsotonicRegression
[ https://issues.apache.org/jira/browse/SPARK-5990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5990: --- Assignee: Apache Spark Model import/export for IsotonicRegression -- Key: SPARK-5990 URL: https://issues.apache.org/jira/browse/SPARK-5990 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Apache Spark Add save/load for IsotonicRegressionModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
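Other MLlib models in 1.3 follow the Saveable/Loader pattern, which is presumably what this sub-task would add for IsotonicRegressionModel. A hypothetical sketch of what that would look like once save/load exist (trainingData and the path are made up; the save/load calls are the proposed API, not something in 1.3.0):
{code}
import org.apache.spark.mllib.regression.{IsotonicRegression, IsotonicRegressionModel}

// trainingData: RDD[(Double, Double, Double)] of (label, feature, weight)
val model = new IsotonicRegression().setIsotonic(true).run(trainingData)

// Hypothetical API following the Saveable/Loader pattern used by other
// MLlib models; this is what the JIRA would add.
model.save(sc, "/tmp/isotonic-model")
val sameModel = IsotonicRegressionModel.load(sc, "/tmp/isotonic-model")
println(sameModel.predict(0.5))
{code}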
[jira] [Assigned] (SPARK-5990) Model import/export for IsotonicRegression
[ https://issues.apache.org/jira/browse/SPARK-5990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5990: --- Assignee: (was: Apache Spark) Model import/export for IsotonicRegression -- Key: SPARK-5990 URL: https://issues.apache.org/jira/browse/SPARK-5990 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Add save/load for IsotonicRegressionModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org