[jira] [Created] (SPARK-6599) Add Kinesis Direct API
Tathagata Das created SPARK-6599:
Summary: Add Kinesis Direct API
Key: SPARK-6599
URL: https://issues.apache.org/jira/browse/SPARK-6599
Project: Spark
Issue Type: Improvement
Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6369) InsertIntoHiveTable and Parquet Relation should use logic from SparkHadoopWriter
[ https://issues.apache.org/jira/browse/SPARK-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6369:
Summary: InsertIntoHiveTable and Parquet Relation should use logic from SparkHadoopWriter (was: InsertIntoHiveTable should use logic from SparkHadoopWriter)

InsertIntoHiveTable and Parquet Relation should use logic from SparkHadoopWriter
Key: SPARK-6369
URL: https://issues.apache.org/jira/browse/SPARK-6369
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Blocker

Right now it is possible that we will corrupt the output if there is a race between competing speculative tasks.
[jira] [Closed] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh closed SPARK-6586.
Resolution: Not a Problem

Add the capability of retrieving original logical plan of DataFrame
Key: SPARK-6586
URL: https://issues.apache.org/jira/browse/SPARK-6586
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor

In order to solve a bug, since #5217, {{DataFrame}} now uses the analyzed plan instead of the logical plan. However, by doing that we can't know the logical plan of a {{DataFrame}}, and it might still be useful and important to retrieve the original logical plan in some use cases. In this PR, we introduce the capability of retrieving the original logical plan of a {{DataFrame}}. The approach is to add an {{analyzed}} variable to {{LogicalPlan}}. Once {{Analyzer}} finishes analysis, it sets {{analyzed}} of the {{LogicalPlan}} to {{true}}. In {{QueryExecution}}, we keep the original logical plan in the analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to recursively replace the analyzed logical plan with the original logical plan and retrieve it. Besides enabling retrieval of the original logical plan, this modification also avoids re-analyzing a plan that is already analyzed.
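The mechanism the description proposes can be sketched abstractly. This is my own Python illustration with hypothetical names (Plan, analyze, original_plan), not the actual Catalyst code: each analyzed node keeps a pointer to the plan it was derived from, the analyzer skips plans that are already analyzed, and originalPlan follows the pointer back.

```python
class Plan:
    """A toy query-plan node standing in for Catalyst's LogicalPlan."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.analyzed = False   # the proposed flag on LogicalPlan
        self.original = None    # the pre-analysis plan, set by the analyzer


def analyze(plan):
    # Idempotent: if the plan is already analyzed, skip re-analysis
    # (the side benefit mentioned in the issue description).
    if plan.analyzed:
        return plan
    resolved = Plan("Resolved(" + plan.name + ")",
                    [analyze(c) for c in plan.children])
    resolved.analyzed = True
    resolved.original = plan    # keep the original logical plan around
    return resolved


def original_plan(plan):
    # Recursively swap an analyzed plan back for the plan it came from.
    return plan.original if plan.analyzed else plan
```

The design choice debated in the thread is visible here: the `analyzed`/`original` fields are mutable state threaded through every plan node, which is the complexity the reviewers push back on.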
[jira] [Commented] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386209#comment-14386209 ] Liang-Chi Hsieh commented on SPARK-6586:
I have no problem with your opinion. But if that is true, we don't need to keep the logical plan in QueryExecution now. I am closing this, thanks.

Add the capability of retrieving original logical plan of DataFrame
Key: SPARK-6586
URL: https://issues.apache.org/jira/browse/SPARK-6586
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor
[jira] [Assigned] (SPARK-5203) union with different decimal type report error
[ https://issues.apache.org/jira/browse/SPARK-5203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5203:
Assignee: (was: Apache Spark)

union with different decimal type report error
Key: SPARK-5203
URL: https://issues.apache.org/jira/browse/SPARK-5203
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: guowei

Test case like this:
{code:sql}
create table test (a decimal(10,1));
select a from test union all select a*2 from test;
{code}
Exception thrown:
{noformat}
15/01/12 16:28:54 ERROR SparkSQLDriver: Failed in [select a from test union all select a*2 from test]
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree:
'Project [*]
 'Subquery _u1
  'Union
   Project [a#1]
    MetastoreRelation default, test, None
   Project [CAST((CAST(a#2, DecimalType()) * CAST(CAST(2, DecimalType(10,0)), DecimalType())), DecimalType(21,1)) AS _c0#0]
    MetastoreRelation default, test, None
	at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:85)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:83)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:81)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
	at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
	at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
	at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
	at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:410)
	at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:410)
	at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:411)
	at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:411)
	at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:412)
	at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:412)
	at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417)
	at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415)
	at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421)
	at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421)
	at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:369)
	at org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:58)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
{noformat}
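For reference on the types in the plan above: the DecimalType(21,1) inferred for a*2 is consistent with the usual Hive-style result-type rule for decimal multiplication. A small sketch of that arithmetic (my own illustration, not Spark code):

```python
def decimal_multiply_type(p1, s1, p2, s2):
    # Hive-style result type for decimal multiplication:
    # scales add, and precision is p1 + p2 + 1 to leave room for the carry.
    return (p1 + p2 + 1, s1 + s2)

# In the failing plan, a is decimal(10,1) and the literal 2 is cast to
# decimal(10,0), so a * 2 comes out as decimal(21,1) -- the DecimalType(21,1)
# shown in the unresolved Project. The union then has to reconcile
# decimal(10,1) with decimal(21,1), which is where resolution fails.
```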
[jira] [Commented] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386068#comment-14386068 ] Liang-Chi Hsieh commented on SPARK-6586:
Even just for debugging purposes, I think it is important to provide a way to access the logical plan. The mutable state this adds is limited to LogicalPlan. If mutable state is bad here, I think I can refactor this to a version without mutable state.

Add the capability of retrieving original logical plan of DataFrame
Key: SPARK-6586
URL: https://issues.apache.org/jira/browse/SPARK-6586
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor
[jira] [Updated] (SPARK-6601) Add HDFS NFS gateway module to spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian Verhein updated SPARK-6601:
Description: Add module hdfs-nfs-gateway, which sets up the gateway for (say, ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes. Note: For NFS to be available outside AWS, this also requires [#6600] (was: Add module hdfs-nfs-gateway, which sets up the gateway for (say, ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes. Note: For NFS to be available outside AWS, this also requires #6600)

Add HDFS NFS gateway module to spark-ec2
Key: SPARK-6601
URL: https://issues.apache.org/jira/browse/SPARK-6601
Project: Spark
Issue Type: New Feature
Components: EC2
Reporter: Florian Verhein
[jira] [Created] (SPARK-6601) Add HDFS NFS gateway module to spark-ec2
Florian Verhein created SPARK-6601:
Summary: Add HDFS NFS gateway module to spark-ec2
Key: SPARK-6601
URL: https://issues.apache.org/jira/browse/SPARK-6601
Project: Spark
Issue Type: New Feature
Components: EC2
Reporter: Florian Verhein

Add module hdfs-nfs-gateway, which sets up the gateway for (say, ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes. Note: For NFS to be available outside AWS, this also requires #6600
[jira] [Assigned] (SPARK-5203) union with different decimal type report error
[ https://issues.apache.org/jira/browse/SPARK-5203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5203:
Assignee: Apache Spark

union with different decimal type report error
Key: SPARK-5203
URL: https://issues.apache.org/jira/browse/SPARK-5203
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: guowei
Assignee: Apache Spark
[jira] [Created] (SPARK-6598) Python API for IDFModel
Kai Sasaki created SPARK-6598:
Summary: Python API for IDFModel
Key: SPARK-6598
URL: https://issues.apache.org/jira/browse/SPARK-6598
Project: Spark
Issue Type: Task
Components: MLlib
Affects Versions: 1.4.0
Reporter: Kai Sasaki
Priority: Minor

This is a sub-task of [SPARK-6254|https://issues.apache.org/jira/browse/SPARK-6254]. Wrap the IDFModel {{idf}} member function for pyspark.
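As background for the wrapper, the smoothed IDF weighting computed by MLlib's IDFModel can be sketched in plain Python. idf_vector is a hypothetical helper name of mine; the real {{idf}} member returns a vector computed on the JVM side, and this sketch only illustrates the formula.

```python
import math

def idf_vector(doc_freqs, num_docs):
    # MLlib-style smoothed inverse document frequency:
    #   idf(t) = log((m + 1) / (df(t) + 1))
    # where m is the number of documents and df(t) the number of
    # documents containing term t. The +1 smoothing keeps terms that
    # appear in every document at weight 0 instead of dividing by zero
    # for unseen terms.
    return [math.log((num_docs + 1) / (df + 1)) for df in doc_freqs]
```

A term occurring in all documents gets weight log(1) = 0, so it carries no discriminative signal; rarer terms get larger weights.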
[jira] [Commented] (SPARK-6594) Spark Streaming can't receive data from kafka
[ https://issues.apache.org/jira/browse/SPARK-6594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386071#comment-14386071 ] q79969786 commented on SPARK-6594:
1. Create the kafka topic as follows:
{noformat}
$ kafka-topics.sh --create --zookeeper zkServer1:2181,zkServer2:2181,zkServer3:2181 --replication-factor 1 --partitions 5 --topic ORDER
{noformat}
2. I use the Java API to process data as follows:
{code}
SparkConf sparkConf = new SparkConf().setAppName("TestOrder");
sparkConf.set("spark.cleaner.ttl", "600");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(1000));
Map<String, Integer> topicorder = new HashMap<String, Integer>();
topicorder.put("order", 5);
JavaPairReceiverInputDStream<String, String> jPRIDSOrder = KafkaUtils.createStream(jssc,
    "zkServer1:2181,zkServer2:2181,zkServer3:2181", "test-consumer-group", topicorder);
jPRIDSOrder.map(new Function<Tuple2<String, String>, String>() {
  @Override
  public String call(Tuple2<String, String> tuple2) {
    return tuple2._2();
  }
}).print();
{code}
3. Submit the application as follows:
{noformat}
spark-submit --class com.bigdata.TestOrder --master spark://SPKMASTER:19002 /home/bigdata/test-spark.jar TestOrder
{noformat}
4. It shows five warnings like the following when the application is submitted:
{noformat}
15/03/29 21:23:03 WARN ZookeeperConsumerConnector: [test-consumer-group_work1-1427462582342-5714642d], No broker partitions consumed by consumer thread test-consumer-group_work1-1427462582342-5714642d-0 for topic ORDER
{noformat}
..
Spark Streaming can't receive data from kafka
Key: SPARK-6594
URL: https://issues.apache.org/jira/browse/SPARK-6594
Project: Spark
Issue Type: Bug
Affects Versions: 1.2.1
Environment: kafka_2.10-0.8.1.1 + Spark-1.2.1
Reporter: q79969786

I use KafkaUtils to receive data from Kafka in my Spark Streaming application as follows:
{code}
Map<String, Integer> topicorder = new HashMap<String, Integer>();
topicorder.put("order", Integer.valueOf(readThread));
JavaPairReceiverInputDStream<String, String> jPRIDSOrder = KafkaUtils.createStream(jssc, zkQuorum, group, topicorder);
{code}
It worked well at first, but after I submitted this application several times, Spark Streaming can't receive data anymore (Kafka itself works well).
[jira] [Issue Comment Deleted] (SPARK-6594) Spark Streaming can't receive data from kafka
[ https://issues.apache.org/jira/browse/SPARK-6594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] q79969786 updated SPARK-6594:
Comment: was deleted (was: the reproduction steps quoted in the comment above)

Spark Streaming can't receive data from kafka
Key: SPARK-6594
URL: https://issues.apache.org/jira/browse/SPARK-6594
Project: Spark
Issue Type: Bug
Affects Versions: 1.2.1
Environment: kafka_2.10-0.8.1.1 + Spark-1.2.1
Reporter: q79969786
[jira] [Commented] (SPARK-6566) Update Spark to use the latest version of Parquet libraries
[ https://issues.apache.org/jira/browse/SPARK-6566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386170#comment-14386170 ] Konstantin Shaposhnikov commented on SPARK-6566:
Thank you for the update [~lian cheng]

Update Spark to use the latest version of Parquet libraries
Key: SPARK-6566
URL: https://issues.apache.org/jira/browse/SPARK-6566
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.3.0
Reporter: Konstantin Shaposhnikov

There are a lot of bug fixes in the latest version of parquet (1.6.0rc7), e.g. PARQUET-136. It would be good to update Spark to use the latest parquet version. The following changes are required:
{code}
diff --git a/pom.xml b/pom.xml
index 5ad39a9..095b519 100644
--- a/pom.xml
+++ b/pom.xml
@@ -132,7 +132,7 @@
     <!-- Version used for internal directory structure -->
     <hive.version.short>0.13.1</hive.version.short>
     <derby.version>10.10.1.1</derby.version>
-    <parquet.version>1.6.0rc3</parquet.version>
+    <parquet.version>1.6.0rc7</parquet.version>
     <jblas.version>1.2.3</jblas.version>
     <jetty.version>8.1.14.v20131031</jetty.version>
     <orbit.version>3.0.0.v201112011016</orbit.version>
{code}
and
{code}
--- a/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
@@ -480,7 +480,7 @@ private[parquet] class FilteringParquetRowInputFormat
     globalMetaData = new GlobalMetaData(globalMetaData.getSchema,
       mergedMetadata, globalMetaData.getCreatedBy)
-    val readContext = getReadSupport(configuration).init(
+    val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init(
       new InitContext(configuration, globalMetaData.getKeyValueMetaData, globalMetaData.getSchema))
{code}
I am happy to prepare a pull request if necessary.
[jira] [Commented] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386167#comment-14386167 ] Michael Armbrust commented on SPARK-6586:
I can see the utility of seeing the original plan in any given run of the optimizer. However, providing it for any arbitrary assembly of query plans feels like unnecessary complexity to me. I think it's only reasonable to add such instrumentation when it is actually useful for solving an issue. Doing so speculatively only leads to code complexity. If you have a concrete example where this information would be useful, we can continue to discuss; otherwise this issue should be closed. Additionally, PRs that add new features should *always* have tests. Otherwise these features will be broken almost immediately.

Add the capability of retrieving original logical plan of DataFrame
Key: SPARK-6586
URL: https://issues.apache.org/jira/browse/SPARK-6586
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor
[jira] [Commented] (SPARK-6261) Python MLlib API missing items: Feature
[ https://issues.apache.org/jira/browse/SPARK-6261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386036#comment-14386036 ] Kai Sasaki commented on SPARK-6261:
[~josephkb] I created a JIRA for IDFModel here: [SPARK-6598|https://issues.apache.org/jira/browse/SPARK-6598]. Thank you!

Python MLlib API missing items: Feature
Key: SPARK-6261
URL: https://issues.apache.org/jira/browse/SPARK-6261
Project: Spark
Issue Type: Sub-task
Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task.

StandardScalerModel
* All functionality except predict() is missing.

IDFModel
* idf

Word2Vec
* setMinCount

Word2VecModel
* getVectors
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386049#comment-14386049 ] Debasish Das commented on SPARK-5564:
[~josephkb] could you please point me to the datasets that are used for benchmarking? I have started testing log-likelihood loss for recommendation, and since I already added the constraints, this is the right time to test it on LDA benchmarks as well. I will open up the code as part of https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears it. I am looking into the LDA test cases, but since I am optimizing log-likelihood directly, I am looking to add more test cases from your LDA JIRA. For recommendation, I know how to construct the test cases.

Support sparse LDA solutions
Key: SPARK-5564
URL: https://issues.apache.org/jira/browse/SPARK-5564
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

Latent Dirichlet Allocation (LDA) currently requires that the priors' concentration parameters be > 1.0. It should support values > 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. 2014.
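A minimal sketch of the kind of M-step projection the issue description calls for, under my own assumption (not taken from Spark or the cited paper's text here) that the familiar Dirichlet MAP update max(0, n_k + alpha - 1) is used: with a concentration parameter alpha < 1, weakly supported entries are clipped to zero before renormalization, which is exactly what makes the estimated topic and document-topic distributions sparse.

```python
def project_topic_distribution(counts, alpha):
    # MAP estimate of a multinomial under a symmetric Dirichlet(alpha) prior:
    #   theta_k proportional to max(0, n_k + alpha - 1).
    # For alpha < 1 the correction (alpha - 1) is negative, so entries with
    # little support are clipped to exactly zero -> a sparse distribution.
    raw = [max(0.0, n + alpha - 1.0) for n in counts]
    total = sum(raw)
    if total == 0.0:
        # Degenerate case (all entries clipped): fall back to uniform.
        return [1.0 / len(counts)] * len(counts)
    return [r / total for r in raw]
```

With alpha = 1.0 the projection is a no-op (plain normalization); the interesting regime the issue asks for is 0 < alpha < 1.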
[jira] [Updated] (SPARK-6600) Open ports in spark-ec2.py to allow HDFS NFS gateway
[ https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian Verhein updated SPARK-6600: --- Description: Use case: User has set up the hadoop hdfs nfs gateway service on their spark-ec2.py launched cluster, and wants to mount that on their local machine. Requires the following ports to be opened on incoming rule set for MASTER for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works) Note that this issue *does not* cover the implementation of a hdfs nfs gateway module in the spark-ec2 project. That should be a separate issue (TODO). Reference: https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html was: Use case: User has set up the hadoop hdfs nfs gateway service on their spark-ec2.py launched cluster, and wants to mount that on their local machine. Requires the following ports to be opened on incoming rule set for MASTER for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works) Note that this issue *does not* cover the implementation of a hdfs nfs gateway module in the spark-ec2 project. That should be a separate issue (TODO). Reference: https://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html Open ports in spark-ec2.py to allow HDFS NFS gateway -- Key: SPARK-6600 URL: https://issues.apache.org/jira/browse/SPARK-6600 Project: Spark Issue Type: New Feature Components: EC2 Reporter: Florian Verhein Use case: User has set up the hadoop hdfs nfs gateway service on their spark-ec2.py launched cluster, and wants to mount that on their local machine. Requires the following ports to be opened on incoming rule set for MASTER for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works) Note that this issue *does not* cover the implementation of a hdfs nfs gateway module in the spark-ec2 project. That should be a separate issue (TODO). 
Reference: https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html
[jira] [Updated] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway
[ https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian Verhein updated SPARK-6600:
Summary: Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway (was: Open ports in spark-ec2.py to allow HDFS NFS gateway)

Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway
Key: SPARK-6600
URL: https://issues.apache.org/jira/browse/SPARK-6600
Project: Spark
Issue Type: New Feature
Components: EC2
Reporter: Florian Verhein
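The three ports listed in SPARK-6600 can be captured as data for the master security-group setup. A sketch in Python: nfs_gateway_rules is a hypothetical helper of mine, and the (protocol, from_port, to_port) triples merely mirror the shape of rules an EC2 security-group authorize call consumes; this is not the actual spark_ec2.py code.

```python
# NFS gateway ports from the issue: portmapper/rpcbind (111), nfsd (2049),
# and mountd (4242), each needed over both TCP and UDP on the master.
NFS_GATEWAY_PORTS = [111, 2049, 4242]

def nfs_gateway_rules():
    # One (protocol, from_port, to_port) rule per port and protocol,
    # i.e. six rules in total for the master's incoming rule set.
    return [(proto, port, port)
            for port in NFS_GATEWAY_PORTS
            for proto in ("tcp", "udp")]
```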
[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386049#comment-14386049 ] Debasish Das edited comment on SPARK-5564 at 3/30/15 12:31 AM: --- [~josephkb] could you please point me to the datasets that are used for benchmarking LDA, and how they scale as we increase the number of topics? I have started testing log-likelihood loss for recommendation, and since I already added the constraints, this is the right time to test it on LDA benchmarks as well...I will open up the code as part of https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears it... I am looking into LDA test cases, but since I am optimizing log-likelihood directly, I am looking to add more test cases based on the document and word matrix...For recommendation, I know how to construct the test cases with log-likelihood loss was (Author: debasish83): [~josephkb] could you please point me to the datasets that are used for benchmarking? I have started testing log-likelihood loss for recommendation, and since I already added the constraints, this is the right time to test it on LDA benchmarks as well...I will open up the code as part of https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears it... I am looking into LDA test cases, but since I am optimizing log-likelihood directly, I am looking to add more test cases based on the document and word matrix...For recommendation, I know how to construct the test cases with log-likelihood loss Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors' concentration parameters be > 1.0. It should support values > 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). 
For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
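The projection mentioned above, mapping an unconstrained M-step update back onto the probability simplex, can be sketched outside Spark. This is a generic Duchi-et-al-style Euclidean simplex projection, not MLlib code; the function name is invented for illustration, and the cited paper may use a different projection variant:

```python
def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex.

    Returns w >= 0 with sum(w) == 1 minimizing ||w - v||_2.
    """
    u = sorted(v, reverse=True)
    # Running cumulative sums of the sorted values.
    css = []
    total = 0.0
    for x in u:
        total += x
        css.append(total)
    # Largest index j with u[j] - (css[j] - 1) / (j + 1) > 0.
    rho = max(j for j in range(len(v)) if u[j] - (css[j] - 1.0) / (j + 1) > 0)
    theta = (css[rho] - 1.0) / (rho + 1)
    # Shift and clip at zero; this is what produces exact sparsity.
    return [max(x - theta, 0.0) for x in v]
```

Per-row application of such a projection after the EM update is what would let phi/theta entries become exactly zero rather than merely small.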
[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386049#comment-14386049 ] Debasish Das edited comment on SPARK-5564 at 3/30/15 12:30 AM: --- [~josephkb] could you please point me to the datasets that are used for benchmarking? I have started testing log-likelihood loss for recommendation, and since I already added the constraints, this is the right time to test it on LDA benchmarks as well...I will open up the code as part of https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears it... I am looking into LDA test cases, but since I am optimizing log-likelihood directly, I am looking to add more test cases based on the document and word matrix...For recommendation, I know how to construct the test cases with log-likelihood loss was (Author: debasish83): [~josephkb] could you please point me to the datasets that are used for benchmarking? I have started testing log-likelihood loss for recommendation, and since I already added the constraints, this is the right time to test it on LDA benchmarks as well...I will open up the code as part of https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears it... I am looking into LDA test cases, but since I am optimizing log-likelihood directly, I am looking to add more test cases from your LDA JIRA...For recommendation, I know how to construct the test cases... Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors' concentration parameters be > 1.0. It should support values > 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway
[ https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian Verhein updated SPARK-6600: --- Description: Use case: User has set up the hadoop hdfs nfs gateway service on their spark_ec2.py launched cluster, and wants to mount that on their local machine. Requires the following ports to be opened on incoming rule set for MASTER for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works) Note that this issue *does not* cover the implementation of a hdfs nfs gateway module in the spark-ec2 project. See [#6601] for this. Reference: https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html was: Use case: User has set up the hadoop hdfs nfs gateway service on their spark_ec2.py launched cluster, and wants to mount that on their local machine. Requires the following ports to be opened on incoming rule set for MASTER for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works) Note that this issue *does not* cover the implementation of a hdfs nfs gateway module in the spark-ec2 project. That should be a separate issue (TODO). Reference: https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway -- Key: SPARK-6600 URL: https://issues.apache.org/jira/browse/SPARK-6600 Project: Spark Issue Type: New Feature Components: EC2 Reporter: Florian Verhein Use case: User has set up the hadoop hdfs nfs gateway service on their spark_ec2.py launched cluster, and wants to mount that on their local machine. Requires the following ports to be opened on incoming rule set for MASTER for both UDP and TCP: 111, 2049, 4242. (I have tried this and it works) Note that this issue *does not* cover the implementation of a hdfs nfs gateway module in the spark-ec2 project. See [#6601] for this. 
Reference: https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6601) Add HDFS NFS gateway module to spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florian Verhein updated SPARK-6601: --- Description: Add module hdfs-nfs-gateway, which sets up the gateway for (say, ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes. Note: For nfs to be available outside AWS, also requires #6600 was: Add module hdfs-nfs-gateway, which sets up the gateway for (say, ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes. Note: For nfs to be available outside AWS, also requires [#6600] Add HDFS NFS gateway module to spark-ec2 Key: SPARK-6601 URL: https://issues.apache.org/jira/browse/SPARK-6601 Project: Spark Issue Type: New Feature Components: EC2 Reporter: Florian Verhein Add module hdfs-nfs-gateway, which sets up the gateway for (say, ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes. Note: For nfs to be available outside AWS, also requires #6600 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6119) DataFrame.dropna support
[ https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385642#comment-14385642 ] Apache Spark commented on SPARK-6119: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/5248 DataFrame.dropna support Key: SPARK-6119 URL: https://issues.apache.org/jira/browse/SPARK-6119 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: DataFrame Support dropping rows with null values (dropna). Similar to Pandas' dropna http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.dropna.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
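As a sketch of the requested semantics (this is not the eventual Spark API; the row and column names are invented), pandas-style dropna filtering over rows with nulls looks like:

```python
# Rows as dicts; None stands in for a SQL null.
rows = [
    {"name": "alice", "age": 30},
    {"name": "bob", "age": None},
    {"name": None, "age": None},
]

# how="any" (the pandas default): drop a row if ANY column is null.
drop_any = [r for r in rows if all(v is not None for v in r.values())]

# how="all": drop a row only if EVERY column is null.
drop_all = [r for r in rows if any(v is not None for v in r.values())]
```

pandas additionally supports `subset=` (only consider some columns) and `thresh=` (keep rows with at least N non-null values), which a DataFrame.dropna for Spark would presumably want to mirror.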
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385648#comment-14385648 ] Kannan Rajah commented on SPARK-1529: - I have pushed the first round of commits to my repo. I would like to get some early feedback on the overall design. https://github.com/rkannan82/spark/commits/dfs_shuffle Commits: https://github.com/rkannan82/spark/commit/ce8b430512b31e932ffdab6e0a2c1a6a1768ffbf https://github.com/rkannan82/spark/commit/8f5415c248c0a9ca5ad3ec9f48f839b24c259813 https://github.com/rkannan82/spark/commit/d9d179ba6c685cc8eb181f442e9bd6ad91cc4290 Support setting spark.local.dirs to a hadoop FileSystem Key: SPARK-1529 URL: https://issues.apache.org/jira/browse/SPARK-1529 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Kannan Rajah Attachments: Spark Shuffle using HDFS.pdf In some environments, like with MapR, local volumes are accessed through the Hadoop filesystem interface. We should allow setting spark.local.dir to a Hadoop filesystem location. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385706#comment-14385706 ] Sean Owen commented on SPARK-6593: -- At this level, though, what's a bad split? A line of text that doesn't parse as expected? That's application-level logic. Given how little the framework knows, this would amount to ignoring a partition if there was any error in computing it, which seems too coarse to encourage people to use. You can of course handle this in the application logic -- catch the error, return nothing, log it, add to a counter, etc. Provide option for HadoopRDD to skip bad data splits. - Key: SPARK-6593 URL: https://issues.apache.org/jira/browse/SPARK-6593 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Dale Richardson Priority: Minor When reading a large number of files from HDFS, e.g. with sc.textFile("hdfs:///user/cloudera/logs*.gz"), if a single split is corrupted then the entire job is canceled. As default behaviour this is probably for the best, but it would be nice, in circumstances where you know it will be OK, to have the option to skip the corrupted portion and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
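Sean's suggestion, handling bad records in application logic rather than in the framework, can be sketched like this (plain Python for illustration; in PySpark the same guarded parse would sit inside a flatMap over the RDD, and the record format and names here are invented):

```python
def parse(line):
    """Parse 'id,value'; raises on malformed input."""
    ident, value = line.split(",", 1)
    return int(ident), value

lines = ["1,foo", "not-a-record", "2,bar"]

parsed = []
for line in lines:
    try:
        parsed.append(parse(line))
    except ValueError:
        # Bad record: drop it instead of failing the whole job.
        # In a real job you might log it or bump an accumulator here.
        pass
```

In Spark this is typically written as `rdd.flatMap(safe_parse)` where `safe_parse` returns a one-element list on success and an empty list on failure, so corrupt lines vanish without canceling the job.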
[jira] [Resolved] (SPARK-4123) Show dependency changes in pull requests
[ https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4123. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5093 [https://github.com/apache/spark/pull/5093] Show dependency changes in pull requests Key: SPARK-4123 URL: https://issues.apache.org/jira/browse/SPARK-4123 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Patrick Wendell Assignee: Brennon York Priority: Critical Fix For: 1.4.0 We should inspect the classpath of Spark's assembly jar for every pull request. This only takes a few seconds in Maven and it will help weed out dependency changes from the master branch. Ideally we'd post any dependency changes in the pull request message.
{code}
$ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v INFO | tr ':' '\n' | awk -F/ '{print $NF}' | sort > my-classpath
$ git checkout apache/master
$ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v INFO | tr ':' '\n' | awk -F/ '{print $NF}' | sort > master-classpath
$ diff my-classpath master-classpath
< chill-java-0.3.6.jar
< chill_2.10-0.3.6.jar
---
> chill-java-0.5.0.jar
> chill_2.10-0.5.0.jar
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6406) Launcher backward compatibility issues
[ https://issues.apache.org/jira/browse/SPARK-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6406. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5085 [https://github.com/apache/spark/pull/5085] Launcher backward compatibility issues -- Key: SPARK-6406 URL: https://issues.apache.org/jira/browse/SPARK-6406 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Nishkam Ravi Priority: Minor Fix For: 1.4.0 The new launcher library breaks backward compatibility. The "hadoop" string in the Spark assembly's name should not be mandatory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6406) Launcher backward compatibility issues
[ https://issues.apache.org/jira/browse/SPARK-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6406: - Assignee: Nishkam Ravi Launcher backward compatibility issues -- Key: SPARK-6406 URL: https://issues.apache.org/jira/browse/SPARK-6406 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Nishkam Ravi Assignee: Nishkam Ravi Priority: Minor Fix For: 1.4.0 The new launcher library breaks backward compatibility. The "hadoop" string in the Spark assembly's name should not be mandatory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6593) Provide option for HadoopRDD to skip corrupted files
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dale Richardson updated SPARK-6593: --- Description: When reading a large number of gzip files from HDFS, e.g. with sc.textFile("hdfs:///user/cloudera/logs*.gz"), if the hadoop input libraries report an exception then the entire job is canceled. As default behaviour this is probably for the best, but it would be nice, in circumstances where you know it will be OK, to have the option to skip the corrupted file and continue the job. was: When reading a large number of files from HDFS, e.g. with sc.textFile("hdfs:///user/cloudera/logs*.gz"), if the hadoop input libraries report an exception then the entire job is canceled. As default behaviour this is probably for the best, but it would be nice, in circumstances where you know it will be OK, to have the option to skip the corrupted file and continue the job. Provide option for HadoopRDD to skip corrupted files Key: SPARK-6593 URL: https://issues.apache.org/jira/browse/SPARK-6593 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Dale Richardson Priority: Minor When reading a large number of gzip files from HDFS, e.g. with sc.textFile("hdfs:///user/cloudera/logs*.gz"), if the hadoop input libraries report an exception then the entire job is canceled. As default behaviour this is probably for the best, but it would be nice, in circumstances where you know it will be OK, to have the option to skip the corrupted file and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5124. Resolution: Fixed Fix Version/s: 1.4.0 Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Fix For: 1.4.0 Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6594) Spark Streaming can't receive data from kafka
q79969786 created SPARK-6594: Summary: Spark Streaming can't receive data from kafka Key: SPARK-6594 URL: https://issues.apache.org/jira/browse/SPARK-6594 Project: Spark Issue Type: Bug Affects Versions: 1.2.1 Environment: kafka_2.10-0.8.1.1 + Spark-1.2.1 Reporter: q79969786 I use KafkaUtils to receive data from Kafka in my Spark Streaming application as follows:
{code}
Map<String, Integer> topicorder = new HashMap<String, Integer>();
topicorder.put("order", Integer.valueOf(readThread));
JavaPairReceiverInputDStream<String, String> jPRIDSOrder =
    KafkaUtils.createStream(jssc, zkQuorum, group, topicorder);
{code}
It worked well at first, but after I submitted this application several times, Spark Streaming can't receive data anymore (Kafka works well). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6594) Spark Streaming can't receive data from kafka
[ https://issues.apache.org/jira/browse/SPARK-6594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385743#comment-14385743 ] Sean Owen commented on SPARK-6594: -- There's no useful detail here. Can you elaborate on what exactly you are running and what you observe? What is the input, what are the states of the topics, and what information leads you to believe streaming is not reading? It's certainly not true that they don't work in general: I am successfully using this exact combination now and have had no problems. Spark Streaming can't receive data from kafka - Key: SPARK-6594 URL: https://issues.apache.org/jira/browse/SPARK-6594 Project: Spark Issue Type: Bug Affects Versions: 1.2.1 Environment: kafka_2.10-0.8.1.1 + Spark-1.2.1 Reporter: q79969786 I use KafkaUtils to receive data from Kafka in my Spark Streaming application as follows:
{code}
Map<String, Integer> topicorder = new HashMap<String, Integer>();
topicorder.put("order", Integer.valueOf(readThread));
JavaPairReceiverInputDStream<String, String> jPRIDSOrder =
    KafkaUtils.createStream(jssc, zkQuorum, group, topicorder);
{code}
It worked well at first, but after I submitted this application several times, Spark Streaming can't receive data anymore (Kafka works well). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385890#comment-14385890 ] Steve Loughran commented on SPARK-1537: ---
# I've just tried to see where YARN-2444 stands; I can't replicate it in trunk, but I've submitted the tests to verify that it isn't there.
# For YARN-2423, Spark seems kind of trapped. It needs an API tagged as public/stable; Robert's patch has the API, except it's being rejected on the basis that ATS v2 will break it, so it can't be tagged as stable. So there's no API for GET operations until some undefined time {{t1 > now()}}, and then only for Hadoop versions that include it. Which implies it won't get picked up by Spark for a long time. I think we need to talk to the YARN dev team and see what can be done here. Even if there's no API client bundled into YARN, unless the v1 API and its paths beginning with {{/ws/v1/timeline/}} are going to go away, a REST client is possible; it may just have to be done Spark-side, where at least it can be made resilient to Hadoop versions.
Add integration with Yarn's Application Timeline Server --- Key: SPARK-1537 URL: https://issues.apache.org/jira/browse/SPARK-1537 Project: Spark Issue Type: New Feature Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Attachments: SPARK-1537.txt, spark-1573.patch It would be nice to have Spark integrate with Yarn's Application Timeline Server (see YARN-321, YARN-1530). This would allow users running Spark on Yarn to have a single place to go for all their history needs, and avoid having to manage a separate service (Spark's built-in server). At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, although there is still some ongoing work. But the basics are there, and I wouldn't expect them to change (much) at this point. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6586. - Resolution: Not a Problem You can already get the original plan from a DataFrame: {{df.queryExecution.logical}}. Add the capability of retrieving original logical plan of DataFrame --- Key: SPARK-6586 URL: https://issues.apache.org/jira/browse/SPARK-6586 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor In order to solve a bug, since #5217, {{DataFrame}} now uses the analyzed plan instead of the logical plan. However, by doing that we can't know the logical plan of a {{DataFrame}}, and it might still be useful and important to retrieve the original logical plan in some use cases. In this PR, we introduce the capability of retrieving the original logical plan of a {{DataFrame}}. The approach is to add an {{analyzed}} variable to {{LogicalPlan}}. Once {{Analyzer}} finishes analysis, it sets {{analyzed}} of the {{LogicalPlan}} to {{true}}. In {{QueryExecution}}, we keep the original logical plan in the analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added that recursively replaces the analyzed logical plan with the original logical plan and retrieves it. Besides the capability of retrieving the original logical plan, this modification also avoids re-analyzing a plan that is already analyzed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6548) Adding stddev to DataFrame functions
[ https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6548: Target Version/s: 1.4.0 Adding stddev to DataFrame functions Key: SPARK-6548 URL: https://issues.apache.org/jira/browse/SPARK-6548 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Labels: DataFrame, starter Add it to the list of aggregate functions: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala Also add it to https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala We can either add a Stddev Catalyst expression, or just compute it using existing functions like here: https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
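The "compute it using existing functions" route relies on the identity stddev = sqrt(E[x^2] - E[x]^2), i.e. stddev can be derived from the avg of x and of x*x, both of which DataFrame aggregates already support. A plain-Python sanity check of that identity (not Spark code):

```python
import math

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(xs)

mean = sum(xs) / n                      # avg(x)
mean_sq = sum(x * x for x in xs) / n    # avg(x * x)

# Population stddev: sqrt(E[x^2] - E[x]^2).
population_stddev = math.sqrt(mean_sq - mean * mean)

# Sample stddev applies Bessel's correction (divide by n - 1).
sample_stddev = math.sqrt((mean_sq - mean * mean) * n / (n - 1))
```

Whichever route is taken, the implementation would need to pick (or expose) one of the two conventions; SQL engines differ on whether STDDEV means the sample or population variant.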
[jira] [Updated] (SPARK-6592) API of Row trait should be presented in Scala doc
[ https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6592: Target Version/s: 1.4.0 API of Row trait should be presented in Scala doc - Key: SPARK-6592 URL: https://issues.apache.org/jira/browse/SPARK-6592 Project: Spark Issue Type: Bug Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Nan Zhu Priority: Critical Currently, the API of the Row class is not presented in Scaladoc, though we have many chances to use it. The reason is that we ignore all files under the catalyst directory in SparkBuild.scala when generating Scaladoc (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369). What's the best approach to fix this? [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6592) API of Row trait should be presented in Scala doc
[ https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6592: Priority: Critical (was: Major) API of Row trait should be presented in Scala doc - Key: SPARK-6592 URL: https://issues.apache.org/jira/browse/SPARK-6592 Project: Spark Issue Type: Bug Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Nan Zhu Priority: Critical Currently, the API of the Row class is not presented in Scaladoc, though we have many chances to use it. The reason is that we ignore all files under the catalyst directory in SparkBuild.scala when generating Scaladoc (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369). What's the best approach to fix this? [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message
[ https://issues.apache.org/jira/browse/SPARK-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6587: -- Description: (Don't know if this is a functionality bug, error reporting bug or an RFE ...) I define the following hierarchy:
{code}
private abstract class MyHolder
private case class StringHolder(s: String) extends MyHolder
private case class IntHolder(i: Int) extends MyHolder
private case class BooleanHolder(b: Boolean) extends MyHolder
{code}
and a top level case class:
{code}
private case class Thing(key: Integer, foo: MyHolder)
{code}
When I try to convert it:
{code}
val things = Seq(
  Thing(1, IntHolder(42)),
  Thing(2, StringHolder("hello")),
  Thing(3, BooleanHolder(false))
)
val thingsDF = sc.parallelize(things, 4).toDF()
thingsDF.registerTempTable("things")
val all = sqlContext.sql("SELECT * from things")
{code}
I get the following stack trace:
{noformat}
Exception in thread "main" scala.MatchError: sql.CaseClassSchemaProblem.MyHolder (of class scala.reflect.internal.Types$ClassNoArgsTypeRef)
	at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112)
	at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
	at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159)
	at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157)
	at scala.collection.immutable.List.map(List.scala:276)
	at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157)
	at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
	at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107)
	at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
	at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312)
	at org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250)
	at sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35)
	at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
{noformat}
I wrote this to answer [a question on StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql] which uses a much simpler approach and suffers the same problem. Looking at what seems to me to be the [relevant unit test suite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/ScalaReflectionRelationSuite.scala] I see that this case is not covered.

was: (Don't know if this is a functionality bug, error reporting bug or an RFE ...) I define the following hierarchy:
{code}
private abstract class MyHolder
private case class StringHolder(s: String) extends MyHolder
private case class IntHolder(i: Int) extends MyHolder
private case class BooleanHolder(b: Boolean) extends MyHolder
{code}
and a top level case class:
{code}
private case class Thing(key: Integer, foo: MyHolder)
{code}
When I try to convert it:
{code}
val things = Seq(
  Thing(1, IntHolder(42)),
  Thing(2, StringHolder("hello")),
  Thing(3, BooleanHolder(false))
)
val thingsDF = sc.parallelize(things, 4).toDF()
thingsDF.registerTempTable("things")
val all = sqlContext.sql("SELECT * from things")
{code}
I get the following stack trace:
{quote}
Exception in thread "main" scala.MatchError: sql.CaseClassSchemaProblem.MyHolder (of class scala.reflect.internal.Types$ClassNoArgsTypeRef)
	at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112)
	at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
	at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159)
	at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157)
	at scala.collection.immutable.List.map(List.scala:276)
	at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157)
	at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
	at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107)
	at
[jira] [Commented] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message
[ https://issues.apache.org/jira/browse/SPARK-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385695#comment-14385695 ] Cheng Lian commented on SPARK-6587: --- This behavior is expected. There are two problems in your case: # Because {{things}} contains instances of all three case classes, the type of {{things}} is {{Seq[MyHolder]}}. Since {{MyHolder}} doesn't extend {{Product}}, it can't be recognized by {{ScalaReflection}}. # You can only use a single concrete case class {{T}} when converting {{RDD[T]}} or {{Seq[T]}} to a DataFrame. For {{things}}, we can't figure out what data type the {{foo}} field in the reflected schema should have. Inferring schema for case class hierarchy fails with mysterious message --- Key: SPARK-6587 URL: https://issues.apache.org/jira/browse/SPARK-6587 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: At least Windows 8, Scala 2.11.2. Reporter: Spiro Michaylov (Don't know if this is a functionality bug, error reporting bug or an RFE ...) 
I define the following hierarchy: {code} private abstract class MyHolder private case class StringHolder(s: String) extends MyHolder private case class IntHolder(i: Int) extends MyHolder private case class BooleanHolder(b: Boolean) extends MyHolder {code} and a top level case class: {code} private case class Thing(key: Integer, foo: MyHolder) {code} When I try to convert it: {code} val things = Seq( Thing(1, IntHolder(42)), Thing(2, StringHolder("hello")), Thing(3, BooleanHolder(false)) ) val thingsDF = sc.parallelize(things, 4).toDF() thingsDF.registerTempTable("things") val all = sqlContext.sql("SELECT * FROM things") {code} I get the following stack trace: {noformat} Exception in thread "main" scala.MatchError: sql.CaseClassSchemaProblem.MyHolder (of class scala.reflect.internal.Types$ClassNoArgsTypeRef) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157) at scala.collection.immutable.List.map(List.scala:276) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312) at org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250) at sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35) at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) {noformat} I wrote this to answer [a question on StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql] which uses a much simpler approach and suffers the same problem. Looking at what seems to me to be the [relevant unit test suite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/ScalaReflectionRelationSuite.scala] I see that this case is not covered. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
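As the comment on SPARK-6587 explains, schema inference needs one concrete record type, not a sealed hierarchy. A common workaround is to flatten the variants into a single record with a type tag and one nullable field per variant. The sketch below illustrates that shape in plain Python; the names (`Thing`, `from_variant`) are hypothetical and not part of any Spark API.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical flattened record: one concrete type with a discriminator
# tag and a nullable field per variant, instead of a class hierarchy.
@dataclass
class Thing:
    key: int
    kind: str                     # "str" | "int" | "bool"
    s: Optional[str] = None
    i: Optional[int] = None
    b: Optional[bool] = None

def from_variant(key, value):
    """Map a variant value onto the flat, schema-friendly record."""
    if isinstance(value, bool):   # check bool before int: bool is a subclass of int
        return Thing(key, "bool", b=value)
    if isinstance(value, int):
        return Thing(key, "int", i=value)
    return Thing(key, "str", s=str(value))

things = [from_variant(1, 42), from_variant(2, "hello"), from_variant(3, False)]
```

Because every element now has the same concrete type, reflection-based schema inference has a fixed set of fields to work with; the cost is that unused variant fields are null.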
[jira] [Comment Edited] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message
[ https://issues.apache.org/jira/browse/SPARK-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385695#comment-14385695 ] Cheng Lian edited comment on SPARK-6587 at 3/29/15 10:32 AM: - This behavior is expected. There are two problems in your case: # Because {{things}} contains instances of all three case classes, the type of {{things}} is {{Seq[MyHolder]}}. Since {{MyHolder}} doesn't extend {{Product}}, it can't be recognized by {{ScalaReflection}}. # You can only use a single concrete case class {{T}} when converting {{RDD[T]}} or {{Seq[T]}} to a DataFrame. For {{things}}, we can't figure out what data type the {{foo}} field in the reflected schema should have. was (Author: lian cheng): This behavior is expected. There are two problems in your case: # Because {{things}} contains instances of all three case classes, the type of {{things}} is {{Seq[MyHolder]}}. Since {{MyHolder}} doesn't extend {{Product}}, it can't be recognized by {{ScalaReflection}}. # You can only use a single concrete case class {{T}} when converting {{RDD[T]}} or {{Seq[T]}} to a DataFrame. For {{things}}, we can't figure out what data type the {{foo}} field in the reflected schema should have. Inferring schema for case class hierarchy fails with mysterious message --- Key: SPARK-6587 URL: https://issues.apache.org/jira/browse/SPARK-6587 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: At least Windows 8, Scala 2.11.2. Reporter: Spiro Michaylov (Don't know if this is a functionality bug, error reporting bug or an RFE ...) 
I define the following hierarchy: {code} private abstract class MyHolder private case class StringHolder(s: String) extends MyHolder private case class IntHolder(i: Int) extends MyHolder private case class BooleanHolder(b: Boolean) extends MyHolder {code} and a top level case class: {code} private case class Thing(key: Integer, foo: MyHolder) {code} When I try to convert it: {code} val things = Seq( Thing(1, IntHolder(42)), Thing(2, StringHolder("hello")), Thing(3, BooleanHolder(false)) ) val thingsDF = sc.parallelize(things, 4).toDF() thingsDF.registerTempTable("things") val all = sqlContext.sql("SELECT * FROM things") {code} I get the following stack trace: {noformat} Exception in thread "main" scala.MatchError: sql.CaseClassSchemaProblem.MyHolder (of class scala.reflect.internal.Types$ClassNoArgsTypeRef) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157) at scala.collection.immutable.List.map(List.scala:276) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312) at org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250) at sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35) at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) {noformat} I wrote this to answer [a question on StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql] which uses a much simpler approach and suffers the same problem. Looking at what seems to me to be the [relevant unit test suite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/ScalaReflectionRelationSuite.scala] I see that this case is not covered. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
[jira] [Created] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.
Dale Richardson created SPARK-6593: -- Summary: Provide option for HadoopRDD to skip bad data splits. Key: SPARK-6593 URL: https://issues.apache.org/jira/browse/SPARK-6593 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Dale Richardson Priority: Minor When reading a large number of files from HDFS, e.g. with sc.textFile("hdfs:///user/cloudera/logs*.gz"), if a single split is corrupted then the entire job is canceled. As default behaviour this is probably for the best, but in some circumstances, where you know it will be OK, it would be nice to have the option to skip the corrupted portion and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6558) Utils.getCurrentUserName returns the full principal name instead of login name
[ https://issues.apache.org/jira/browse/SPARK-6558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6558. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5229 [https://github.com/apache/spark/pull/5229] Utils.getCurrentUserName returns the full principal name instead of login name -- Key: SPARK-6558 URL: https://issues.apache.org/jira/browse/SPARK-6558 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Thomas Graves Assignee: Thomas Graves Priority: Critical Fix For: 1.4.0 Utils.getCurrentUserName returns UserGroupInformation.getCurrentUser().getUserName() when SPARK_USER isn't set. It should return UserGroupInformation.getCurrentUser().getShortUserName() instead. getUserName() returns the user's full principal name (i.e. us...@corp.com), while getShortUserName() returns just the user's login name (user1). This just happens to work on YARN because the Client code sets: env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
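For intuition, getShortUserName() essentially strips the Kerberos realm (and any host component) from the full principal. The helper below is a rough, simplified illustration in Python; real short-name resolution in Hadoop is governed by the `hadoop.security.auth_to_local` mapping rules, so treat this as an approximation, not the actual algorithm.

```python
def short_user_name(principal: str) -> str:
    """Simplified stand-in for UserGroupInformation.getShortUserName():
    drop the Kerberos realm ('@REALM') and any host part ('user/host')."""
    # Strip the realm first, then any service/host component.
    return principal.split("@", 1)[0].split("/", 1)[0]
```

With this approximation, a principal like `user1@CORP.COM` maps to `user1`, which is what callers such as HDFS permission checks expect.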
[jira] [Resolved] (SPARK-6585) FileServerSuite.test (HttpFileServer should not work with SSL when the server is untrusted) failed in some env.
[ https://issues.apache.org/jira/browse/SPARK-6585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6585. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5239 [https://github.com/apache/spark/pull/5239] FileServerSuite.test (HttpFileServer should not work with SSL when the server is untrusted) failed in some env. - Key: SPARK-6585 URL: https://issues.apache.org/jira/browse/SPARK-6585 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.3.0 Reporter: June Priority: Minor Fix For: 1.4.0 On my test machine, the FileServerSuite test case (HttpFileServer should not work with SSL when the server is untrusted) throws SSLException rather than SSLHandshakeException; I suggest changing the test to catch SSLException to improve the test case's robustness. [info] - HttpFileServer should not work with SSL when the server is untrusted *** FAILED *** (69 milliseconds) [info] Expected exception javax.net.ssl.SSLHandshakeException to be thrown, but javax.net.ssl.SSLException was thrown. 
(FileServerSuite.scala:231) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) [info] at org.scalatest.Assertions$class.intercept(Assertions.scala:1004) [info] at org.scalatest.FunSuite.intercept(FunSuite.scala:1555) [info] at org.apache.spark.FileServerSuite$$anonfun$15.apply$mcV$sp(FileServerSuite.scala:231) [info] at org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224) [info] at org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) [info] at org.apache.spark.FileServerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(FileServerSuite.scala:34) [info] at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
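The robustness argument above is about exception hierarchies: javax.net.ssl.SSLHandshakeException is a subclass of SSLException, so asserting on the parent type succeeds no matter which concrete subclass a given environment throws. The Python sketch below makes the same point with Python's own `ssl` exception hierarchy; the `intercept` helper is a hypothetical analogue of ScalaTest's `intercept`, written here for illustration.

```python
import ssl

def intercept(exc_type, fn):
    """Minimal analogue of ScalaTest's intercept: run fn and assert
    that it raises exc_type (or a subclass of it); return the exception."""
    try:
        fn()
    except exc_type as e:
        return e
    raise AssertionError(f"expected {exc_type.__name__} to be thrown")

def handshake_fails():
    # Some environments surface a specific subclass of SSLError...
    raise ssl.SSLCertVerificationError("certificate verify failed")

# ...so asserting on the broader parent class keeps the test robust:
e = intercept(ssl.SSLError, handshake_fails)
```

This mirrors the fix in pull request 5239: catch the broader SSLException so the test passes whether the JVM raises the handshake-specific subclass or the general one.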
[jira] [Updated] (SPARK-6593) Provide option for HadoopRDD to skip corrupted files
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dale Richardson updated SPARK-6593: --- Summary: Provide option for HadoopRDD to skip corrupted files (was: Provide option for HadoopRDD to skip bad data splits.) Provide option for HadoopRDD to skip corrupted files Key: SPARK-6593 URL: https://issues.apache.org/jira/browse/SPARK-6593 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Dale Richardson Priority: Minor When reading a large amount of files from HDFS eg. with sc.textFile(hdfs:///user/cloudera/logs*.gz). If the hadoop input libraries report an exception then the entire job is canceled. As default behaviour this is probably for the best, but it would be nice in some circumstances where you know it will be ok to have the option to skip the corrupted portion and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6593) Provide option for HadoopRDD to skip corrupted files
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385723#comment-14385723 ] Dale Richardson commented on SPARK-6593: Changed the title and description to focus closer on my particular use case, which is corrupted gzip files. Provide option for HadoopRDD to skip corrupted files Key: SPARK-6593 URL: https://issues.apache.org/jira/browse/SPARK-6593 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Dale Richardson Priority: Minor When reading a large amount of files from HDFS eg. with sc.textFile(hdfs:///user/cloudera/logs*.gz). If the hadoop input libraries report an exception then the entire job is canceled. As default behaviour this is probably for the best, but it would be nice in some circumstances where you know it will be ok to have the option to skip the corrupted portion and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6585) FileServerSuite.test (HttpFileServer should not work with SSL when the server is untrusted) failed in some env.
[ https://issues.apache.org/jira/browse/SPARK-6585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6585: - Assignee: June FileServerSuite.test (HttpFileServer should not work with SSL when the server is untrusted) failed is some evn. - Key: SPARK-6585 URL: https://issues.apache.org/jira/browse/SPARK-6585 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.3.0 Reporter: June Assignee: June Priority: Minor Fix For: 1.4.0 In my test machine, FileServerSuite.test (HttpFileServer should not work with SSL when the server is untrusted) case throw SSLException not SSLHandshakeException, suggest change to catch SSLException to improve test case 's robustness. [info] - HttpFileServer should not work with SSL when the server is untrusted *** FAILED *** (69 milliseconds) [info] Expected exception javax.net.ssl.SSLHandshakeException to be thrown, but javax.net.ssl.SSLException was thrown. (FileServerSuite.scala:231) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) [info] at org.scalatest.Assertions$class.intercept(Assertions.scala:1004) [info] at org.scalatest.FunSuite.intercept(FunSuite.scala:1555) [info] at org.apache.spark.FileServerSuite$$anonfun$15.apply$mcV$sp(FileServerSuite.scala:231) [info] at org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224) [info] at org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at 
org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) [info] at org.apache.spark.FileServerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(FileServerSuite.scala:34) [info] at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6593) Provide option for HadoopRDD to skip corrupted files
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dale Richardson updated SPARK-6593: --- Description: When reading a large number of files from HDFS, e.g. with sc.textFile("hdfs:///user/cloudera/logs*.gz"), if the hadoop input libraries report an exception then the entire job is canceled. As default behaviour this is probably for the best, but in some circumstances, where you know it will be OK, it would be nice to have the option to skip the corrupted file and continue the job. was: When reading a large number of files from HDFS, e.g. with sc.textFile("hdfs:///user/cloudera/logs*.gz"), if the hadoop input libraries report an exception then the entire job is canceled. As default behaviour this is probably for the best, but in some circumstances, where you know it will be OK, it would be nice to have the option to skip the corrupted portion and continue the job. Provide option for HadoopRDD to skip corrupted files Key: SPARK-6593 URL: https://issues.apache.org/jira/browse/SPARK-6593 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Dale Richardson Priority: Minor When reading a large number of files from HDFS, e.g. with sc.textFile("hdfs:///user/cloudera/logs*.gz"), if the hadoop input libraries report an exception then the entire job is canceled. As default behaviour this is probably for the best, but in some circumstances, where you know it will be OK, it would be nice to have the option to skip the corrupted file and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
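The proposed opt-in behaviour can be sketched outside Spark: wrap each file's read in a try/except and skip any split whose compressed stream is corrupt, instead of failing the whole job. A minimal Python illustration follows; the function name is hypothetical and this is not the actual HadoopRDD change, just the shape of the idea applied to gzip data (the use case named in the comments).

```python
import gzip
import io

def read_lines_skipping_corrupt(blobs):
    """Yield decoded text lines from (name, bytes) gzip blobs, skipping
    any blob whose gzip stream is corrupt -- a sketch of the proposed
    opt-in 'skip corrupted files' behaviour."""
    for name, data in blobs:
        try:
            with gzip.open(io.BytesIO(data), "rt") as f:
                yield from f
        except (OSError, EOFError):
            # Corrupt split: skip it and continue, rather than
            # cancelling the entire job. A real implementation
            # would log the skipped file name here.
            continue

good = gzip.compress(b"line1\nline2\n")
bad = b"\x1f\x8b garbage"  # gzip magic bytes followed by junk
lines = list(read_lines_skipping_corrupt([("a.gz", good), ("b.gz", bad)]))
```

Note the trade-off the JIRA discussion acknowledges: silently skipping data is dangerous as a default, which is why it would need to be an explicit option.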
[jira] [Assigned] (SPARK-6580) Optimize LogisticRegressionModel.predictPoint
[ https://issues.apache.org/jira/browse/SPARK-6580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6580: --- Assignee: (was: Apache Spark) Optimize LogisticRegressionModel.predictPoint - Key: SPARK-6580 URL: https://issues.apache.org/jira/browse/SPARK-6580 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor LogisticRegressionModel.predictPoint could be optimized somewhat. There are several checks which could be moved outside loops, or even out of predictPoint into model initialization. Some include: {code} require(numFeatures == weightMatrix.size) val dataWithBiasSize = weightMatrix.size / (numClasses - 1) val weightsArray = weightMatrix match { ... if (dataMatrix.size + 1 == dataWithBiasSize) {... {code} Also, for multiclass, the two loops (over numClasses and margins) could be combined into one loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
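The two suggestions above, hoisting invariant shape checks into model construction and fusing the margin computation with the argmax scan, can be sketched in plain Python. This is a simplified stand-in for predictPoint, not MLlib's actual implementation; the class and method names are illustrative.

```python
class LogisticModel:
    """Sketch of the proposed refactoring: validate once at construction
    time instead of inside every predict_point call."""

    def __init__(self, weights, num_features, num_classes):
        # Checks hoisted out of the hot path (run once per model,
        # not once per predicted point):
        assert len(weights) == num_features * (num_classes - 1)
        self.weights = weights
        self.num_features = num_features
        self.num_classes = num_classes

    def predict_point(self, x):
        # Single fused loop: compute each class's margin against the
        # class-0 baseline (margin 0.0) and track the argmax as we go,
        # instead of one loop filling a margins array and a second
        # loop scanning it.
        best_class, best_margin = 0, 0.0
        n = self.num_features
        for k in range(self.num_classes - 1):
            margin = sum(w * xi for w, xi in zip(self.weights[k * n:(k + 1) * n], x))
            if margin > best_margin:
                best_class, best_margin = k + 1, margin
        return best_class

m = LogisticModel([1.0, 0.0, 0.0, 1.0], num_features=2, num_classes=3)
```

The per-point work drops to a single pass over the weights, and malformed models fail fast at construction rather than deep inside prediction.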
[jira] [Commented] (SPARK-6580) Optimize LogisticRegressionModel.predictPoint
[ https://issues.apache.org/jira/browse/SPARK-6580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385664#comment-14385664 ] Apache Spark commented on SPARK-6580: - User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/5249 Optimize LogisticRegressionModel.predictPoint - Key: SPARK-6580 URL: https://issues.apache.org/jira/browse/SPARK-6580 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor LogisticRegressionModel.predictPoint could be optimized some. There are several checks which could be moved outside loops or even outside predictPoint to initialization of the model. Some include: {code} require(numFeatures == weightMatrix.size) val dataWithBiasSize = weightMatrix.size / (numClasses - 1) val weightsArray = weightMatrix match { ... if (dataMatrix.size + 1 == dataWithBiasSize) {... {code} Also, for multiclass, the 2 loops (over numClasses and margins) could be combined into 1 loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385715#comment-14385715 ] Apache Spark commented on SPARK-6593: - User 'tigerquoll' has created a pull request for this issue: https://github.com/apache/spark/pull/5250 Provide option for HadoopRDD to skip bad data splits. - Key: SPARK-6593 URL: https://issues.apache.org/jira/browse/SPARK-6593 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Dale Richardson Priority: Minor When reading a large amount of files from HDFS eg. with sc.textFile(hdfs:///user/cloudera/logs*.gz). If a single split is corrupted then the entire job is canceled. As default behaviour this is probably for the best, but it would be nice in some circumstances where you know it will be ok to have the option to skip the corrupted portion and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6593: --- Assignee: (was: Apache Spark) Provide option for HadoopRDD to skip bad data splits. - Key: SPARK-6593 URL: https://issues.apache.org/jira/browse/SPARK-6593 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Dale Richardson Priority: Minor When reading a large amount of files from HDFS eg. with sc.textFile(hdfs:///user/cloudera/logs*.gz). If a single split is corrupted then the entire job is canceled. As default behaviour this is probably for the best, but it would be nice in some circumstances where you know it will be ok to have the option to skip the corrupted portion and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6593: --- Assignee: Apache Spark Provide option for HadoopRDD to skip bad data splits. - Key: SPARK-6593 URL: https://issues.apache.org/jira/browse/SPARK-6593 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Dale Richardson Assignee: Apache Spark Priority: Minor When reading a large amount of files from HDFS eg. with sc.textFile(hdfs:///user/cloudera/logs*.gz). If a single split is corrupted then the entire job is canceled. As default behaviour this is probably for the best, but it would be nice in some circumstances where you know it will be ok to have the option to skip the corrupted portion and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6592) API of Row trait should be presented in Scala doc
[ https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385724#comment-14385724 ] Nan Zhu commented on SPARK-6592: ? I don't think that makes any difference, as the path of Row.scala still contains spark/sql/catalyst? I also tried to rerun build/sbt doc, the same thing... maybe we need to hack SparkBuild.scala to exclude Row.scala? API of Row trait should be presented in Scala doc - Key: SPARK-6592 URL: https://issues.apache.org/jira/browse/SPARK-6592 Project: Spark Issue Type: Bug Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Nan Zhu Currently, the API of Row class is not presented in Scaladoc, though we have many chances to use it the reason is that we ignore all files under catalyst directly in SparkBuild.scala when generating Scaladoc, (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369) What's the best approach to fix this? [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385716#comment-14385716 ] Dale Richardson edited comment on SPARK-6593 at 3/29/15 11:35 AM: -- With a gz file for example, the entire file is a split. so a corrupted gz file will kill the entire job - with no way of catching and remediating the error. was (Author: tigerquoll): With a gz file for example, the entire file is a split. so a corrupted gz file will kill the entire job. Provide option for HadoopRDD to skip bad data splits. - Key: SPARK-6593 URL: https://issues.apache.org/jira/browse/SPARK-6593 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Dale Richardson Priority: Minor When reading a large amount of files from HDFS eg. with sc.textFile(hdfs:///user/cloudera/logs*.gz). If a single split is corrupted then the entire job is canceled. As default behaviour this is probably for the best, but it would be nice in some circumstances where you know it will be ok to have the option to skip the corrupted portion and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385716#comment-14385716 ] Dale Richardson commented on SPARK-6593: With a gz file for example, the entire file is a split. so a corrupted gz file will kill the entire job. Provide option for HadoopRDD to skip bad data splits. - Key: SPARK-6593 URL: https://issues.apache.org/jira/browse/SPARK-6593 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Dale Richardson Priority: Minor When reading a large amount of files from HDFS eg. with sc.textFile(hdfs:///user/cloudera/logs*.gz). If a single split is corrupted then the entire job is canceled. As default behaviour this is probably for the best, but it would be nice in some circumstances where you know it will be ok to have the option to skip the corrupted portion and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6580) Optimize LogisticRegressionModel.predictPoint
[ https://issues.apache.org/jira/browse/SPARK-6580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6580: --- Assignee: Apache Spark Optimize LogisticRegressionModel.predictPoint - Key: SPARK-6580 URL: https://issues.apache.org/jira/browse/SPARK-6580 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Apache Spark Priority: Minor LogisticRegressionModel.predictPoint could be optimized some. There are several checks which could be moved outside loops or even outside predictPoint to initialization of the model. Some include: {code} require(numFeatures == weightMatrix.size) val dataWithBiasSize = weightMatrix.size / (numClasses - 1) val weightsArray = weightMatrix match { ... if (dataMatrix.size + 1 == dataWithBiasSize) {... {code} Also, for multiclass, the 2 loops (over numClasses and margins) could be combined into 1 loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6579) save as parquet with overwrite failed
[ https://issues.apache.org/jira/browse/SPARK-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385694#comment-14385694 ] Cheng Lian commented on SPARK-6579: --- Here's another Parquet issue with Hadoop 1.0.4: SPARK-6581. save as parquet with overwrite failed - Key: SPARK-6579 URL: https://issues.apache.org/jira/browse/SPARK-6579 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.4.0 Reporter: Davies Liu Assignee: Michael Armbrust Priority: Critical {code} df = sc.parallelize(xrange(n), 4).map(lambda x: (x, str(x) * 2,)).toDF(['int', 'str']) df.save(test_data, source=parquet, mode='overwrite') df.save(test_data, source=parquet, mode='overwrite') {code} it failed with: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 3.0 failed 1 times, most recent failure: Lost task 3.0 in stage 3.0 (TID 6, localhost): java.lang.IllegalArgumentException: You cannot call toBytes() more than once without calling reset() at parquet.Preconditions.checkArgument(Preconditions.java:47) at parquet.column.values.rle.RunLengthBitPackingHybridEncoder.toBytes(RunLengthBitPackingHybridEncoder.java:254) at parquet.column.values.rle.RunLengthBitPackingHybridValuesWriter.getBytes(RunLengthBitPackingHybridValuesWriter.java:68) at parquet.column.impl.ColumnWriterImpl.writePage(ColumnWriterImpl.java:147) at parquet.column.impl.ColumnWriterImpl.flush(ColumnWriterImpl.java:236) at parquet.column.impl.ColumnWriteStoreImpl.flush(ColumnWriteStoreImpl.java:113) at parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153) at parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112) at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73) at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:663) at 
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:677) at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:677) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:212) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1211) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1200) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1199) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1199) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1399) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1360) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) {code} run it again, it failed with: {code} 15/03/27 13:26:16 WARN FSInputChecker: Problem opening checksum file: file:/Users/davies/work/spark/tmp/test_data/_temporary/_attempt_201503271324_0011_r_03_0/part-r-4.parquet. 
Ignoring exception: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:197) at java.io.DataInputStream.readFully(DataInputStream.java:169) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:134) at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:402) at
[jira] [Updated] (SPARK-6579) save as parquet with overwrite failed when linking with Hadoop 1.0.4
[ https://issues.apache.org/jira/browse/SPARK-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6579: -- Summary: save as parquet with overwrite failed when linking with Hadoop 1.0.4 (was: save as parquet with overwrite failed) save as parquet with overwrite failed when linking with Hadoop 1.0.4 Key: SPARK-6579 URL: https://issues.apache.org/jira/browse/SPARK-6579 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.4.0 Reporter: Davies Liu Assignee: Michael Armbrust Priority: Critical {code} df = sc.parallelize(xrange(n), 4).map(lambda x: (x, str(x) * 2,)).toDF(['int', 'str']) df.save(test_data, source=parquet, mode='overwrite') df.save(test_data, source=parquet, mode='overwrite') {code} it failed with: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 3.0 failed 1 times, most recent failure: Lost task 3.0 in stage 3.0 (TID 6, localhost): java.lang.IllegalArgumentException: You cannot call toBytes() more than once without calling reset() at parquet.Preconditions.checkArgument(Preconditions.java:47) at parquet.column.values.rle.RunLengthBitPackingHybridEncoder.toBytes(RunLengthBitPackingHybridEncoder.java:254) at parquet.column.values.rle.RunLengthBitPackingHybridValuesWriter.getBytes(RunLengthBitPackingHybridValuesWriter.java:68) at parquet.column.impl.ColumnWriterImpl.writePage(ColumnWriterImpl.java:147) at parquet.column.impl.ColumnWriterImpl.flush(ColumnWriterImpl.java:236) at parquet.column.impl.ColumnWriteStoreImpl.flush(ColumnWriteStoreImpl.java:113) at parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153) at parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112) at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73) at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:663) at 
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:677) at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:677) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:212) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1211) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1200) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1199) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1199) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1399) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1360) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) {code} run it again, it failed with: {code} 15/03/27 13:26:16 WARN FSInputChecker: Problem opening checksum file: file:/Users/davies/work/spark/tmp/test_data/_temporary/_attempt_201503271324_0011_r_03_0/part-r-4.parquet. 
Ignoring exception: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:197) at java.io.DataInputStream.readFully(DataInputStream.java:169) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:134) at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at
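For context on the first failure above: the IllegalArgumentException comes from a guard in Parquet's encoder that enforces a single-use contract, where toBytes() may be called at most once until reset() is called. A minimal Python sketch of that contract (illustrative only, not the Parquet source):

```python
class SingleUseEncoder:
    """Mimics the toBytes()/reset() contract the stack trace is enforcing:
    a second to_bytes() without an intervening reset() is a usage error.
    Hypothetical class, for illustration only."""

    def __init__(self):
        self._buffer = bytearray()
        self._flushed = False

    def write(self, data: bytes) -> None:
        self._buffer.extend(data)

    def to_bytes(self) -> bytes:
        if self._flushed:
            # Corresponds to Preconditions.checkArgument in the trace above.
            raise RuntimeError(
                "You cannot call to_bytes() more than once without calling reset()")
        self._flushed = True
        return bytes(self._buffer)

    def reset(self) -> None:
        self._buffer.clear()
        self._flushed = False
```

The reported bug is therefore a caller flushing the same page twice, not corrupt data per se.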
[jira] [Updated] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dale Richardson updated SPARK-6593: --- Description: When reading a large number of files from HDFS, e.g. with sc.textFile("hdfs:///user/cloudera/logs*.gz"), if the Hadoop input libraries report an exception then the entire job is canceled. As the default behaviour this is probably for the best, but in some circumstances, where you know it will be OK, it would be nice to have the option to skip the corrupted portion and continue the job. was: When reading a large amount of files from HDFS eg. with sc.textFile(hdfs:///user/cloudera/logs*.gz). If a single split is corrupted then the entire job is canceled. As default behaviour this is probably for the best, but it would be nice in some circumstances where you know it will be ok to have the option to skip the corrupted portion and continue the job. Provide option for HadoopRDD to skip bad data splits. - Key: SPARK-6593 URL: https://issues.apache.org/jira/browse/SPARK-6593 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Dale Richardson Priority: Minor When reading a large number of files from HDFS, e.g. with sc.textFile("hdfs:///user/cloudera/logs*.gz"), if the Hadoop input libraries report an exception then the entire job is canceled. As the default behaviour this is probably for the best, but in some circumstances, where you know it will be OK, it would be nice to have the option to skip the corrupted portion and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
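The requested skip-bad-splits behaviour can be sketched without Spark: read a collection of gzip blobs and drop the ones that fail to decompress instead of aborting everything. The helper name below is hypothetical, and this is plain Python for illustration rather than HadoopRDD code.

```python
import gzip


def read_lines_skipping_bad(blobs):
    """Decompress each gzip blob and yield its lines; a blob that fails to
    decompress is skipped instead of killing the whole job (sketch of the
    proposed opt-in behaviour, not Spark code)."""
    for blob in blobs:
        try:
            text = gzip.decompress(blob).decode("utf-8")
        except (OSError, EOFError):
            # Corrupted split: skip it and keep processing the rest.
            continue
        yield from text.splitlines()
```

In Spark terms, the equivalent would be an option on HadoopRDD that turns a per-split read exception into an empty (or partial) split rather than a task failure.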
[jira] [Commented] (SPARK-6595) DataFrame self joins with MetastoreRelations fail
[ https://issues.apache.org/jira/browse/SPARK-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385924#comment-14385924 ] Apache Spark commented on SPARK-6595: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/5251 DataFrame self joins with MetastoreRelations fail - Key: SPARK-6595 URL: https://issues.apache.org/jira/browse/SPARK-6595 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6595) DataFrame self joins with MetastoreRelations fail
[ https://issues.apache.org/jira/browse/SPARK-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6595: --- Assignee: Michael Armbrust (was: Apache Spark) DataFrame self joins with MetastoreRelations fail - Key: SPARK-6595 URL: https://issues.apache.org/jira/browse/SPARK-6595 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6595) DataFrame self joins with MetastoreRelations fail
[ https://issues.apache.org/jira/browse/SPARK-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6595: --- Assignee: Apache Spark (was: Michael Armbrust) DataFrame self joins with MetastoreRelations fail - Key: SPARK-6595 URL: https://issues.apache.org/jira/browse/SPARK-6595 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Michael Armbrust Assignee: Apache Spark Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6592) API of Row trait should be presented in Scala doc
[ https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385964#comment-14385964 ] Nan Zhu commented on SPARK-6592: It does contain it. The reason is that the input of that line is file.getCanonicalPath...which outputs the absolute path, e.g.
{code}
scala> val f = new java.io.File("Row.class")
f: java.io.File = Row.class
scala> f.getCanonicalPath
res0: String = /Users/nanzhu/code/spark/sql/catalyst/target/scala-2.10/classes/org/apache/spark/sql/Row.class
{code}
API of Row trait should be presented in Scala doc - Key: SPARK-6592 URL: https://issues.apache.org/jira/browse/SPARK-6592 Project: Spark Issue Type: Bug Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Nan Zhu Priority: Critical Currently, the API of the Row class is not presented in Scaladoc, though we use it in many places. The reason is that we ignore all files under the catalyst directory in SparkBuild.scala when generating Scaladoc (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369). What's the best approach to fix this? [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6592) API of Row trait should be presented in Scala doc
[ https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385895#comment-14385895 ] Reynold Xin commented on SPARK-6592: Row.html/class doesn't contain the word catalyst, does it? ./api/java/org/apache/spark/sql/Row.html API of Row trait should be presented in Scala doc - Key: SPARK-6592 URL: https://issues.apache.org/jira/browse/SPARK-6592 Project: Spark Issue Type: Bug Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Nan Zhu Priority: Critical Currently, the API of Row class is not presented in Scaladoc, though we have many chances to use it the reason is that we ignore all files under catalyst directly in SparkBuild.scala when generating Scaladoc, (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369) What's the best approach to fix this? [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6595) DataFrame self joins with MetastoreRelations fail
Michael Armbrust created SPARK-6595: --- Summary: DataFrame self joins with MetastoreRelations fail Key: SPARK-6595 URL: https://issues.apache.org/jira/browse/SPARK-6595 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6592) API of Row trait should be presented in Scala doc
[ https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385970#comment-14385970 ] Reynold Xin commented on SPARK-6592: Ok then can't you just add apache to it? API of Row trait should be presented in Scala doc - Key: SPARK-6592 URL: https://issues.apache.org/jira/browse/SPARK-6592 Project: Spark Issue Type: Bug Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Nan Zhu Priority: Critical Currently, the API of Row class is not presented in Scaladoc, though we have many chances to use it the reason is that we ignore all files under catalyst directly in SparkBuild.scala when generating Scaladoc, (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369) What's the best approach to fix this? [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
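The exchange above can be sketched concretely (plain Python with hypothetical helper names, for illustration rather than the actual SparkBuild.scala logic): filtering on the bare substring "catalyst" also excludes compiled public classes such as org.apache.spark.sql.Row, because their canonical paths pass through the sql/catalyst module directory, whereas matching the full package path segment does not.

```python
def excluded_from_scaladoc(canonical_path: str) -> bool:
    # Current behaviour (sketch): any file whose canonical path mentions
    # "catalyst" is dropped - this also hits the sql/catalyst *module*
    # directory, and therefore public classes like org.apache.spark.sql.Row.
    return "catalyst" in canonical_path


def excluded_from_scaladoc_fixed(canonical_path: str) -> bool:
    # Sketch of the suggestion in the comment above: match the package path
    # segment instead, so only classes actually in the
    # org.apache.spark.sql.catalyst package are excluded.
    return "/org/apache/spark/sql/catalyst/" in canonical_path
```

Under the stricter filter, Row (package org.apache.spark.sql) survives while genuinely internal catalyst classes are still skipped.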
[jira] [Assigned] (SPARK-6592) API of Row trait should be presented in Scala doc
[ https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6592: --- Assignee: (was: Apache Spark) API of Row trait should be presented in Scala doc - Key: SPARK-6592 URL: https://issues.apache.org/jira/browse/SPARK-6592 Project: Spark Issue Type: Bug Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Nan Zhu Priority: Critical Currently, the API of Row class is not presented in Scaladoc, though we have many chances to use it the reason is that we ignore all files under catalyst directly in SparkBuild.scala when generating Scaladoc, (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369) What's the best approach to fix this? [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6592) API of Row trait should be presented in Scala doc
[ https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385991#comment-14385991 ] Apache Spark commented on SPARK-6592: - User 'CodingCat' has created a pull request for this issue: https://github.com/apache/spark/pull/5252 API of Row trait should be presented in Scala doc - Key: SPARK-6592 URL: https://issues.apache.org/jira/browse/SPARK-6592 Project: Spark Issue Type: Bug Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Nan Zhu Priority: Critical Currently, the API of Row class is not presented in Scaladoc, though we have many chances to use it the reason is that we ignore all files under catalyst directly in SparkBuild.scala when generating Scaladoc, (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369) What's the best approach to fix this? [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6592) API of Row trait should be presented in Scala doc
[ https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6592: --- Assignee: Apache Spark API of Row trait should be presented in Scala doc - Key: SPARK-6592 URL: https://issues.apache.org/jira/browse/SPARK-6592 Project: Spark Issue Type: Bug Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Nan Zhu Assignee: Apache Spark Priority: Critical Currently, the API of Row class is not presented in Scaladoc, though we have many chances to use it the reason is that we ignore all files under catalyst directly in SparkBuild.scala when generating Scaladoc, (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369) What's the best approach to fix this? [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6596) fix the instruction on building scaladoc
Nan Zhu created SPARK-6596: -- Summary: fix the instruction on building scaladoc Key: SPARK-6596 URL: https://issues.apache.org/jira/browse/SPARK-6596 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Nan Zhu In README.md under the docs/ directory, it says that "You can build just the Spark scaladoc by running build/sbt doc from the SPARK_PROJECT_ROOT directory." I guess the right approach is build/sbt unidoc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6596) fix the instruction on building scaladoc
[ https://issues.apache.org/jira/browse/SPARK-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385993#comment-14385993 ] Apache Spark commented on SPARK-6596: - User 'CodingCat' has created a pull request for this issue: https://github.com/apache/spark/pull/5253 fix the instruction on building scaladoc - Key: SPARK-6596 URL: https://issues.apache.org/jira/browse/SPARK-6596 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Nan Zhu In README.md under docs/ directory, it says that You can build just the Spark scaladoc by running build/sbt doc from the SPARK_PROJECT_ROOT directory. I guess the right approach is build/sbt unidoc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6596) fix the instruction on building scaladoc
[ https://issues.apache.org/jira/browse/SPARK-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6596: --- Assignee: (was: Apache Spark) fix the instruction on building scaladoc - Key: SPARK-6596 URL: https://issues.apache.org/jira/browse/SPARK-6596 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Nan Zhu In README.md under docs/ directory, it says that You can build just the Spark scaladoc by running build/sbt doc from the SPARK_PROJECT_ROOT directory. I guess the right approach is build/sbt unidoc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6595) DataFrame self joins with MetastoreRelations fail
[ https://issues.apache.org/jira/browse/SPARK-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6595: Target Version/s: 1.3.1, 1.4.0 (was: 1.3.1) DataFrame self joins with MetastoreRelations fail - Key: SPARK-6595 URL: https://issues.apache.org/jira/browse/SPARK-6595 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6596) fix the instruction on building scaladoc
[ https://issues.apache.org/jira/browse/SPARK-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6596: --- Assignee: Apache Spark fix the instruction on building scaladoc - Key: SPARK-6596 URL: https://issues.apache.org/jira/browse/SPARK-6596 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Nan Zhu Assignee: Apache Spark In README.md under docs/ directory, it says that You can build just the Spark scaladoc by running build/sbt doc from the SPARK_PROJECT_ROOT directory. I guess the right approach is build/sbt unidoc -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386009#comment-14386009 ] Liang-Chi Hsieh commented on SPARK-6586: Not true. Because DataFrame is now given the analyzed plan after many of its operations, {{df.queryExecution.logical}} is the analyzed plan instead of the original logical plan. Add the capability of retrieving original logical plan of DataFrame --- Key: SPARK-6586 URL: https://issues.apache.org/jira/browse/SPARK-6586 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor In order to solve a bug, since #5217, {{DataFrame}} now uses the analyzed plan instead of the logical plan. However, by doing that we can't know the logical plan of a {{DataFrame}}, and it might still be useful and important to retrieve the original logical plan in some use cases. In this pr, we introduce the capability of retrieving the original logical plan of a {{DataFrame}}. The approach is that we add an {{analyzed}} variable to {{LogicalPlan}}. Once {{Analyzer}} finishes analysis, it sets {{analyzed}} of {{LogicalPlan}} to {{true}}. In {{QueryExecution}}, we keep the original logical plan in the analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to recursively replace the analyzed logical plan with the original logical plan and retrieve it. Besides the capability of retrieving the original logical plan, this modification can also avoid re-analyzing a plan that is already analyzed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh reopened SPARK-6586: Add the capability of retrieving original logical plan of DataFrame --- Key: SPARK-6586 URL: https://issues.apache.org/jira/browse/SPARK-6586 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor In order to solve a bug, since #5217, {{DataFrame}} now uses analyzed plan instead of logical plan. However, by doing that we can't know the logical plan of a {{DataFrame}}. But it might be still useful and important to retrieve the original logical plan in some use cases. In this pr, we introduce the capability of retrieving original logical plan of {{DataFrame}}. The approach is that we add an {{analyzed}} variable to {{LogicalPlan}}. Once {{Analyzer}} finishes analysis, it sets {{analyzed}} of {{LogicalPlan}} as {{true}}. In {{QueryExecution}}, we keep the original logical plan in the analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to recursively replace the analyzed logical plan with original logical plan and retrieve it. Besides the capability of retrieving original logical plan, this modification also can avoid do plan analysis if it is already analyzed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386009#comment-14386009 ] Liang-Chi Hsieh edited comment on SPARK-6586 at 3/29/15 11:24 PM: -- Not true. Because DataFrame now is given analyzed plan after its many operations, {{df.queryExecution.logical}} is analyzed plan instead of the original logical plan. You can check the pr #5217 for the modification. was (Author: viirya): Not true. Because DataFrame now is given analyzed plan after its many operations, {{df.queryExecution.logical}} is analyzed plan instead of the original logical plan. Add the capability of retrieving original logical plan of DataFrame --- Key: SPARK-6586 URL: https://issues.apache.org/jira/browse/SPARK-6586 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor In order to solve a bug, since #5217, {{DataFrame}} now uses analyzed plan instead of logical plan. However, by doing that we can't know the logical plan of a {{DataFrame}}. But it might be still useful and important to retrieve the original logical plan in some use cases. In this pr, we introduce the capability of retrieving original logical plan of {{DataFrame}}. The approach is that we add an {{analyzed}} variable to {{LogicalPlan}}. Once {{Analyzer}} finishes analysis, it sets {{analyzed}} of {{LogicalPlan}} as {{true}}. In {{QueryExecution}}, we keep the original logical plan in the analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to recursively replace the analyzed logical plan with original logical plan and retrieve it. Besides the capability of retrieving original logical plan, this modification also can avoid do plan analysis if it is already analyzed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
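The approach described in SPARK-6586 can be sketched with toy classes (illustrative Python, not Catalyst code): each plan node carries an analyzed flag and remembers the pre-analysis plan it replaced, so the original plan can be recovered recursively and re-analysis can be skipped.

```python
class LogicalPlan:
    """Toy plan node sketching the proposal: an `analyzed` flag plus a
    back-pointer to the plan this node replaced. Hypothetical names."""

    def __init__(self, name, children=(), analyzed=False, original=None):
        self.name = name
        self.children = list(children)
        self.analyzed = analyzed
        self.original = original  # pre-analysis plan, if any

    def original_plan(self):
        # Recursively swap each analyzed node back to the plan it replaced.
        if self.analyzed and self.original is not None:
            return self.original.original_plan()
        return LogicalPlan(self.name, [c.original_plan() for c in self.children])


def analyze(plan):
    # Stand-in for the Analyzer: skip work if already analyzed, otherwise
    # produce a resolved copy that remembers its original.
    if plan.analyzed:
        return plan
    return LogicalPlan("resolved_" + plan.name,
                       [analyze(c) for c in plan.children],
                       analyzed=True, original=plan)
```

The same flag that preserves the original plan is what lets analyze() return early on an already-analyzed tree.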
[jira] [Created] (SPARK-6597) Replace `input:checkbox` with `input[type=checkbox]` in additional-metrics.js
Kousuke Saruta created SPARK-6597: - Summary: Replace `input:checkbox` with `input[type=checkbox]` in additional-metrics.js Key: SPARK-6597 URL: https://issues.apache.org/jira/browse/SPARK-6597 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.2, 1.3.1, 1.4.0 Reporter: Kousuke Saruta Priority: Minor In additional-metrics.js, there is some selector notation like `input:checkbox`, but jQuery's official documentation says `input[type=checkbox]` is better. https://api.jquery.com/checkbox-selector/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6597) Replace `input:checkbox` with `input[type=checkbox]` in additional-metrics.js
[ https://issues.apache.org/jira/browse/SPARK-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6597: --- Assignee: Apache Spark Replace `input:checkbox` with `input[type=checkbox] in additional-metrics.js -- Key: SPARK-6597 URL: https://issues.apache.org/jira/browse/SPARK-6597 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.2, 1.3.1, 1.4.0 Reporter: Kousuke Saruta Assignee: Apache Spark Priority: Minor In additional-metrics.js, there are some selector notation like `input:checkbox` but JQuery's official document says `input[type=checkbox]` is better. https://api.jquery.com/checkbox-selector/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6597) Replace `input:checkbox` with `input[type=checkbox]` in additional-metrics.js
[ https://issues.apache.org/jira/browse/SPARK-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6597: --- Assignee: (was: Apache Spark) Replace `input:checkbox` with `input[type=checkbox]` in additional-metrics.js -- Key: SPARK-6597 URL: https://issues.apache.org/jira/browse/SPARK-6597 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.2, 1.3.1, 1.4.0 Reporter: Kousuke Saruta Priority: Minor In additional-metrics.js there are selectors written as `input:checkbox`, but jQuery's official documentation says `input[type=checkbox]` is better: https://api.jquery.com/checkbox-selector/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6597) Replace `input:checkbox` with `input[type=checkbox]` in additional-metrics.js
[ https://issues.apache.org/jira/browse/SPARK-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386017#comment-14386017 ] Apache Spark commented on SPARK-6597: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/5254 Replace `input:checkbox` with `input[type=checkbox]` in additional-metrics.js -- Key: SPARK-6597 URL: https://issues.apache.org/jira/browse/SPARK-6597 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.2, 1.3.1, 1.4.0 Reporter: Kousuke Saruta Priority: Minor In additional-metrics.js there are selectors written as `input:checkbox`, but jQuery's official documentation says `input[type=checkbox]` is better: https://api.jquery.com/checkbox-selector/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386020#comment-14386020 ] Michael Armbrust commented on SPARK-6586: - Okay, but what is the utility of keeping a fully unresolved plan around? You are just complicating {{DataFrame}} with a bunch of mutable state. Add the capability of retrieving original logical plan of DataFrame --- Key: SPARK-6586 URL: https://issues.apache.org/jira/browse/SPARK-6586 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor In order to fix a bug, since #5217 {{DataFrame}} uses the analyzed plan instead of the logical plan. However, this means we can no longer see the original logical plan of a {{DataFrame}}, which can still be useful and important in some use cases. In this PR, we introduce the capability of retrieving the original logical plan of a {{DataFrame}}. The approach is to add an {{analyzed}} flag to {{LogicalPlan}}. Once {{Analyzer}} finishes analysis, it sets {{analyzed}} of the {{LogicalPlan}} to {{true}}. In {{QueryExecution}}, we keep the original logical plan inside the analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} recursively replaces the analyzed logical plan with the original logical plan and returns it. Besides enabling retrieval of the original logical plan, this modification also avoids re-running analysis on a plan that is already analyzed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org