[jira] [Created] (SPARK-6599) Add Kinesis Direct API

2015-03-29 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-6599:


 Summary: Add Kinesis Direct API
 Key: SPARK-6599
 URL: https://issues.apache.org/jira/browse/SPARK-6599
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das









[jira] [Updated] (SPARK-6369) InsertIntoHiveTable and Parquet Relation should use logic from SparkHadoopWriter

2015-03-29 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6369:

Summary: InsertIntoHiveTable and Parquet Relation should use logic from 
SparkHadoopWriter  (was: InsertIntoHiveTable should use logic from 
SparkHadoopWriter)

 InsertIntoHiveTable and Parquet Relation should use logic from 
 SparkHadoopWriter
 

 Key: SPARK-6369
 URL: https://issues.apache.org/jira/browse/SPARK-6369
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Blocker

 Right now it is possible that we will corrupt the output if there is a race 
 between competing speculative tasks.






[jira] [Closed] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame

2015-03-29 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh closed SPARK-6586.
--
Resolution: Not a Problem

 Add the capability of retrieving original logical plan of DataFrame
 ---

 Key: SPARK-6586
 URL: https://issues.apache.org/jira/browse/SPARK-6586
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor

 In order to fix a bug, since #5217 {{DataFrame}} uses the analyzed plan 
 instead of the logical plan. As a result, the original logical plan of a 
 {{DataFrame}} is no longer accessible, even though it can still be useful 
 and important in some use cases.
 In this PR, we introduce the capability of retrieving the original logical 
 plan of a {{DataFrame}}.
 The approach is to add an {{analyzed}} flag to {{LogicalPlan}}. Once 
 {{Analyzer}} finishes analysis, it sets {{analyzed}} of the {{LogicalPlan}} 
 to {{true}}. In {{QueryExecution}}, we keep the original logical plan inside 
 the analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to 
 recursively replace the analyzed logical plan with the original logical plan 
 and retrieve it.
 Besides enabling retrieval of the original logical plan, this modification 
 also avoids re-analyzing a plan that has already been analyzed.
  






[jira] [Commented] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame

2015-03-29 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386209#comment-14386209
 ] 

Liang-Chi Hsieh commented on SPARK-6586:


I have no problem with your opinion. But if that is true, we don't need to keep 
the logical plan in {{QueryExecution}} anymore. I am closing this, thanks.

 Add the capability of retrieving original logical plan of DataFrame
 ---

 Key: SPARK-6586
 URL: https://issues.apache.org/jira/browse/SPARK-6586
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor

 In order to fix a bug, since #5217 {{DataFrame}} uses the analyzed plan 
 instead of the logical plan. As a result, the original logical plan of a 
 {{DataFrame}} is no longer accessible, even though it can still be useful 
 and important in some use cases.
 In this PR, we introduce the capability of retrieving the original logical 
 plan of a {{DataFrame}}.
 The approach is to add an {{analyzed}} flag to {{LogicalPlan}}. Once 
 {{Analyzer}} finishes analysis, it sets {{analyzed}} of the {{LogicalPlan}} 
 to {{true}}. In {{QueryExecution}}, we keep the original logical plan inside 
 the analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to 
 recursively replace the analyzed logical plan with the original logical plan 
 and retrieve it.
 Besides enabling retrieval of the original logical plan, this modification 
 also avoids re-analyzing a plan that has already been analyzed.
  






[jira] [Assigned] (SPARK-5203) union with different decimal type report error

2015-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5203:
---

Assignee: (was: Apache Spark)

 union with different decimal type report error
 --

 Key: SPARK-5203
 URL: https://issues.apache.org/jira/browse/SPARK-5203
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: guowei

 Test case like this:
 {code:sql}
 create table test (a decimal(10,1));
 select a from test union all select a*2 from test;
 {code}
 Exception thrown:
 {noformat}
 15/01/12 16:28:54 ERROR SparkSQLDriver: Failed in [select a from test union 
 all select a*2 from test]
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
 attributes: *, tree:
 'Project [*]
  'Subquery _u1
   'Union 
Project [a#1]
 MetastoreRelation default, test, None
Project [CAST((CAST(a#2, DecimalType()) * CAST(CAST(2, DecimalType(10,0)), 
 DecimalType())), DecimalType(21,1)) AS _c0#0]
 MetastoreRelation default, test, None
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:85)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:83)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:81)
   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
   at 
 scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
   at 
 scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
   at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:410)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:410)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:411)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:411)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:412)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:412)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421)
   at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:369)
   at 
 org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:58)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275)
   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
 {noformat}
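A possible workaround sketch (my own illustration, not from this ticket): cast both branches of the UNION to a common decimal type so the analyzer sees matching schemas. It assumes a HiveContext built on an existing SparkContext {{sc}} and the {{test}} table created above.

{code:scala}
// Illustrative workaround, not the fix for this bug: make both UNION branches
// produce the same decimal type explicitly.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) // assumes an existing SparkContext `sc`
val unioned = hiveContext.sql(
  """SELECT CAST(a AS DECIMAL(21,1)) AS a FROM test
    |UNION ALL
    |SELECT CAST(a * 2 AS DECIMAL(21,1)) AS a FROM test""".stripMargin)
unioned.collect().foreach(println)
{code}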






[jira] [Commented] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame

2015-03-29 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386068#comment-14386068
 ] 

Liang-Chi Hsieh commented on SPARK-6586:


Even just for debugging purposes, I think it is important to provide a way to 
access the logical plan.

The mutable state this adds is limited to {{LogicalPlan}}. If mutable state 
is a problem here, I think I can refactor this into a version without mutable state.

 Add the capability of retrieving original logical plan of DataFrame
 ---

 Key: SPARK-6586
 URL: https://issues.apache.org/jira/browse/SPARK-6586
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor

 In order to fix a bug, since #5217 {{DataFrame}} uses the analyzed plan 
 instead of the logical plan. As a result, the original logical plan of a 
 {{DataFrame}} is no longer accessible, even though it can still be useful 
 and important in some use cases.
 In this PR, we introduce the capability of retrieving the original logical 
 plan of a {{DataFrame}}.
 The approach is to add an {{analyzed}} flag to {{LogicalPlan}}. Once 
 {{Analyzer}} finishes analysis, it sets {{analyzed}} of the {{LogicalPlan}} 
 to {{true}}. In {{QueryExecution}}, we keep the original logical plan inside 
 the analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to 
 recursively replace the analyzed logical plan with the original logical plan 
 and retrieve it.
 Besides enabling retrieval of the original logical plan, this modification 
 also avoids re-analyzing a plan that has already been analyzed.
  






[jira] [Updated] (SPARK-6601) Add HDFS NFS gateway module to spark-ec2

2015-03-29 Thread Florian Verhein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florian Verhein updated SPARK-6601:
---
Description: 
Add module hdfs-nfs-gateway, which sets up the gateway for (say, 
ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes.

Note: For nfs to be available outside AWS, also requires [#6600]

  was:

Add module hdfs-nfs-gateway, which sets up the gateway for (say, 
ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes.

Note: For nfs to be available outside AWS, also requires #6600


 Add HDFS NFS gateway module to spark-ec2
 

 Key: SPARK-6601
 URL: https://issues.apache.org/jira/browse/SPARK-6601
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Florian Verhein

 Add module hdfs-nfs-gateway, which sets up the gateway for (say, 
 ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes.
 Note: For nfs to be available outside AWS, also requires [#6600]






[jira] [Created] (SPARK-6601) Add HDFS NFS gateway module to spark-ec2

2015-03-29 Thread Florian Verhein (JIRA)
Florian Verhein created SPARK-6601:
--

 Summary: Add HDFS NFS gateway module to spark-ec2
 Key: SPARK-6601
 URL: https://issues.apache.org/jira/browse/SPARK-6601
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Florian Verhein



Add module hdfs-nfs-gateway, which sets up the gateway for (say, 
ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes.

Note: For nfs to be available outside AWS, also requires #6600






[jira] [Assigned] (SPARK-5203) union with different decimal type report error

2015-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5203:
---

Assignee: Apache Spark

 union with different decimal type report error
 --

 Key: SPARK-5203
 URL: https://issues.apache.org/jira/browse/SPARK-5203
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: guowei
Assignee: Apache Spark

 Test case like this:
 {code:sql}
 create table test (a decimal(10,1));
 select a from test union all select a*2 from test;
 {code}
 Exception thrown:
 {noformat}
 15/01/12 16:28:54 ERROR SparkSQLDriver: Failed in [select a from test union 
 all select a*2 from test]
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
 attributes: *, tree:
 'Project [*]
  'Subquery _u1
   'Union 
Project [a#1]
 MetastoreRelation default, test, None
Project [CAST((CAST(a#2, DecimalType()) * CAST(CAST(2, DecimalType(10,0)), 
 DecimalType())), DecimalType(21,1)) AS _c0#0]
 MetastoreRelation default, test, None
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:85)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:83)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:81)
   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
   at 
 scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
   at 
 scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
   at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:410)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:410)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:411)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:411)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:412)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:412)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421)
   at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:369)
   at 
 org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:58)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275)
   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
 {noformat}






[jira] [Created] (SPARK-6598) Python API for IDFModel

2015-03-29 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-6598:
-

 Summary: Python API for IDFModel
 Key: SPARK-6598
 URL: https://issues.apache.org/jira/browse/SPARK-6598
 Project: Spark
  Issue Type: Task
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Kai Sasaki
Priority: Minor


This is a sub-task of 
[SPARK-6254|https://issues.apache.org/jira/browse/SPARK-6254].

Wrap IDFModel {{idf}} member function for pyspark.
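For context, a minimal Scala sketch (my own, illustrative only) of the member being wrapped; {{tfVectors}} is an assumed RDD[Vector] of term-frequency vectors:

{code:scala}
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def showIdf(tfVectors: RDD[Vector]): Unit = {
  val model = new IDF().fit(tfVectors)  // train an IDFModel on term-frequency vectors
  val idfVector: Vector = model.idf     // the `idf` member this ticket wants exposed in PySpark
  println(idfVector)
}
{code}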






[jira] [Commented] (SPARK-6594) Spark Streaming can't receive data from kafka

2015-03-29 Thread q79969786 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386071#comment-14386071
 ] 

q79969786 commented on SPARK-6594:
--

1. Create the Kafka topic as follows:
$ kafka-topics.sh --create --zookeeper 
zkServer1:2181,zkServer2:2181,zkServer3:2181 --replication-factor 1 
--partitions 5 --topic ORDER
2. I use the Java API to process the data as follows:
SparkConf sparkConf = new SparkConf().setAppName("TestOrder");
sparkConf.set("spark.cleaner.ttl", "600");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new 
Duration(1000));
Map<String, Integer> topicorder = new HashMap<String, Integer>();
topicorder.put("order", 5);
JavaPairReceiverInputDStream<String, String> jPRIDSOrder = 
KafkaUtils.createStream(jssc, "zkServer1:2181,zkServer2:2181,zkServer3:2181", 
"test-consumer-group", topicorder);
jPRIDSOrder.map(new Function<Tuple2<String, String>, String>() {
@Override
public String call(Tuple2<String, String> tuple2)
{ return tuple2._2(); }
}).print();
3. Submit the application as follows:
spark-submit --class com.bigdata.TestOrder --master spark://SPKMASTER:19002 
/home/bigdata/test-spark.jar TestOrder
4. Five warnings like the following appear when the application is submitted:
15/03/29 21:23:03 WARN ZookeeperConsumerConnector: 
[test-consumer-group_work1-1427462582342-5714642d], No broker partitions 
consumed by consumer thread test-consumer-group_work1-1427462582342-5714642d-0 
for topic ORDER
..

 Spark Streaming can't receive data from kafka
 -

 Key: SPARK-6594
 URL: https://issues.apache.org/jira/browse/SPARK-6594
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.1
 Environment: kafka_2.10-0.8.1.1 + Spark-1.2.1
Reporter: q79969786

 I use KafkaUtils to receive data from Kafka in my Spark Streaming application 
 as follows:
 Map<String, Integer> topicorder = new HashMap<String, Integer>();
 topicorder.put("order", Integer.valueOf(readThread));
 JavaPairReceiverInputDStream<String, String> jPRIDSOrder = 
 KafkaUtils.createStream(jssc, zkQuorum, group, topicorder);
 It worked well at first, but after I submitted this application several times, 
 Spark Streaming can't receive data anymore (Kafka works well).






[jira] [Issue Comment Deleted] (SPARK-6594) Spark Streaming can't receive data from kafka

2015-03-29 Thread q79969786 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

q79969786 updated SPARK-6594:
-
Comment: was deleted

(was: 1. Create the Kafka topic as follows:
$ kafka-topics.sh --create --zookeeper 
zkServer1:2181,zkServer2:2181,zkServer3:2181 --replication-factor 1 
--partitions 5 --topic ORDER

2. I use the Java API to process the data as follows:
SparkConf sparkConf = new SparkConf().setAppName("TestOrder");
sparkConf.set("spark.cleaner.ttl", "600");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new 
Duration(1000));
Map<String, Integer> topicorder = new HashMap<String, Integer>();
topicorder.put("order", 5);
JavaPairReceiverInputDStream<String, String> jPRIDSOrder = 
KafkaUtils.createStream(jssc, "zkServer1:2181,zkServer2:2181,zkServer3:2181", 
"test-consumer-group", topicorder);
jPRIDSOrder.map(new Function<Tuple2<String, String>, String>() {
@Override
public String call(Tuple2<String, String> tuple2) {
return tuple2._2();
}
}).print();

3. Submit the application as follows:
spark-submit --class com.bigdata.TestOrder --master spark://SPKMASTER:19002 
/home/bigdata/test-spark.jar  TestOrder

4. Five warnings like the following appear when the application is submitted:
15/03/29 21:23:03 WARN ZookeeperConsumerConnector: 
[test-consumer-group_work1-1427462582342-5714642d], No broker partitions 
consumed by consumer thread test-consumer-group_work1-1427462582342-5714642d-0 
for topic ORDER
..
)

 Spark Streaming can't receive data from kafka
 -

 Key: SPARK-6594
 URL: https://issues.apache.org/jira/browse/SPARK-6594
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.1
 Environment: kafka_2.10-0.8.1.1 + Spark-1.2.1
Reporter: q79969786

 I use KafkaUtils to receive data from Kafka in my Spark Streaming application 
 as follows:
 Map<String, Integer> topicorder = new HashMap<String, Integer>();
 topicorder.put("order", Integer.valueOf(readThread));
 JavaPairReceiverInputDStream<String, String> jPRIDSOrder = 
 KafkaUtils.createStream(jssc, zkQuorum, group, topicorder);
 It worked well at first, but after I submitted this application several times, 
 Spark Streaming can't receive data anymore (Kafka works well).






[jira] [Commented] (SPARK-6566) Update Spark to use the latest version of Parquet libraries

2015-03-29 Thread Konstantin Shaposhnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386170#comment-14386170
 ] 

Konstantin Shaposhnikov commented on SPARK-6566:


Thank you for the update [~lian cheng]

 Update Spark to use the latest version of Parquet libraries
 ---

 Key: SPARK-6566
 URL: https://issues.apache.org/jira/browse/SPARK-6566
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Konstantin Shaposhnikov

 There are a lot of bug fixes in the latest version of parquet (1.6.0rc7). 
 E.g. PARQUET-136
 It would be good to update Spark to use the latest parquet version.
 The following changes are required:
 {code}
 diff --git a/pom.xml b/pom.xml
 index 5ad39a9..095b519 100644
 --- a/pom.xml
 +++ b/pom.xml
 @@ -132,7 +132,7 @@
  <!-- Version used for internal directory structure -->
  <hive.version.short>0.13.1</hive.version.short>
  <derby.version>10.10.1.1</derby.version>
 -<parquet.version>1.6.0rc3</parquet.version>
 +<parquet.version>1.6.0rc7</parquet.version>
  <jblas.version>1.2.3</jblas.version>
  <jetty.version>8.1.14.v20131031</jetty.version>
  <orbit.version>3.0.0.v201112011016</orbit.version>
 {code}
 and
 {code}
 --- 
 a/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
 +++ 
 b/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
 @@ -480,7 +480,7 @@ private[parquet] class FilteringParquetRowInputFormat
  globalMetaData = new GlobalMetaData(globalMetaData.getSchema,
mergedMetadata, globalMetaData.getCreatedBy)
  
 -val readContext = getReadSupport(configuration).init(
 +val readContext = 
 ParquetInputFormat.getReadSupportInstance(configuration).init(
new InitContext(configuration,
  globalMetaData.getKeyValueMetaData,
  globalMetaData.getSchema))
 {code}
 I am happy to prepare a pull request if necessary.






[jira] [Commented] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame

2015-03-29 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386167#comment-14386167
 ] 

Michael Armbrust commented on SPARK-6586:
-

I can see the utility of seeing the original plan in any given run of the
optimizer.  However, providing it for any arbitrary assembly of query
plans feels like unnecessary complexity to me.  I think it's only reasonable
to add such instrumentation when it is actually useful for solving an issue.
Doing so speculatively only leads to code complexity.  If you have a
concrete example where this information would be useful we can continue to
discuss, but otherwise this issue should be closed.

Additionally, PRs to add new features should *always* have tests.
Otherwise these features will be broken almost immediately.



 Add the capability of retrieving original logical plan of DataFrame
 ---

 Key: SPARK-6586
 URL: https://issues.apache.org/jira/browse/SPARK-6586
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor

 In order to fix a bug, since #5217 {{DataFrame}} uses the analyzed plan 
 instead of the logical plan. As a result, the original logical plan of a 
 {{DataFrame}} is no longer accessible, even though it can still be useful 
 and important in some use cases.
 In this PR, we introduce the capability of retrieving the original logical 
 plan of a {{DataFrame}}.
 The approach is to add an {{analyzed}} flag to {{LogicalPlan}}. Once 
 {{Analyzer}} finishes analysis, it sets {{analyzed}} of the {{LogicalPlan}} 
 to {{true}}. In {{QueryExecution}}, we keep the original logical plan inside 
 the analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to 
 recursively replace the analyzed logical plan with the original logical plan 
 and retrieve it.
 Besides enabling retrieval of the original logical plan, this modification 
 also avoids re-analyzing a plan that has already been analyzed.
  






[jira] [Commented] (SPARK-6261) Python MLlib API missing items: Feature

2015-03-29 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386036#comment-14386036
 ] 

Kai Sasaki commented on SPARK-6261:
---

[~josephkb] I created JIRA for IDFModel here. 
[SPARK-6598|https://issues.apache.org/jira/browse/SPARK-6598]. Thank you!

 Python MLlib API missing items: Feature
 ---

 Key: SPARK-6261
 URL: https://issues.apache.org/jira/browse/SPARK-6261
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 StandardScalerModel
 * All functionality except predict() is missing.
 IDFModel
 * idf
 Word2Vec
 * setMinCount
 Word2VecModel
 * getVectors






[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-03-29 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386049#comment-14386049
 ] 

Debasish Das commented on SPARK-5564:
-

[~josephkb] could you please point me to the datasets that are used for 
benchmarking? I have started testing loglikelihood loss for recommendation and 
since I already added the constraints, this is the right time to test it on LDA 
benchmarks as well...I will open up the code as part of 
https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears 
it...

I am looking into LDA test-cases but since I am optimizing log-likelihood 
directly, I am looking to add more testcases from your LDA JIRA...For 
recommendation, I know how to construct the testcases...

 Support sparse LDA solutions
 

 Key: SPARK-5564
 URL: https://issues.apache.org/jira/browse/SPARK-5564
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) currently requires that the priors’ 
 concentration parameters be > 1.0.  It should support values > 0.0, which 
 should encourage sparser topics (phi) and document-topic distributions 
 (theta).
 For EM, this will require adding a projection to the M-step, as in: Vorontsov 
 and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive 
 Regularization for Stochastic Matrix Factorization. 2014.
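For reference, a hedged sketch of the kind of M-step projection described above, written from the MAP-EM view of LDA rather than the exact formulation in the cited paper (so treat the form below as an assumption). Here n_wt and n_dk are the expected topic-word and document-topic counts from the E-step, and beta, alpha are the concentration parameters:

{noformat}
\phi_{wt}   \propto \max(0,\; n_{wt} + \beta  - 1)
\theta_{dk} \propto \max(0,\; n_{dk} + \alpha - 1)
{noformat}

Each distribution is renormalized after the max; with concentration parameters below 1.0 the max(0, .) clips entries to exactly zero, which is what produces sparse topics (phi) and document-topic distributions (theta).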






[jira] [Updated] (SPARK-6600) Open ports in spark-ec2.py to allow HDFS NFS gateway

2015-03-29 Thread Florian Verhein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florian Verhein updated SPARK-6600:
---
Description: 
Use case: User has set up the hadoop hdfs nfs gateway service on their 
spark-ec2.py launched cluster, and wants to mount that on their local machine. 

Requires the following ports to be opened on incoming rule set for MASTER for 
both UDP and TCP: 111, 2049, 4242.
(I have tried this and it works)

Note that this issue *does not* cover the implementation of a hdfs nfs gateway 
module in the spark-ec2 project. That should be a separate issue (TODO).  

Reference:
https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html

  was:

Use case: User has set up the hadoop hdfs nfs gateway service on their 
spark-ec2.py launched cluster, and wants to mount that on their local machine. 

Requires the following ports to be opened on incoming rule set for MASTER for 
both UDP and TCP: 111, 2049, 4242.
(I have tried this and it works)

Note that this issue *does not* cover the implementation of a hdfs nfs gateway 
module in the spark-ec2 project. That should be a separate issue (TODO).  

Reference:
https://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html


 Open ports in spark-ec2.py to allow HDFS NFS gateway  
 --

 Key: SPARK-6600
 URL: https://issues.apache.org/jira/browse/SPARK-6600
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Florian Verhein

 Use case: User has set up the hadoop hdfs nfs gateway service on their 
 spark-ec2.py launched cluster, and wants to mount that on their local 
 machine. 
 Requires the following ports to be opened on incoming rule set for MASTER for 
 both UDP and TCP: 111, 2049, 4242.
 (I have tried this and it works)
 Note that this issue *does not* cover the implementation of a hdfs nfs 
 gateway module in the spark-ec2 project. That should be a separate issue 
 (TODO).  
 Reference:
 https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html






[jira] [Updated] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway

2015-03-29 Thread Florian Verhein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florian Verhein updated SPARK-6600:
---
Summary: Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway  (was: 
Open ports in spark-ec2.py to allow HDFS NFS gateway)

 Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway  
 --

 Key: SPARK-6600
 URL: https://issues.apache.org/jira/browse/SPARK-6600
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Florian Verhein

 Use case: User has set up the hadoop hdfs nfs gateway service on their 
 spark-ec2.py launched cluster, and wants to mount that on their local 
 machine. 
 Requires the following ports to be opened on incoming rule set for MASTER for 
 both UDP and TCP: 111, 2049, 4242.
 (I have tried this and it works)
 Note that this issue *does not* cover the implementation of a hdfs nfs 
 gateway module in the spark-ec2 project. That should be a separate issue 
 (TODO).  
 Reference:
 https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html






[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions

2015-03-29 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386049#comment-14386049
 ] 

Debasish Das edited comment on SPARK-5564 at 3/30/15 12:31 AM:
---

[~josephkb] could you please point me to the datasets that are used for 
benchmarking LDA and how do they scale as we start scaling the topics? I have 
started testing loglikelihood loss for recommendation and since I already added 
the constraints, this is the right time to test it on LDA benchmarks as 
well...I will open up the code as part of 
https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears 
it...

I am looking into LDA test-cases but since I am optimizing log-likelihood 
directly, I am looking to add more testcases based on document and word 
matrix...For recommendation, I know how to construct the testcases with 
loglikelihood loss


was (Author: debasish83):
[~josephkb] could you please point me to the datasets that are used for 
benchmarking? I have started testing loglikelihood loss for recommendation and 
since I already added the constraints, this is the right time to test it on LDA 
benchmarks as well...I will open up the code as part of 
https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears 
it...

I am looking into LDA test-cases but since I am optimizing log-likelihood 
directly, I am looking to add more testcases based on document and word 
matrix...For recommendation, I know how to construct the testcases with 
loglikelihood loss

 Support sparse LDA solutions
 

 Key: SPARK-5564
 URL: https://issues.apache.org/jira/browse/SPARK-5564
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) currently requires that the priors’ 
 concentration parameters be > 1.0.  It should support values > 0.0, which 
 should encourage sparser topics (phi) and document-topic distributions 
 (theta).
 For EM, this will require adding a projection to the M-step, as in: Vorontsov 
 and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive 
 Regularization for Stochastic Matrix Factorization. 2014.






[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions

2015-03-29 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386049#comment-14386049
 ] 

Debasish Das edited comment on SPARK-5564 at 3/30/15 12:30 AM:
---

[~josephkb] could you please point me to the datasets that are used for 
benchmarking? I have started testing loglikelihood loss for recommendation and 
since I already added the constraints, this is the right time to test it on LDA 
benchmarks as well...I will open up the code as part of 
https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears 
it...

I am looking into LDA test-cases but since I am optimizing log-likelihood 
directly, I am looking to add more testcases based on document and word 
matrix...For recommendation, I know how to construct the testcases with 
loglikelihood loss


was (Author: debasish83):
[~josephkb] could you please point me to the datasets that are used for 
benchmarking? I have started testing loglikelihood loss for recommendation and 
since I already added the constraints, this is the right time to test it on LDA 
benchmarks as well...I will open up the code as part of 
https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears 
it...

I am looking into LDA test-cases but since I am optimizing log-likelihood 
directly, I am looking to add more testcases from your LDA JIRA...For 
recommendation, I know how to construct the testcases...

 Support sparse LDA solutions
 

 Key: SPARK-5564
 URL: https://issues.apache.org/jira/browse/SPARK-5564
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) currently requires that the priors’ 
 concentration parameters be > 1.0.  It should support values > 0.0, which 
 should encourage sparser topics (phi) and document-topic distributions 
 (theta).
 For EM, this will require adding a projection to the M-step, as in: Vorontsov 
 and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive 
 Regularization for Stochastic Matrix Factorization. 2014.






[jira] [Updated] (SPARK-6600) Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway

2015-03-29 Thread Florian Verhein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florian Verhein updated SPARK-6600:
---
Description: 
Use case: User has set up the hadoop hdfs nfs gateway service on their 
spark_ec2.py launched cluster, and wants to mount that on their local machine. 

Requires the following ports to be opened on incoming rule set for MASTER for 
both UDP and TCP: 111, 2049, 4242.
(I have tried this and it works)

Note that this issue *does not* cover the implementation of a hdfs nfs gateway 
module in the spark-ec2 project. See [#6601] for this.  

Reference:
https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html

  was:
Use case: User has set up the hadoop hdfs nfs gateway service on their 
spark_ec2.py launched cluster, and wants to mount that on their local machine. 

Requires the following ports to be opened on incoming rule set for MASTER for 
both UDP and TCP: 111, 2049, 4242.
(I have tried this and it works)

Note that this issue *does not* cover the implementation of a hdfs nfs gateway 
module in the spark-ec2 project. That should be a separate issue (TODO).  

Reference:
https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html


 Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway  
 --

 Key: SPARK-6600
 URL: https://issues.apache.org/jira/browse/SPARK-6600
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Florian Verhein

 Use case: User has set up the hadoop hdfs nfs gateway service on their 
 spark_ec2.py launched cluster, and wants to mount that on their local 
 machine. 
 Requires the following ports to be opened on incoming rule set for MASTER for 
 both UDP and TCP: 111, 2049, 4242.
 (I have tried this and it works)
 Note that this issue *does not* cover the implementation of a hdfs nfs 
 gateway module in the spark-ec2 project. See [#6601] for this.  
 Reference:
 https://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html






[jira] [Updated] (SPARK-6601) Add HDFS NFS gateway module to spark-ec2

2015-03-29 Thread Florian Verhein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florian Verhein updated SPARK-6601:
---
Description: 
Add module hdfs-nfs-gateway, which sets up the gateway for (say, 
ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes.

Note: For nfs to be available outside AWS, also requires #6600

  was:
Add module hdfs-nfs-gateway, which sets up the gateway for (say, 
ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes.

Note: For nfs to be available outside AWS, also requires [#6600]


 Add HDFS NFS gateway module to spark-ec2
 

 Key: SPARK-6601
 URL: https://issues.apache.org/jira/browse/SPARK-6601
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Florian Verhein

 Add module hdfs-nfs-gateway, which sets up the gateway for (say, 
 ephemeral-hdfs) as well as mounts (e.g. to /hdfs_nfs) on all nodes.
 Note: For nfs to be available outside AWS, also requires #6600






[jira] [Commented] (SPARK-6119) DataFrame.dropna support

2015-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385642#comment-14385642
 ] 

Apache Spark commented on SPARK-6119:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5248

 DataFrame.dropna support
 

 Key: SPARK-6119
 URL: https://issues.apache.org/jira/browse/SPARK-6119
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
  Labels: DataFrame

 Support dropping rows with null values (dropna). Similar to Pandas' dropna
 http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.dropna.html
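For illustration, a small Scala sketch of what such an API looks like on the DataFrame side (assuming the {{df.na.drop()}} form; the exact API in the linked PR may differ):

{code:scala}
import org.apache.spark.sql.SQLContext

// Build a tiny DataFrame with a null and drop the incomplete rows.
def dropNullRows(sqlContext: SQLContext): Unit = {
  val df = sqlContext
    .createDataFrame(Seq((1, "a"), (2, null), (3, "c")))
    .toDF("id", "name")
  df.na.drop().show()            // drop rows containing any null value
  df.na.drop(Seq("name")).show() // drop rows where the "name" column is null
}
{code}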






[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-03-29 Thread Kannan Rajah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385648#comment-14385648
 ] 

Kannan Rajah commented on SPARK-1529:
-

I have pushed the first round of commits to my repo. I would like to get some 
early feedback on the overall design.
https://github.com/rkannan82/spark/commits/dfs_shuffle

Commits:
https://github.com/rkannan82/spark/commit/ce8b430512b31e932ffdab6e0a2c1a6a1768ffbf
https://github.com/rkannan82/spark/commit/8f5415c248c0a9ca5ad3ec9f48f839b24c259813
https://github.com/rkannan82/spark/commit/d9d179ba6c685cc8eb181f442e9bd6ad91cc4290

 Support setting spark.local.dirs to a hadoop FileSystem 
 

 Key: SPARK-1529
 URL: https://issues.apache.org/jira/browse/SPARK-1529
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Kannan Rajah
 Attachments: Spark Shuffle using HDFS.pdf


 In some environments, like with MapR, local volumes are accessed through the 
 Hadoop filesystem interface. We should allow setting spark.local.dir to a 
 Hadoop filesystem location. 
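A hypothetical usage sketch of what the proposal would enable; the property name is real, but pointing it at a DFS URI is exactly the feature being requested here and is not supported by stock Spark, so the value below is illustrative only:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("dfs-shuffle-example")
  .set("spark.local.dir", "maprfs:///user/spark/local") // hypothetical DFS location for local/shuffle files
val sc = new SparkContext(conf)
{code}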






[jira] [Commented] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.

2015-03-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385706#comment-14385706
 ] 

Sean Owen commented on SPARK-6593:
--

At this level though, what's a bad split? a line of text that doesn't parse as 
expected? that's application-level logic. Given how little the framework knows, 
this would amount to ignoring a partition if there was any error in computing 
it, which seems too coarse to encourage people to use. You can of course handle 
this in the application logic -- catch the error, return nothing, log it, add 
to a counter, etc.
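For illustration, a minimal sketch (mine, not from this thread) of the application-level handling described above: skip lines that fail to parse and count them with an accumulator instead of failing the job.

{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def parsedLengths(sc: SparkContext, path: String): (RDD[Long], org.apache.spark.Accumulator[Long]) = {
  val badLines = sc.accumulator(0L, "bad lines")
  val parsed = sc.textFile(path).flatMap { line =>
    try {
      Some(line.split(",")(1).toLong)  // example parse: second CSV field as a Long
    } catch {
      case _: Exception =>
        badLines += 1L                 // count and skip instead of failing the task
        None
    }
  }
  (parsed, badLines)
}
{code}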

 Provide option for HadoopRDD to skip bad data splits.
 -

 Key: SPARK-6593
 URL: https://issues.apache.org/jira/browse/SPARK-6593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Dale Richardson
Priority: Minor

 When reading a large number of files from HDFS, e.g. with 
 sc.textFile("hdfs:///user/cloudera/logs*.gz"), if a single split is corrupted 
 then the entire job is canceled. As default behaviour this is probably for 
 the best, but in some circumstances where you know it will be OK, it would be 
 nice to have the option to skip the corrupted portion and continue the job. 






[jira] [Resolved] (SPARK-4123) Show dependency changes in pull requests

2015-03-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4123.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5093
[https://github.com/apache/spark/pull/5093]

 Show dependency changes in pull requests
 

 Key: SPARK-4123
 URL: https://issues.apache.org/jira/browse/SPARK-4123
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: Brennon York
Priority: Critical
 Fix For: 1.4.0


 We should inspect the classpath of Spark's assembly jar for every pull 
 request. This only takes a few seconds in Maven and it will help weed out 
 dependency changes from the master branch. Ideally we'd post any dependency 
 changes in the pull request message.
 {code}
 $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v 
 INFO | tr ":" "\n" | awk -F/ '{print $NF}' | sort > my-classpath
 $ git checkout apache/master
 $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v 
 INFO | tr ":" "\n" | awk -F/ '{print $NF}' | sort > master-classpath
 $ diff my-classpath master-classpath
 < chill-java-0.3.6.jar
 < chill_2.10-0.3.6.jar
 ---
 > chill-java-0.5.0.jar
 > chill_2.10-0.5.0.jar
 {code}






[jira] [Resolved] (SPARK-6406) Launcher backward compatibility issues

2015-03-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6406.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5085
[https://github.com/apache/spark/pull/5085]

 Launcher backward compatibility issues
 --

 Key: SPARK-6406
 URL: https://issues.apache.org/jira/browse/SPARK-6406
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Reporter: Nishkam Ravi
Priority: Minor
 Fix For: 1.4.0


 The new launcher library breaks backward compatibility. The "hadoop" string in 
 the spark assembly should not be mandatory.






[jira] [Updated] (SPARK-6406) Launcher backward compatibility issues

2015-03-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6406:
-
Assignee: Nishkam Ravi

 Launcher backward compatibility issues
 --

 Key: SPARK-6406
 URL: https://issues.apache.org/jira/browse/SPARK-6406
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Reporter: Nishkam Ravi
Assignee: Nishkam Ravi
Priority: Minor
 Fix For: 1.4.0


 The new launcher library breaks backward compatibility. The "hadoop" string in 
 the spark assembly should not be mandatory.






[jira] [Updated] (SPARK-6593) Provide option for HadoopRDD to skip corrupted files

2015-03-29 Thread Dale Richardson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dale Richardson updated SPARK-6593:
---
Description: 
When reading a large number of gzip files from HDFS, e.g. with 
sc.textFile("hdfs:///user/cloudera/logs*.gz"), if the Hadoop input libraries 
report an exception then the entire job is canceled. As default behaviour this 
is probably for the best, but in some circumstances where you know it will be 
OK, it would be nice to have the option to skip the corrupted file and continue 
the job. 


  was:
When reading a large number of files from HDFS, e.g. with 
sc.textFile("hdfs:///user/cloudera/logs*.gz"), if the Hadoop input libraries 
report an exception then the entire job is canceled. As default behaviour this 
is probably for the best, but in some circumstances where you know it will be 
OK, it would be nice to have the option to skip the corrupted file and continue 
the job. 



 Provide option for HadoopRDD to skip corrupted files
 

 Key: SPARK-6593
 URL: https://issues.apache.org/jira/browse/SPARK-6593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Dale Richardson
Priority: Minor

 When reading a large number of gzip files from HDFS, e.g. with 
 sc.textFile("hdfs:///user/cloudera/logs*.gz"), if the Hadoop input libraries 
 report an exception then the entire job is canceled. As default behaviour 
 this is probably for the best, but in some circumstances where you know it 
 will be OK, it would be nice to have the option to skip the corrupted file 
 and continue the job. 






[jira] [Resolved] (SPARK-5124) Standardize internal RPC interface

2015-03-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-5124.

   Resolution: Fixed
Fix Version/s: 1.4.0

 Standardize internal RPC interface
 --

 Key: SPARK-5124
 URL: https://issues.apache.org/jira/browse/SPARK-5124
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Shixiong Zhu
 Fix For: 1.4.0

 Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf


 In Spark we use Akka as the RPC layer. It would be great if we can 
 standardize the internal RPC interface to facilitate testing. This will also 
 provide the foundation to try other RPC implementations in the future.






[jira] [Created] (SPARK-6594) Spark Streaming can't receive data from kafka

2015-03-29 Thread q79969786 (JIRA)
q79969786 created SPARK-6594:


 Summary: Spark Streaming can't receive data from kafka
 Key: SPARK-6594
 URL: https://issues.apache.org/jira/browse/SPARK-6594
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.1
 Environment: kafka_2.10-0.8.1.1 + Spark-1.2.1
Reporter: q79969786


I use KafkaUtils to receive data from Kafka in my Spark Streaming application 
as follows:
Map<String, Integer> topicorder = new HashMap<String, Integer>();
topicorder.put("order", Integer.valueOf(readThread));
JavaPairReceiverInputDStream<String, String> jPRIDSOrder = 
KafkaUtils.createStream(jssc, zkQuorum, group, topicorder);


It worked well at first, but after I submitted this application several times, 
Spark Streaming can't receive data anymore (Kafka works well).






[jira] [Commented] (SPARK-6594) Spark Streaming can't receive data from kafka

2015-03-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385743#comment-14385743
 ] 

Sean Owen commented on SPARK-6594:
--

There's no useful detail here. Can you elaborate on what exactly you are running 
and what you observe? What is the input, what is the state of the topics, and what 
information leads you to believe streaming is not reading?

It's certainly not true that they don't work in general. I am successfully 
using this exact combination now and have had no problems.

 Spark Streaming can't receive data from kafka
 -

 Key: SPARK-6594
 URL: https://issues.apache.org/jira/browse/SPARK-6594
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.1
 Environment: kafka_2.10-0.8.1.1 + Spark-1.2.1
Reporter: q79969786

 I use KafkaUtils to receive data from Kafka in my Spark Streaming application 
 as follows:
 Map<String, Integer> topicorder = new HashMap<String, Integer>();
 topicorder.put("order", Integer.valueOf(readThread));
 JavaPairReceiverInputDStream<String, String> jPRIDSOrder = 
 KafkaUtils.createStream(jssc, zkQuorum, group, topicorder);
 It worked well at first, but after I submitted this application several times, 
 Spark Streaming can't receive data anymore (Kafka works well).






[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-03-29 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385890#comment-14385890
 ] 

Steve Loughran commented on SPARK-1537:
---

# I've just tried to see where YARN-2444 stands; I can't replicate it in trunk 
but I've submitted the tests to verify that it isn't there.
# for YARN-2423 Spark seems kind of trapped. It needs an API tagged as 
public/stable; Robert's patch has the API, except it's being rejected on the 
basis that ATSv2 will break it, so it can't be tagged as stable. That means there's 
no API for GET operations until some undefined time {{t1 > now()}}, and then 
only for Hadoop versions that have it, which implies it won't get picked up by 
Spark for a long time.

I think we need to talk to the YARN dev team and see what can be done here. 
Even if there's no API client bundled into YARN, unless the v1 API and its 
paths beginning {{/ws/v1/timeline/}} are going to go away, then a REST client 
is possible; it may just have to be done spark-side, where at least it can be 
made resilient to hadoop versions. 
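
As a rough sketch of what such a Spark-side REST client could look like against 
the v1 path (host, port and entity type below are placeholders, and only the 
JDK is used):

{code}
import java.net.URL
import scala.io.Source

// Plain GET against the ATS v1 REST path mentioned above; no YARN client library needed.
val atsBase = "http://timelineserver.example.com:8188"          // placeholder host:port
val url = new URL(s"$atsBase/ws/v1/timeline/SPARK_APPLICATION") // entity type is a placeholder
val json = Source.fromInputStream(url.openConnection().getInputStream).mkString
println(json)
{code}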


 Add integration with Yarn's Application Timeline Server
 ---

 Key: SPARK-1537
 URL: https://issues.apache.org/jira/browse/SPARK-1537
 Project: Spark
  Issue Type: New Feature
  Components: YARN
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
 Attachments: SPARK-1537.txt, spark-1573.patch


 It would be nice to have Spark integrate with Yarn's Application Timeline 
 Server (see YARN-321, YARN-1530). This would allow users running Spark on 
 Yarn to have a single place to go for all their history needs, and avoid 
 having to manage a separate service (Spark's built-in server).
 At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
 although there is still some ongoing work. But the basics are there, and I 
 wouldn't expect them to change (much) at this point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame

2015-03-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6586.
-
Resolution: Not a Problem

You can already get the original plan from a DataFrame: 
{{df.queryExecution.logical}}.
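
For example (assuming an existing DataFrame {{df}}):

{code}
val analyzedPlan = df.queryExecution.analyzed // plan after the Analyzer has run
val originalPlan = df.queryExecution.logical  // plan as originally constructed
println(originalPlan)
{code}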

 Add the capability of retrieving original logical plan of DataFrame
 ---

 Key: SPARK-6586
 URL: https://issues.apache.org/jira/browse/SPARK-6586
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor

 In order to solve a bug, since #5217, {{DataFrame}} now uses analyzed plan 
 instead of logical plan. However, by doing that we can't know the logical 
 plan of a {{DataFrame}}. But it might be still useful and important to 
 retrieve the original logical plan in some use cases.
 In this pr, we introduce the capability of retrieving original logical plan 
 of {{DataFrame}}.
 The approach is that we add an {{analyzed}} variable to {{LogicalPlan}}. Once 
 {{Analyzer}} finishes analysis, it sets {{analyzed}} of {{LogicalPlan}} as 
 {{true}}.  In {{QueryExecution}}, we keep the original logical plan in the 
 analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to 
 recursively replace the analyzed logical plan with original logical plan and 
 retrieve it.
 Besides the capability of retrieving original logical plan, this modification 
 also can avoid do plan analysis if it is already analyzed.
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6548) Adding stddev to DataFrame functions

2015-03-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6548:

Target Version/s: 1.4.0

 Adding stddev to DataFrame functions
 

 Key: SPARK-6548
 URL: https://issues.apache.org/jira/browse/SPARK-6548
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
  Labels: DataFrame, starter

 Add it to the list of aggregate functions:
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
 Also add it to 
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala
 We can either add a Stddev Catalyst expression, or just compute it using 
 existing functions like here: 
 https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776
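
A hedged sketch of the second option, computing a population standard deviation 
from the aggregate functions that already exist (assuming a DataFrame with a 
numeric column; this is not the proposed Catalyst expression):

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// stddev(x) = sqrt(E[x^2] - E[x]^2), built only from avg() and column arithmetic.
def stddevOf(df: DataFrame, colName: String): Double = {
  val row = df.agg(avg(col(colName) * col(colName)), avg(col(colName))).first()
  val (avgX2, avgX) = (row.getDouble(0), row.getDouble(1))
  math.sqrt(avgX2 - avgX * avgX)
}
{code}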



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6592) API of Row trait should be presented in Scala doc

2015-03-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6592:

Target Version/s: 1.4.0

 API of Row trait should be presented in Scala doc
 -

 Key: SPARK-6592
 URL: https://issues.apache.org/jira/browse/SPARK-6592
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Nan Zhu
Priority: Critical

 Currently, the API of the Row trait is not presented in Scaladoc, though we 
 have many occasions to use it. 
 The reason is that we ignore all files under the catalyst directory in 
 SparkBuild.scala when generating Scaladoc 
 (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369).
 What's the best approach to fix this? [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6592) API of Row trait should be presented in Scala doc

2015-03-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6592:

Priority: Critical  (was: Major)

 API of Row trait should be presented in Scala doc
 -

 Key: SPARK-6592
 URL: https://issues.apache.org/jira/browse/SPARK-6592
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Nan Zhu
Priority: Critical

 Currently, the API of Row class is not presented in Scaladoc, though we have 
 many chances to use it 
 the reason is that we ignore all files under catalyst directly in 
 SparkBuild.scala when generating Scaladoc, 
 (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369)
 What's the best approach to fix this? [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message

2015-03-29 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6587:
--
Description: 
(Don't know if this is a functionality bug, error reporting bug or an RFE ...)

I define the following hierarchy:

{code}
private abstract class MyHolder
private case class StringHolder(s: String) extends MyHolder
private case class IntHolder(i: Int) extends MyHolder
private case class BooleanHolder(b: Boolean) extends MyHolder
{code}

and a top level case class:

{code}
private case class Thing(key: Integer, foo: MyHolder)
{code}

When I try to convert it:

{code}
val things = Seq(
  Thing(1, IntHolder(42)),
  Thing(2, StringHolder("hello")),
  Thing(3, BooleanHolder(false))
)
val thingsDF = sc.parallelize(things, 4).toDF()

thingsDF.registerTempTable("things")

val all = sqlContext.sql("SELECT * from things")
{code}

I get the following stack trace:

{noformat}
Exception in thread main scala.MatchError: 
sql.CaseClassSchemaProblem.MyHolder (of class 
scala.reflect.internal.Types$ClassNoArgsTypeRef)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157)
at scala.collection.immutable.List.map(List.scala:276)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312)
at 
org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250)
at sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35)
at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
{noformat}

I wrote this to answer [a question on 
StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql]
 which uses a much simpler approach and suffers the same problem.

Looking at what seems to me to be the [relevant unit test 
suite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/ScalaReflectionRelationSuite.scala]
 I see that this case is not covered.  

  was:
(Don't know if this is a functionality bug, error reporting bug or an RFE ...)

I define the following hierarchy:

{code}
private abstract class MyHolder
private case class StringHolder(s: String) extends MyHolder
private case class IntHolder(i: Int) extends MyHolder
private case class BooleanHolder(b: Boolean) extends MyHolder
{code}

and a top level case class:

{code}
private case class Thing(key: Integer, foo: MyHolder)
{code}

When I try to convert it:

{code}
val things = Seq(
  Thing(1, IntHolder(42)),
  Thing(2, StringHolder("hello")),
  Thing(3, BooleanHolder(false))
)
val thingsDF = sc.parallelize(things, 4).toDF()

thingsDF.registerTempTable("things")

val all = sqlContext.sql("SELECT * from things")
{code}

I get the following stack trace:

{quote}
Exception in thread main scala.MatchError: 
sql.CaseClassSchemaProblem.MyHolder (of class 
scala.reflect.internal.Types$ClassNoArgsTypeRef)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157)
at scala.collection.immutable.List.map(List.scala:276)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107)
at 

[jira] [Commented] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message

2015-03-29 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385695#comment-14385695
 ] 

Cheng Lian commented on SPARK-6587:
---

This behavior is expected. There are two problems in your case:

# Because {{things}} contains instances of all three case classes, the type of 
{{things}} is {{Seq[MyHolder]}}. Since {{MyHolder}} doesn't extend {{Product}}, 
it can't be recognized by {{ScalaReflection}}.

# You can only use a single concrete case class {{T}} when converting an 
{{RDD[T]}} or {{Seq[T]}} to a DataFrame. For {{things}}, we can't figure out 
what data type the {{foo}} field in the reflected schema should have.
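
A hedged workaround sketch for the reporter's use case: since reflection-based 
schema inference needs one concrete case class, the variants can be encoded 
explicitly instead of through the {{MyHolder}} hierarchy (names below are 
illustrative; assumes the usual {{import sqlContext.implicits._}} for 
{{toDF()}}):

{code}
// One concrete case class with optional fields, one per variant of the original hierarchy.
case class ThingRow(key: Int, kind: String, s: Option[String], i: Option[Int], b: Option[Boolean])

val rows = Seq(
  ThingRow(1, "int", None, Some(42), None),
  ThingRow(2, "string", Some("hello"), None, None),
  ThingRow(3, "boolean", None, None, Some(false)))

val thingsDF = sc.parallelize(rows, 4).toDF()  // schema is now inferable
thingsDF.registerTempTable("things")
{code}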

 Inferring schema for case class hierarchy fails with mysterious message
 ---

 Key: SPARK-6587
 URL: https://issues.apache.org/jira/browse/SPARK-6587
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: At least Windows 8, Scala 2.11.2.  
Reporter: Spiro Michaylov

 (Don't know if this is a functionality bug, error reporting bug or an RFE ...)
 I define the following hierarchy:
 {code}
 private abstract class MyHolder
 private case class StringHolder(s: String) extends MyHolder
 private case class IntHolder(i: Int) extends MyHolder
 private case class BooleanHolder(b: Boolean) extends MyHolder
 {code}
 and a top level case class:
 {code}
 private case class Thing(key: Integer, foo: MyHolder)
 {code}
 When I try to convert it:
 {code}
 val things = Seq(
   Thing(1, IntHolder(42)),
   Thing(2, StringHolder("hello")),
   Thing(3, BooleanHolder(false))
 )
 val thingsDF = sc.parallelize(things, 4).toDF()
 thingsDF.registerTempTable("things")
 val all = sqlContext.sql("SELECT * from things")
 {code}
 I get the following stack trace:
 {noformat}
 Exception in thread main scala.MatchError: 
 sql.CaseClassSchemaProblem.MyHolder (of class 
 scala.reflect.internal.Types$ClassNoArgsTypeRef)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157)
   at scala.collection.immutable.List.map(List.scala:276)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
   at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312)
   at 
 org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250)
   at sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35)
   at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
 {noformat}
 I wrote this to answer [a question on 
 StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql]
  which uses a much simpler approach and suffers the same problem.
 Looking at what seems to me to be the [relevant unit test 
 suite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/ScalaReflectionRelationSuite.scala]
  I see that this case is not covered.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message

2015-03-29 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385695#comment-14385695
 ] 

Cheng Lian edited comment on SPARK-6587 at 3/29/15 10:32 AM:
-

This behavior is expected. There are two problems in your case:
# Because {{things}} contains instances of all three case classes, the type of 
{{things}} is {{Seq[MyHolder]}}. Since {{MyHolder}} doesn't extend {{Product}}, 
it can't be recognized by {{ScalaReflection}}.
# You can only use a single concrete case class {{T}} when converting an 
{{RDD[T]}} or {{Seq[T]}} to a DataFrame. For {{things}}, we can't figure out 
what data type the {{foo}} field in the reflected schema should have.


was (Author: lian cheng):
This behavior is expected. There are two problems in your case:

# Because {{things}} contains instances of all three case classes, the type of 
{{things}} is {{Seq[MyHolder]}}. Since {{MyHolder}} doesn't extend {{Product}}, 
can't be recognized by {{ScalaReflection}}.

# You can only use a single concrete case class {{T}} when converting 
{{RDD[T]}} or {{Seq[T]}} to a DataFrame. For {{things}}, we can't figure out 
what data type should the {{foo}} field in the reflected schema have.

 Inferring schema for case class hierarchy fails with mysterious message
 ---

 Key: SPARK-6587
 URL: https://issues.apache.org/jira/browse/SPARK-6587
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: At least Windows 8, Scala 2.11.2.  
Reporter: Spiro Michaylov

 (Don't know if this is a functionality bug, error reporting bug or an RFE ...)
 I define the following hierarchy:
 {code}
 private abstract class MyHolder
 private case class StringHolder(s: String) extends MyHolder
 private case class IntHolder(i: Int) extends MyHolder
 private case class BooleanHolder(b: Boolean) extends MyHolder
 {code}
 and a top level case class:
 {code}
 private case class Thing(key: Integer, foo: MyHolder)
 {code}
 When I try to convert it:
 {code}
 val things = Seq(
   Thing(1, IntHolder(42)),
   Thing(2, StringHolder("hello")),
   Thing(3, BooleanHolder(false))
 )
 val thingsDF = sc.parallelize(things, 4).toDF()
 thingsDF.registerTempTable("things")
 val all = sqlContext.sql("SELECT * from things")
 {code}
 I get the following stack trace:
 {noformat}
 Exception in thread main scala.MatchError: 
 sql.CaseClassSchemaProblem.MyHolder (of class 
 scala.reflect.internal.Types$ClassNoArgsTypeRef)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157)
   at scala.collection.immutable.List.map(List.scala:276)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
   at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312)
   at 
 org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250)
   at sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35)
   at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
 {noformat}
 I wrote this to answer [a question on 
 StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql]
  which uses a much simpler approach and suffers the same problem.
 Looking at what seems to me to be the [relevant unit test 
 suite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/ScalaReflectionRelationSuite.scala]
  I see that this case is not covered.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org

[jira] [Created] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.

2015-03-29 Thread Dale Richardson (JIRA)
Dale Richardson created SPARK-6593:
--

 Summary: Provide option for HadoopRDD to skip bad data splits.
 Key: SPARK-6593
 URL: https://issues.apache.org/jira/browse/SPARK-6593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Dale Richardson
Priority: Minor


When reading a large number of files from HDFS, e.g. with 
sc.textFile("hdfs:///user/cloudera/logs*.gz"), a single corrupted split cancels 
the entire job. As default behaviour this is probably for the best, but in some 
circumstances where you know it will be OK, it would be nice to have the option 
to skip the corrupted portion and continue the job. 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6558) Utils.getCurrentUserName returns the full principal name instead of login name

2015-03-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6558.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5229
[https://github.com/apache/spark/pull/5229]

 Utils.getCurrentUserName returns the full principal name instead of login name
 --

 Key: SPARK-6558
 URL: https://issues.apache.org/jira/browse/SPARK-6558
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Thomas Graves
Assignee: Thomas Graves
Priority: Critical
 Fix For: 1.4.0


 Utils.getCurrentUserName returns 
 UserGroupInformation.getCurrentUser().getUserName() when SPARK_USER isn't 
 set.  It should return 
 UserGroupInformation.getCurrentUser().getShortUserName().
 getUserName() returns the user's full principal name (i.e. us...@corp.com), 
 while getShortUserName() returns just the user's login name (user1).
 This just happens to work on YARN because the Client code sets:
 env("SPARK_USER") = 
 UserGroupInformation.getCurrentUser().getShortUserName()
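
A minimal sketch of the suggested behaviour (assuming the standard Hadoop 
UserGroupInformation API):

{code}
import org.apache.hadoop.security.UserGroupInformation

// Prefer SPARK_USER when set, otherwise fall back to the short login name
// instead of the full Kerberos principal.
def currentUserName(): String =
  Option(System.getenv("SPARK_USER"))
    .getOrElse(UserGroupInformation.getCurrentUser.getShortUserName)
{code}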



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6585) FileServerSuite.test (HttpFileServer should not work with SSL when the server is untrusted) failed in some environments

2015-03-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6585.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5239
[https://github.com/apache/spark/pull/5239]

 FileServerSuite.test (HttpFileServer should not work with SSL when the 
 server is untrusted) failed in some environments
 -

 Key: SPARK-6585
 URL: https://issues.apache.org/jira/browse/SPARK-6585
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.3.0
Reporter: June
Priority: Minor
 Fix For: 1.4.0


 On my test machine, the FileServerSuite test case (HttpFileServer should not 
 work with SSL when the server is untrusted) throws SSLException rather than 
 SSLHandshakeException; I suggest catching SSLException to improve the test 
 case's robustness.
 [info] - HttpFileServer should not work with SSL when the server is untrusted 
 *** FAILED *** (69 milliseconds)
 [info]   Expected exception javax.net.ssl.SSLHandshakeException to be thrown, 
 but javax.net.ssl.SSLException was thrown. (FileServerSuite.scala:231)
 [info]   org.scalatest.exceptions.TestFailedException:
 [info]   at 
 org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496)
 [info]   at 
 org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
 [info]   at org.scalatest.Assertions$class.intercept(Assertions.scala:1004)
 [info]   at org.scalatest.FunSuite.intercept(FunSuite.scala:1555)
 [info]   at 
 org.apache.spark.FileServerSuite$$anonfun$15.apply$mcV$sp(FileServerSuite.scala:231)
 [info]   at 
 org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224)
 [info]   at 
 org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224)
 [info]   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
 [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
 [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
 [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
 [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
 [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
 [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
 [info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
 [info]   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
 [info]   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
 [info]   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
 [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
 [info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
 [info]   at 
 org.apache.spark.FileServerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(FileServerSuite.scala:34)
 [info]   at 
 org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
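
A hedged sketch of the suggested relaxation: since SSLHandshakeException is a 
subclass of SSLException, intercepting the broader type accepts either failure 
mode (the test body below is a stand-in for the real file transfer against the 
untrusted server):

{code}
import javax.net.ssl.{SSLException, SSLHandshakeException}
import org.scalatest.FunSuite

class UntrustedServerExampleSuite extends FunSuite {
  test("HttpFileServer should not work with SSL when the server is untrusted") {
    intercept[SSLException] {                             // accepts SSLException or any subclass
      throw new SSLHandshakeException("handshake failed") // placeholder for the real request
    }
  }
}
{code}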



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6593) Provide option for HadoopRDD to skip corrupted files

2015-03-29 Thread Dale Richardson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dale Richardson updated SPARK-6593:
---
Summary: Provide option for HadoopRDD to skip corrupted files  (was: 
Provide option for HadoopRDD to skip bad data splits.)

 Provide option for HadoopRDD to skip corrupted files
 

 Key: SPARK-6593
 URL: https://issues.apache.org/jira/browse/SPARK-6593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Dale Richardson
Priority: Minor

 When reading a large amount of files from HDFS eg. with  
 sc.textFile(hdfs:///user/cloudera/logs*.gz). If the hadoop input libraries 
 report an exception then the entire job is canceled. As default behaviour 
 this is probably for the best, but it would be nice in some circumstances 
 where you know it will be ok to have the option to skip the corrupted portion 
 and continue the job. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6593) Provide option for HadoopRDD to skip corrupted files

2015-03-29 Thread Dale Richardson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385723#comment-14385723
 ] 

Dale Richardson commented on SPARK-6593:


Changed the title and description to focus more closely on my particular use 
case, which is corrupted gzip files.

 Provide option for HadoopRDD to skip corrupted files
 

 Key: SPARK-6593
 URL: https://issues.apache.org/jira/browse/SPARK-6593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Dale Richardson
Priority: Minor

 When reading a large amount of files from HDFS eg. with  
 sc.textFile(hdfs:///user/cloudera/logs*.gz). If the hadoop input libraries 
 report an exception then the entire job is canceled. As default behaviour 
 this is probably for the best, but it would be nice in some circumstances 
 where you know it will be ok to have the option to skip the corrupted portion 
 and continue the job. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6585) FileServerSuite.test (HttpFileServer should not work with SSL when the server is untrusted) failed in some environments

2015-03-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6585:
-
Assignee: June

 FileServerSuite.test (HttpFileServer should not work with SSL when the 
 server is untrusted) failed in some environments
 -

 Key: SPARK-6585
 URL: https://issues.apache.org/jira/browse/SPARK-6585
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.3.0
Reporter: June
Assignee: June
Priority: Minor
 Fix For: 1.4.0


 On my test machine, the FileServerSuite test case (HttpFileServer should not 
 work with SSL when the server is untrusted) throws SSLException rather than 
 SSLHandshakeException; I suggest catching SSLException to improve the test 
 case's robustness.
 [info] - HttpFileServer should not work with SSL when the server is untrusted 
 *** FAILED *** (69 milliseconds)
 [info]   Expected exception javax.net.ssl.SSLHandshakeException to be thrown, 
 but javax.net.ssl.SSLException was thrown. (FileServerSuite.scala:231)
 [info]   org.scalatest.exceptions.TestFailedException:
 [info]   at 
 org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496)
 [info]   at 
 org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
 [info]   at org.scalatest.Assertions$class.intercept(Assertions.scala:1004)
 [info]   at org.scalatest.FunSuite.intercept(FunSuite.scala:1555)
 [info]   at 
 org.apache.spark.FileServerSuite$$anonfun$15.apply$mcV$sp(FileServerSuite.scala:231)
 [info]   at 
 org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224)
 [info]   at 
 org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224)
 [info]   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
 [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
 [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
 [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
 [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
 [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
 [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
 [info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
 [info]   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
 [info]   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
 [info]   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
 [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
 [info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
 [info]   at 
 org.apache.spark.FileServerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(FileServerSuite.scala:34)
 [info]   at 
 org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6593) Provide option for HadoopRDD to skip corrupted files

2015-03-29 Thread Dale Richardson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dale Richardson updated SPARK-6593:
---
Description: 
When reading a large number of files from HDFS, e.g. with 
sc.textFile("hdfs:///user/cloudera/logs*.gz"), the entire job is canceled if the 
Hadoop input libraries report an exception. As default behaviour this is 
probably for the best, but in some circumstances where you know it will be OK, 
it would be nice to have the option to skip the corrupted file and continue 
the job. 


  was:
When reading a large amount of files from HDFS eg. with  
sc.textFile(hdfs:///user/cloudera/logs*.gz). If the hadoop input libraries 
report an exception then the entire job is canceled. As default behaviour this 
is probably for the best, but it would be nice in some circumstances where you 
know it will be ok to have the option to skip the corrupted portion and 
continue the job. 



 Provide option for HadoopRDD to skip corrupted files
 

 Key: SPARK-6593
 URL: https://issues.apache.org/jira/browse/SPARK-6593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Dale Richardson
Priority: Minor

 When reading a large amount of files from HDFS eg. with  
 sc.textFile(hdfs:///user/cloudera/logs*.gz). If the hadoop input libraries 
 report an exception then the entire job is canceled. As default behaviour 
 this is probably for the best, but it would be nice in some circumstances 
 where you know it will be ok to have the option to skip the corrupted file 
 and continue the job. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6580) Optimize LogisticRegressionModel.predictPoint

2015-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6580:
---

Assignee: (was: Apache Spark)

 Optimize LogisticRegressionModel.predictPoint
 -

 Key: SPARK-6580
 URL: https://issues.apache.org/jira/browse/SPARK-6580
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor

 LogisticRegressionModel.predictPoint could be optimized somewhat.  There are 
 several checks which could be moved outside loops, or even out of 
 predictPoint into initialization of the model.
 Some include:
 {code}
 require(numFeatures == weightMatrix.size)
 val dataWithBiasSize = weightMatrix.size / (numClasses - 1)
 val weightsArray = weightMatrix match { ...
 if (dataMatrix.size + 1 == dataWithBiasSize) {...
 {code}
 Also, for multiclass, the 2 loops (over numClasses and margins) could be 
 combined into 1 loop.
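
A hedged, simplified sketch of the two suggestions (plain arrays, no intercepts; 
not the actual MLlib code): the size check is hoisted to construction time, and 
the per-class margins are computed and arg-maxed in a single pass.

{code}
// Class 0 is the pivot with margin 0; weights hold (numClasses - 1) rows of numFeatures values.
class SimpleMulticlassLR(weights: Array[Double], numClasses: Int, numFeatures: Int) {
  require(weights.length == (numClasses - 1) * numFeatures)  // checked once, not per prediction

  def predictPoint(features: Array[Double]): Double = {
    var bestClass = 0
    var maxMargin = 0.0
    var c = 0
    while (c < numClasses - 1) {        // single combined loop over classes and margins
      var margin = 0.0
      var j = 0
      while (j < numFeatures) {
        margin += weights(c * numFeatures + j) * features(j)
        j += 1
      }
      if (margin > maxMargin) { maxMargin = margin; bestClass = c + 1 }
      c += 1
    }
    bestClass.toDouble
  }
}
{code}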



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6580) Optimize LogisticRegressionModel.predictPoint

2015-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385664#comment-14385664
 ] 

Apache Spark commented on SPARK-6580:
-

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/5249

 Optimize LogisticRegressionModel.predictPoint
 -

 Key: SPARK-6580
 URL: https://issues.apache.org/jira/browse/SPARK-6580
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor

 LogisticRegressionModel.predictPoint could be optimized some.  There are 
 several checks which could be moved outside loops or even outside 
 predictPoint to initialization of the model.
 Some include:
 {code}
 require(numFeatures == weightMatrix.size)
 val dataWithBiasSize = weightMatrix.size / (numClasses - 1)
 val weightsArray = weightMatrix match { ...
 if (dataMatrix.size + 1 == dataWithBiasSize) {...
 {code}
 Also, for multiclass, the 2 loops (over numClasses and margins) could be 
 combined into 1 loop.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.

2015-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385715#comment-14385715
 ] 

Apache Spark commented on SPARK-6593:
-

User 'tigerquoll' has created a pull request for this issue:
https://github.com/apache/spark/pull/5250

 Provide option for HadoopRDD to skip bad data splits.
 -

 Key: SPARK-6593
 URL: https://issues.apache.org/jira/browse/SPARK-6593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Dale Richardson
Priority: Minor

 When reading a large amount of files from HDFS eg. with  
 sc.textFile(hdfs:///user/cloudera/logs*.gz). If a single split is corrupted 
 then the entire job is canceled. As default behaviour this is probably for 
 the best, but it would be nice in some circumstances where you know it will 
 be ok to have the option to skip the corrupted portion and continue the job. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.

2015-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6593:
---

Assignee: (was: Apache Spark)

 Provide option for HadoopRDD to skip bad data splits.
 -

 Key: SPARK-6593
 URL: https://issues.apache.org/jira/browse/SPARK-6593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Dale Richardson
Priority: Minor

 When reading a large amount of files from HDFS eg. with  
 sc.textFile(hdfs:///user/cloudera/logs*.gz). If a single split is corrupted 
 then the entire job is canceled. As default behaviour this is probably for 
 the best, but it would be nice in some circumstances where you know it will 
 be ok to have the option to skip the corrupted portion and continue the job. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.

2015-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6593:
---

Assignee: Apache Spark

 Provide option for HadoopRDD to skip bad data splits.
 -

 Key: SPARK-6593
 URL: https://issues.apache.org/jira/browse/SPARK-6593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Dale Richardson
Assignee: Apache Spark
Priority: Minor

 When reading a large amount of files from HDFS eg. with  
 sc.textFile(hdfs:///user/cloudera/logs*.gz). If a single split is corrupted 
 then the entire job is canceled. As default behaviour this is probably for 
 the best, but it would be nice in some circumstances where you know it will 
 be ok to have the option to skip the corrupted portion and continue the job. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6592) API of Row trait should be presented in Scala doc

2015-03-29 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385724#comment-14385724
 ] 

Nan Zhu commented on SPARK-6592:


?

I don't think that makes any difference, as the path of Row.scala still 
contains spark/sql/catalyst.

I also tried to rerun build/sbt doc; same thing...

Maybe we need to hack SparkBuild.scala to make an exception for Row.scala?
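
A hypothetical, untested tweak along those lines for the Scaladoc source filter 
in SparkBuild.scala (sbt build-definition syntax): keep Row.scala while still 
dropping the rest of catalyst.

{code}
// Illustrative only; the actual filter in SparkBuild.scala may be shaped differently.
sources in (Compile, doc) ~= (_.filter { f =>
  val path = f.getCanonicalPath
  !path.contains("catalyst") || path.endsWith("Row.scala")
})
{code}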

 API of Row trait should be presented in Scala doc
 -

 Key: SPARK-6592
 URL: https://issues.apache.org/jira/browse/SPARK-6592
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Nan Zhu

 Currently, the API of Row class is not presented in Scaladoc, though we have 
 many chances to use it 
 the reason is that we ignore all files under catalyst directly in 
 SparkBuild.scala when generating Scaladoc, 
 (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369)
 What's the best approach to fix this? [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.

2015-03-29 Thread Dale Richardson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385716#comment-14385716
 ] 

Dale Richardson edited comment on SPARK-6593 at 3/29/15 11:35 AM:
--

With a gz file, for example, the entire file is a single split, so a corrupted 
gz file will kill the entire job, with no way of catching and remediating the 
error.
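
Until such an option exists, a hedged user-side workaround sketch (not the 
proposed HadoopRDD option): list the files individually instead of using a 
glob, probe each one as its own RDD, and union only the readable ones.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.util.Try

def textFilesSkippingCorrupt(sc: SparkContext, paths: Seq[String]): RDD[String] = {
  val readable = paths.flatMap { p =>
    val rdd = sc.textFile(p)
    // A small action forces decompression, so a corrupt gzip fails (and is skipped)
    // here rather than aborting the combined job later.
    Try(rdd.take(1)).toOption.map(_ => rdd)
  }
  sc.union(readable)
}
{code}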


was (Author: tigerquoll):
With a gz file for example, the entire file is a split. so a corrupted gz file 
will kill the entire job.

 Provide option for HadoopRDD to skip bad data splits.
 -

 Key: SPARK-6593
 URL: https://issues.apache.org/jira/browse/SPARK-6593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Dale Richardson
Priority: Minor

 When reading a large amount of files from HDFS eg. with  
 sc.textFile(hdfs:///user/cloudera/logs*.gz). If a single split is corrupted 
 then the entire job is canceled. As default behaviour this is probably for 
 the best, but it would be nice in some circumstances where you know it will 
 be ok to have the option to skip the corrupted portion and continue the job. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.

2015-03-29 Thread Dale Richardson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385716#comment-14385716
 ] 

Dale Richardson commented on SPARK-6593:


With a gz file for example, the entire file is a split. so a corrupted gz file 
will kill the entire job.

 Provide option for HadoopRDD to skip bad data splits.
 -

 Key: SPARK-6593
 URL: https://issues.apache.org/jira/browse/SPARK-6593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Dale Richardson
Priority: Minor

 When reading a large amount of files from HDFS eg. with  
 sc.textFile(hdfs:///user/cloudera/logs*.gz). If a single split is corrupted 
 then the entire job is canceled. As default behaviour this is probably for 
 the best, but it would be nice in some circumstances where you know it will 
 be ok to have the option to skip the corrupted portion and continue the job. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6580) Optimize LogisticRegressionModel.predictPoint

2015-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6580:
---

Assignee: Apache Spark

 Optimize LogisticRegressionModel.predictPoint
 -

 Key: SPARK-6580
 URL: https://issues.apache.org/jira/browse/SPARK-6580
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Apache Spark
Priority: Minor

 LogisticRegressionModel.predictPoint could be optimized some.  There are 
 several checks which could be moved outside loops or even outside 
 predictPoint to initialization of the model.
 Some include:
 {code}
 require(numFeatures == weightMatrix.size)
 val dataWithBiasSize = weightMatrix.size / (numClasses - 1)
 val weightsArray = weightMatrix match { ...
 if (dataMatrix.size + 1 == dataWithBiasSize) {...
 {code}
 Also, for multiclass, the 2 loops (over numClasses and margins) could be 
 combined into 1 loop.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6579) save as parquet with overwrite failed

2015-03-29 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385694#comment-14385694
 ] 

Cheng Lian commented on SPARK-6579:
---

Here's another Parquet issue with Hadoop 1.0.4: SPARK-6581.

 save as parquet with overwrite failed
 -

 Key: SPARK-6579
 URL: https://issues.apache.org/jira/browse/SPARK-6579
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.4.0
Reporter: Davies Liu
Assignee: Michael Armbrust
Priority: Critical

 {code}
 df = sc.parallelize(xrange(n), 4).map(lambda x: (x, str(x) * 
 2,)).toDF(['int', 'str'])
 df.save("test_data", source="parquet", mode='overwrite')
 df.save("test_data", source="parquet", mode='overwrite')
 {code}
 it failed with:
 {code}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in 
 stage 3.0 failed 1 times, most recent failure: Lost task 3.0 in stage 3.0 
 (TID 6, localhost): java.lang.IllegalArgumentException: You cannot call 
 toBytes() more than once without calling reset()
   at parquet.Preconditions.checkArgument(Preconditions.java:47)
   at 
 parquet.column.values.rle.RunLengthBitPackingHybridEncoder.toBytes(RunLengthBitPackingHybridEncoder.java:254)
   at 
 parquet.column.values.rle.RunLengthBitPackingHybridValuesWriter.getBytes(RunLengthBitPackingHybridValuesWriter.java:68)
   at 
 parquet.column.impl.ColumnWriterImpl.writePage(ColumnWriterImpl.java:147)
   at parquet.column.impl.ColumnWriterImpl.flush(ColumnWriterImpl.java:236)
   at 
 parquet.column.impl.ColumnWriteStoreImpl.flush(ColumnWriteStoreImpl.java:113)
   at 
 parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153)
   at 
 parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
   at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:663)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:677)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:677)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:212)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1211)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1200)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1199)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1199)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1399)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1360)
   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 {code}
 run it again, it failed with:
 {code}
 15/03/27 13:26:16 WARN FSInputChecker: Problem opening checksum file: 
 file:/Users/davies/work/spark/tmp/test_data/_temporary/_attempt_201503271324_0011_r_03_0/part-r-4.parquet.
   Ignoring exception: java.io.EOFException
   at java.io.DataInputStream.readFully(DataInputStream.java:197)
   at java.io.DataInputStream.readFully(DataInputStream.java:169)
   at 
 org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:134)
   at 
 org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
   at 
 parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:402)
   at 
 

[jira] [Updated] (SPARK-6579) save as parquet with overwrite failed when linking with Hadoop 1.0.4

2015-03-29 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6579:
--
Summary: save as parquet with overwrite failed when linking with Hadoop 
1.0.4  (was: save as parquet with overwrite failed)

 save as parquet with overwrite failed when linking with Hadoop 1.0.4
 

 Key: SPARK-6579
 URL: https://issues.apache.org/jira/browse/SPARK-6579
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.4.0
Reporter: Davies Liu
Assignee: Michael Armbrust
Priority: Critical

 {code}
 df = sc.parallelize(xrange(n), 4).map(lambda x: (x, str(x) * 
 2,)).toDF(['int', 'str'])
 df.save("test_data", source="parquet", mode='overwrite')
 df.save("test_data", source="parquet", mode='overwrite')
 {code}
 it failed with:
 {code}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in 
 stage 3.0 failed 1 times, most recent failure: Lost task 3.0 in stage 3.0 
 (TID 6, localhost): java.lang.IllegalArgumentException: You cannot call 
 toBytes() more than once without calling reset()
   at parquet.Preconditions.checkArgument(Preconditions.java:47)
   at 
 parquet.column.values.rle.RunLengthBitPackingHybridEncoder.toBytes(RunLengthBitPackingHybridEncoder.java:254)
   at 
 parquet.column.values.rle.RunLengthBitPackingHybridValuesWriter.getBytes(RunLengthBitPackingHybridValuesWriter.java:68)
   at 
 parquet.column.impl.ColumnWriterImpl.writePage(ColumnWriterImpl.java:147)
   at parquet.column.impl.ColumnWriterImpl.flush(ColumnWriterImpl.java:236)
   at 
 parquet.column.impl.ColumnWriteStoreImpl.flush(ColumnWriteStoreImpl.java:113)
   at 
 parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153)
   at 
 parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
   at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:663)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:677)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:677)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:212)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1211)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1200)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1199)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1199)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1399)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1360)
   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 {code}
 run it again, it failed with:
 {code}
 15/03/27 13:26:16 WARN FSInputChecker: Problem opening checksum file: 
 file:/Users/davies/work/spark/tmp/test_data/_temporary/_attempt_201503271324_0011_r_03_0/part-r-4.parquet.
   Ignoring exception: java.io.EOFException
   at java.io.DataInputStream.readFully(DataInputStream.java:197)
   at java.io.DataInputStream.readFully(DataInputStream.java:169)
   at 
 org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:134)
   at 
 org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
   at 
 

[jira] [Updated] (SPARK-6593) Provide option for HadoopRDD to skip bad data splits.

2015-03-29 Thread Dale Richardson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dale Richardson updated SPARK-6593:
---
Description: 
When reading a large number of files from HDFS, e.g. with 
sc.textFile("hdfs:///user/cloudera/logs*.gz"), the entire job is canceled if the 
Hadoop input libraries report an exception. As default behaviour this is 
probably for the best, but in some circumstances where you know it will be OK, 
it would be nice to have the option to skip the corrupted portion and 
continue the job. 


  was:
When reading a large amount of files from HDFS eg. with  
sc.textFile(hdfs:///user/cloudera/logs*.gz). If a single split is corrupted 
then the entire job is canceled. As default behaviour this is probably for the 
best, but it would be nice in some circumstances where you know it will be ok 
to have the option to skip the corrupted portion and continue the job. 



 Provide option for HadoopRDD to skip bad data splits.
 -

 Key: SPARK-6593
 URL: https://issues.apache.org/jira/browse/SPARK-6593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Dale Richardson
Priority: Minor

 When reading a large number of files from HDFS, e.g. with 
 sc.textFile("hdfs:///user/cloudera/logs*.gz"), if the hadoop input libraries 
 report an exception then the entire job is canceled. As default behaviour 
 this is probably for the best, but in some circumstances where you know it 
 would be acceptable, it would be nice to have the option to skip the 
 corrupted portion and continue the job. 
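
Until such an option exists in HadoopRDD, a rough user-side workaround is possible. The sketch below is illustrative only and is not the proposed option: assuming the usual spark-shell {{sc}}, it expands the glob itself, reads each file as its own RDD, and keeps only the files whose initial read succeeds. The path and the broad exception catch are assumptions, and deeper corruption can still fail later tasks.

{code}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD

// Expand the glob ourselves so each file can be tried independently.
val fs = FileSystem.get(sc.hadoopConfiguration)
val paths = fs.globStatus(new Path("hdfs:///user/cloudera/logs*.gz")).map(_.getPath.toString)

// Keep only the files whose initial read succeeds; skip the rest.
val readable: Seq[RDD[String]] = paths.toSeq.flatMap { p =>
  try {
    val rdd = sc.textFile(p)
    rdd.take(1)   // force a read of the first split; a corrupt gzip header fails here
    Some(rdd)
  } catch {
    case _: Exception => None   // skip files that cannot be read
  }
}

val logs = sc.union(readable)
{code}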



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6595) DataFrame self joins with MetastoreRelations fail

2015-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385924#comment-14385924
 ] 

Apache Spark commented on SPARK-6595:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/5251

 DataFrame self joins with MetastoreRelations fail
 -

 Key: SPARK-6595
 URL: https://issues.apache.org/jira/browse/SPARK-6595
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6595) DataFrame self joins with MetastoreRelations fail

2015-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6595:
---

Assignee: Michael Armbrust  (was: Apache Spark)

 DataFrame self joins with MetastoreRelations fail
 -

 Key: SPARK-6595
 URL: https://issues.apache.org/jira/browse/SPARK-6595
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6595) DataFrame self joins with MetastoreRelations fail

2015-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6595:
---

Assignee: Apache Spark  (was: Michael Armbrust)

 DataFrame self joins with MetastoreRelations fail
 -

 Key: SPARK-6595
 URL: https://issues.apache.org/jira/browse/SPARK-6595
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Michael Armbrust
Assignee: Apache Spark
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6592) API of Row trait should be presented in Scala doc

2015-03-29 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385964#comment-14385964
 ] 

Nan Zhu commented on SPARK-6592:


It does contain it.

The reason is that the input of that line is file.getCanonicalPath, which 
outputs the absolute path.

e.g.

{code}
scala> val f = new java.io.File("Row.class")
f: java.io.File = Row.class

scala> f.getCanonicalPath
res0: String = /Users/nanzhu/code/spark/sql/catalyst/target/scala-2.10/classes/org/apache/spark/sql/Row.class
{code}
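
To make the discussion concrete, here is an illustrative sketch (not the actual SparkBuild.scala code) of why matching any path that contains catalyst also drops classes such as org.apache.spark.sql.Row, and what a narrower match on the package path would look like. The predicate names are invented for the example; this appears to be what the suggestion to add apache to the pattern amounts to.

{code}
// Illustrative only -- not the real SparkBuild.scala filter.
// Row.scala lives under the sql/catalyst *module*, so its canonical path
// contains "catalyst" even though Row is not in the catalyst *package*:
//   .../sql/catalyst/target/scala-2.10/classes/org/apache/spark/sql/Row.class
def ignoredByDirectory(f: java.io.File): Boolean =
  f.getCanonicalPath.contains("catalyst")                        // too broad: also drops Row

// Matching the package path keeps Row while still hiding catalyst internals:
def ignoredByPackage(f: java.io.File): Boolean =
  f.getCanonicalPath.contains("org/apache/spark/sql/catalyst")   // narrower match
{code}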


 API of Row trait should be presented in Scala doc
 -

 Key: SPARK-6592
 URL: https://issues.apache.org/jira/browse/SPARK-6592
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Nan Zhu
Priority: Critical

 Currently, the API of the Row trait is not presented in the Scaladoc, even 
 though we use it in many places. 
 The reason is that we ignore all files under the catalyst directory in 
 SparkBuild.scala when generating the Scaladoc 
 (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369). 
 What's the best approach to fix this? [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6592) API of Row trait should be presented in Scala doc

2015-03-29 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385895#comment-14385895
 ] 

Reynold Xin commented on SPARK-6592:


Row.html/class doesn't contain the word catalyst, does it?

./api/java/org/apache/spark/sql/Row.html



 API of Row trait should be presented in Scala doc
 -

 Key: SPARK-6592
 URL: https://issues.apache.org/jira/browse/SPARK-6592
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Nan Zhu
Priority: Critical

 Currently, the API of the Row trait is not presented in the Scaladoc, even 
 though we use it in many places. 
 The reason is that we ignore all files under the catalyst directory in 
 SparkBuild.scala when generating the Scaladoc 
 (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369). 
 What's the best approach to fix this? [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6595) DataFrame self joins with MetastoreRelations fail

2015-03-29 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-6595:
---

 Summary: DataFrame self joins with MetastoreRelations fail
 Key: SPARK-6595
 URL: https://issues.apache.org/jira/browse/SPARK-6595
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6592) API of Row trait should be presented in Scala doc

2015-03-29 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385970#comment-14385970
 ] 

Reynold Xin commented on SPARK-6592:


Ok then can't you just add apache to it?


 API of Row trait should be presented in Scala doc
 -

 Key: SPARK-6592
 URL: https://issues.apache.org/jira/browse/SPARK-6592
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Nan Zhu
Priority: Critical

 Currently, the API of the Row trait is not presented in the Scaladoc, even 
 though we use it in many places. 
 The reason is that we ignore all files under the catalyst directory in 
 SparkBuild.scala when generating the Scaladoc 
 (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369). 
 What's the best approach to fix this? [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6592) API of Row trait should be presented in Scala doc

2015-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6592:
---

Assignee: (was: Apache Spark)

 API of Row trait should be presented in Scala doc
 -

 Key: SPARK-6592
 URL: https://issues.apache.org/jira/browse/SPARK-6592
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Nan Zhu
Priority: Critical

 Currently, the API of the Row trait is not presented in the Scaladoc, even 
 though we use it in many places. 
 The reason is that we ignore all files under the catalyst directory in 
 SparkBuild.scala when generating the Scaladoc 
 (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369). 
 What's the best approach to fix this? [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6592) API of Row trait should be presented in Scala doc

2015-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385991#comment-14385991
 ] 

Apache Spark commented on SPARK-6592:
-

User 'CodingCat' has created a pull request for this issue:
https://github.com/apache/spark/pull/5252

 API of Row trait should be presented in Scala doc
 -

 Key: SPARK-6592
 URL: https://issues.apache.org/jira/browse/SPARK-6592
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Nan Zhu
Priority: Critical

 Currently, the API of the Row trait is not presented in the Scaladoc, even 
 though we use it in many places. 
 The reason is that we ignore all files under the catalyst directory in 
 SparkBuild.scala when generating the Scaladoc 
 (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369). 
 What's the best approach to fix this? [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6592) API of Row trait should be presented in Scala doc

2015-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6592:
---

Assignee: Apache Spark

 API of Row trait should be presented in Scala doc
 -

 Key: SPARK-6592
 URL: https://issues.apache.org/jira/browse/SPARK-6592
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Nan Zhu
Assignee: Apache Spark
Priority: Critical

 Currently, the API of the Row trait is not presented in the Scaladoc, even 
 though we use it in many places. 
 The reason is that we ignore all files under the catalyst directory in 
 SparkBuild.scala when generating the Scaladoc 
 (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369). 
 What's the best approach to fix this? [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6596) fix the instruction on building scaladoc

2015-03-29 Thread Nan Zhu (JIRA)
Nan Zhu created SPARK-6596:
--

 Summary: fix the instruction on building scaladoc 
 Key: SPARK-6596
 URL: https://issues.apache.org/jira/browse/SPARK-6596
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Nan Zhu


In README.md under the docs/ directory, it says that 

"You can build just the Spark scaladoc by running build/sbt doc from the 
SPARK_PROJECT_ROOT directory."

I guess the right command is build/sbt unidoc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6596) fix the instruction on building scaladoc

2015-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385993#comment-14385993
 ] 

Apache Spark commented on SPARK-6596:
-

User 'CodingCat' has created a pull request for this issue:
https://github.com/apache/spark/pull/5253

 fix the instruction on building scaladoc 
 -

 Key: SPARK-6596
 URL: https://issues.apache.org/jira/browse/SPARK-6596
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Nan Zhu

 In README.md under the docs/ directory, it says that "You can build just the 
 Spark scaladoc by running build/sbt doc from the SPARK_PROJECT_ROOT directory." 
 I guess the right command is build/sbt unidoc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6596) fix the instruction on building scaladoc

2015-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6596:
---

Assignee: (was: Apache Spark)

 fix the instruction on building scaladoc 
 -

 Key: SPARK-6596
 URL: https://issues.apache.org/jira/browse/SPARK-6596
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Nan Zhu

 In README.md under the docs/ directory, it says that "You can build just the 
 Spark scaladoc by running build/sbt doc from the SPARK_PROJECT_ROOT directory." 
 I guess the right command is build/sbt unidoc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6595) DataFrame self joins with MetastoreRelations fail

2015-03-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6595:

Target Version/s: 1.3.1, 1.4.0  (was: 1.3.1)

 DataFrame self joins with MetastoreRelations fail
 -

 Key: SPARK-6595
 URL: https://issues.apache.org/jira/browse/SPARK-6595
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6596) fix the instruction on building scaladoc

2015-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6596:
---

Assignee: Apache Spark

 fix the instruction on building scaladoc 
 -

 Key: SPARK-6596
 URL: https://issues.apache.org/jira/browse/SPARK-6596
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Nan Zhu
Assignee: Apache Spark

 In README.md under the docs/ directory, it says that "You can build just the 
 Spark scaladoc by running build/sbt doc from the SPARK_PROJECT_ROOT directory." 
 I guess the right command is build/sbt unidoc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame

2015-03-29 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386009#comment-14386009
 ] 

Liang-Chi Hsieh commented on SPARK-6586:


Not true. Because a DataFrame is now given the analyzed plan after many of its 
operations, {{df.queryExecution.logical}} is the analyzed plan instead of the 
original logical plan.
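
For reference, below is a minimal sketch of the approach described in the quoted issue, written against simplified stand-in classes rather than Spark's actual {{LogicalPlan}} and {{QueryExecution}}. Only the names {{analyzed}} and {{originalPlan}} come from the description; the {{preAnalysis}} field and the {{analyzer}} parameter are invented for the illustration.

{code}
// Simplified stand-ins -- not Spark's actual classes.
abstract class LogicalPlan {
  var analyzed: Boolean = false                  // set once the Analyzer finishes analysis
  var preAnalysis: Option[LogicalPlan] = None    // invented back-reference to the pre-analysis plan

  // Recursively walk back to the plan as it looked before analysis.
  def originalPlan: LogicalPlan = preAnalysis.map(_.originalPlan).getOrElse(this)
}

class QueryExecution(logical: LogicalPlan, analyzer: LogicalPlan => LogicalPlan) {
  // Skip analysis when the plan is already analyzed, as the proposal notes.
  lazy val analyzedPlan: LogicalPlan =
    if (logical.analyzed) logical
    else {
      val result = analyzer(logical)
      result.analyzed = true
      result.preAnalysis = Some(logical)         // keep the original plan reachable
      result
    }
}
{code}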

 Add the capability of retrieving original logical plan of DataFrame
 ---

 Key: SPARK-6586
 URL: https://issues.apache.org/jira/browse/SPARK-6586
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor

 In order to solve a bug, since #5217, {{DataFrame}} now uses the analyzed plan 
 instead of the logical plan. However, by doing that we can't know the logical 
 plan of a {{DataFrame}}. But it might still be useful and important to 
 retrieve the original logical plan in some use cases.
 In this PR, we introduce the capability of retrieving the original logical plan 
 of a {{DataFrame}}.
 The approach is that we add an {{analyzed}} variable to {{LogicalPlan}}. Once 
 {{Analyzer}} finishes analysis, it sets {{analyzed}} of {{LogicalPlan}} to 
 {{true}}. In {{QueryExecution}}, we keep the original logical plan in the 
 analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to 
 recursively replace the analyzed logical plan with the original logical plan 
 and retrieve it.
 Besides the capability of retrieving the original logical plan, this 
 modification can also avoid doing plan analysis if the plan is already analyzed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame

2015-03-29 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh reopened SPARK-6586:


 Add the capability of retrieving original logical plan of DataFrame
 ---

 Key: SPARK-6586
 URL: https://issues.apache.org/jira/browse/SPARK-6586
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor

 In order to solve a bug, since #5217, {{DataFrame}} now uses the analyzed plan 
 instead of the logical plan. However, by doing that we can't know the logical 
 plan of a {{DataFrame}}. But it might still be useful and important to 
 retrieve the original logical plan in some use cases.
 In this PR, we introduce the capability of retrieving the original logical plan 
 of a {{DataFrame}}.
 The approach is that we add an {{analyzed}} variable to {{LogicalPlan}}. Once 
 {{Analyzer}} finishes analysis, it sets {{analyzed}} of {{LogicalPlan}} to 
 {{true}}. In {{QueryExecution}}, we keep the original logical plan in the 
 analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to 
 recursively replace the analyzed logical plan with the original logical plan 
 and retrieve it.
 Besides the capability of retrieving the original logical plan, this 
 modification can also avoid doing plan analysis if the plan is already analyzed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame

2015-03-29 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386009#comment-14386009
 ] 

Liang-Chi Hsieh edited comment on SPARK-6586 at 3/29/15 11:24 PM:
--

Not true. Because a DataFrame is now given the analyzed plan after many of its 
operations, {{df.queryExecution.logical}} is the analyzed plan instead of the 
original logical plan.

You can check PR #5217 for the modification.


was (Author: viirya):
Not true. Because a DataFrame is now given the analyzed plan after many of its 
operations, {{df.queryExecution.logical}} is the analyzed plan instead of the 
original logical plan.

 Add the capability of retrieving original logical plan of DataFrame
 ---

 Key: SPARK-6586
 URL: https://issues.apache.org/jira/browse/SPARK-6586
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor

 In order to solve a bug, since #5217, {{DataFrame}} now uses the analyzed plan 
 instead of the logical plan. However, by doing that we can't know the logical 
 plan of a {{DataFrame}}. But it might still be useful and important to 
 retrieve the original logical plan in some use cases.
 In this PR, we introduce the capability of retrieving the original logical plan 
 of a {{DataFrame}}.
 The approach is that we add an {{analyzed}} variable to {{LogicalPlan}}. Once 
 {{Analyzer}} finishes analysis, it sets {{analyzed}} of {{LogicalPlan}} to 
 {{true}}. In {{QueryExecution}}, we keep the original logical plan in the 
 analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to 
 recursively replace the analyzed logical plan with the original logical plan 
 and retrieve it.
 Besides the capability of retrieving the original logical plan, this 
 modification can also avoid doing plan analysis if the plan is already analyzed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6597) Replace `input:checkbox` with `input[type=checkbox]` in additional-metrics.js

2015-03-29 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-6597:
-

 Summary: Replace `input:checkbox` with `input[type=checkbox]` in 
additional-metrics.js
 Key: SPARK-6597
 URL: https://issues.apache.org/jira/browse/SPARK-6597
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.2.2, 1.3.1, 1.4.0
Reporter: Kousuke Saruta
Priority: Minor


In additional-metrics.js, there is some selector notation like 
`input:checkbox`, but jQuery's official documentation says `input[type=checkbox]` 
is better.

https://api.jquery.com/checkbox-selector/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6597) Replace `input:checkbox` with `input[type=checkbox]` in additional-metrics.js

2015-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6597:
---

Assignee: Apache Spark

 Replace `input:checkbox` with `input[type=checkbox]` in additional-metrics.js
 --

 Key: SPARK-6597
 URL: https://issues.apache.org/jira/browse/SPARK-6597
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.2.2, 1.3.1, 1.4.0
Reporter: Kousuke Saruta
Assignee: Apache Spark
Priority: Minor

 In additional-metrics.js, there is some selector notation like 
 `input:checkbox`, but jQuery's official documentation says `input[type=checkbox]` 
 is better.
 https://api.jquery.com/checkbox-selector/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6597) Replace `input:checkbox` with `input[type=checkbox]` in additional-metrics.js

2015-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6597:
---

Assignee: (was: Apache Spark)

 Replace `input:checkbox` with `input[type=checkbox]` in additional-metrics.js
 --

 Key: SPARK-6597
 URL: https://issues.apache.org/jira/browse/SPARK-6597
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.2.2, 1.3.1, 1.4.0
Reporter: Kousuke Saruta
Priority: Minor

 In additional-metrics.js, there is some selector notation like 
 `input:checkbox`, but jQuery's official documentation says `input[type=checkbox]` 
 is better.
 https://api.jquery.com/checkbox-selector/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6597) Replace `input:checkbox` with `input[type=checkbox]` in additional-metrics.js

2015-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386017#comment-14386017
 ] 

Apache Spark commented on SPARK-6597:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/5254

 Replace `input:checkbox` with `input[type=checkbox]` in additional-metrics.js
 --

 Key: SPARK-6597
 URL: https://issues.apache.org/jira/browse/SPARK-6597
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.2.2, 1.3.1, 1.4.0
Reporter: Kousuke Saruta
Priority: Minor

 In additional-metrics.js, there is some selector notation like 
 `input:checkbox`, but jQuery's official documentation says `input[type=checkbox]` 
 is better.
 https://api.jquery.com/checkbox-selector/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame

2015-03-29 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386020#comment-14386020
 ] 

Michael Armbrust commented on SPARK-6586:
-

Okay, but what is the utility of keeping a fully unresolved plan around?
You are just complicating DataFrame with a bunch of mutable state.



 Add the capability of retrieving original logical plan of DataFrame
 ---

 Key: SPARK-6586
 URL: https://issues.apache.org/jira/browse/SPARK-6586
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor

 In order to solve a bug, since #5217, {{DataFrame}} now uses the analyzed plan 
 instead of the logical plan. However, by doing that we can't know the logical 
 plan of a {{DataFrame}}. But it might still be useful and important to 
 retrieve the original logical plan in some use cases.
 In this PR, we introduce the capability of retrieving the original logical plan 
 of a {{DataFrame}}.
 The approach is that we add an {{analyzed}} variable to {{LogicalPlan}}. Once 
 {{Analyzer}} finishes analysis, it sets {{analyzed}} of {{LogicalPlan}} to 
 {{true}}. In {{QueryExecution}}, we keep the original logical plan in the 
 analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added to 
 recursively replace the analyzed logical plan with the original logical plan 
 and retrieve it.
 Besides the capability of retrieving the original logical plan, this 
 modification can also avoid doing plan analysis if the plan is already analyzed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org