[jira] [Commented] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-15 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927539#comment-15927539
 ] 

Kay Ousterhout commented on SPARK-19803:


Awesome thanks!

> Flaky BlockManagerProactiveReplicationSuite tests
> -
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Shubham Chopra
>  Labels: flaky-test
> Fix For: 2.2.0
>
>
> The tests added for BlockManagerProactiveReplicationSuite have made the 
> Jenkins build flaky. Please refer to the build for more details - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19968) Use a cached instance of KafkaProducer for writing to kafka via KafkaSink.

2017-03-15 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-19968:

Description: 
KafkaProducer is thread-safe, and an instance can be reused for writing every 
batch out. According to the Kafka docs, this sort of usage is encouraged.

On average, an addBatch operation takes 25 ms with this patch and 250+ ms 
without it.

TODO: Post my results from benchmark comparison. 



  was:
KafkaProducer is thread safe and an instance can be reused for writing every 
batch out. According to Kafka docs, this sort of usage is encouraged.

TODO: Post my results from benchmark comparison. 




> Use a cached instance of KafkaProducer for writing to kafka via KafkaSink.
> --
>
> Key: SPARK-19968
> URL: https://issues.apache.org/jira/browse/SPARK-19968
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>  Labels: kafka
>
> KafkaProducer is thread safe and an instance can be reused for writing every 
> batch out. According to Kafka docs, this sort of usage is encouraged.
> On an average an addBatch operation takes 25ms with this patch and 250+ ms 
> without this patch.
> TODO: Post my results from benchmark comparison. 
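For illustration, a minimal sketch of the caching idea described above (this is 
not the actual KafkaSink code; the object name, the kafkaParams map, and the 
topic/key/value variables are placeholders, and byte-array serializers are 
assumed to be configured in kafkaParams):

{code}
import java.util.concurrent.ConcurrentHashMap
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object CachedKafkaProducer {
  // One producer per distinct configuration. KafkaProducer is thread safe,
  // so the same instance can serve every batch the sink writes.
  private val cache =
    new ConcurrentHashMap[Map[String, Object], KafkaProducer[Array[Byte], Array[Byte]]]()

  def getOrCreate(kafkaParams: Map[String, Object]): KafkaProducer[Array[Byte], Array[Byte]] = {
    val cached = cache.get(kafkaParams)
    if (cached != null) {
      cached
    } else {
      val props = new java.util.Properties()
      kafkaParams.foreach { case (k, v) => props.put(k, v) }
      val producer = new KafkaProducer[Array[Byte], Array[Byte]](props)
      // Keep the first producer registered for this config; close ours if we lost the race.
      val prior = cache.putIfAbsent(kafkaParams, producer)
      if (prior != null) { producer.close(); prior } else producer
    }
  }
}

// Per batch, reuse the cached producer instead of constructing a new one:
//   val producer = CachedKafkaProducer.getOrCreate(kafkaParams)
//   producer.send(new ProducerRecord[Array[Byte], Array[Byte]](topic, key, value))
{code}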



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19968) Use a cached instance of KafkaProducer for writing to kafka via KafkaSink.

2017-03-15 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma reassigned SPARK-19968:
---

Assignee: Prashant Sharma

> Use a cached instance of KafkaProducer for writing to kafka via KafkaSink.
> --
>
> Key: SPARK-19968
> URL: https://issues.apache.org/jira/browse/SPARK-19968
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>  Labels: kafka
>
> KafkaProducer is thread safe and an instance can be reused for writing every 
> batch out. According to Kafka docs, this sort of usage is encouraged.
> TODO: Post my results from benchmark comparison. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19968) Use a cached instance of KafkaProducer for writing to kafka via KafkaSink.

2017-03-15 Thread Prashant Sharma (JIRA)
Prashant Sharma created SPARK-19968:
---

 Summary: Use a cached instance of KafkaProducer for writing to 
kafka via KafkaSink.
 Key: SPARK-19968
 URL: https://issues.apache.org/jira/browse/SPARK-19968
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Prashant Sharma


KafkaProducer is thread-safe, and an instance can be reused for writing every 
batch out. According to the Kafka docs, this sort of usage is encouraged.

TODO: Post my results from benchmark comparison. 





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19964) SparkSubmitSuite fails due to Timeout

2017-03-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927510#comment-15927510
 ] 

Sean Owen commented on SPARK-19964:
---

It doesn't fail in master. This sounds like the kind of thing that fails when 
you have old builds lying around - could be local?

> SparkSubmitSuite fails due to Timeout
> -
>
> Key: SPARK-19964
> URL: https://issues.apache.org/jira/browse/SPARK-19964
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Tests
>Affects Versions: 2.2.0
>Reporter: Eren Avsarogullari
>  Labels: flaky-test
> Attachments: SparkSubmitSuite_Stacktrace
>
>
> The following test case has failed due to a TestFailedDueToTimeoutException:
> *Test Suite:* SparkSubmitSuite
> *Test Case:* includes jars passed in through --packages
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/
> *Stacktrace is also attached.*



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19967) Add from_json APIs to SQL

2017-03-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19967:

Description: 
The method "from_json" is a useful method in turning a string column into a 
nested StructType with a user specified schema. The schema should be specified 
in the DDL format


  was:The method "from_json" is a useful method in turning a string column into 
a nested StructType with a user specified schema. The schema should be 
specified in the DDL format for declaring the table schema. 


> Add from_json APIs to SQL
> -
>
> Key: SPARK-19967
> URL: https://issues.apache.org/jira/browse/SPARK-19967
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> The method "from_json" is a useful method in turning a string column into a 
> nested StructType with a user specified schema. The schema should be 
> specified in the DDL format
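For illustration, a hedged sketch of the intended usage, assuming the SQL-side 
from_json will accept a DDL-formatted schema string (the exact surface is still 
being discussed in the comments on this issue):

{code}
// spark-shell sketch. DataFrame API today: the schema is a programmatic StructType.
import org.apache.spark.sql.functions.{from_json, lit}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val schema = new StructType().add("a", IntegerType).add("b", StringType)
spark.range(1).select(from_json(lit("""{"a": 1, "b": "spark"}"""), schema)).show()

// Proposed SQL surface: the schema passed as a DDL-formatted string instead.
spark.sql("""SELECT from_json('{"a": 1, "b": "spark"}', 'a INT, b STRING')""").show()
{code}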



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19898) Add from_json in FunctionRegistry

2017-03-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-19898.
---
Resolution: Duplicate

> Add from_json in FunctionRegistry
> -
>
> Key: SPARK-19898
> URL: https://issues.apache.org/jira/browse/SPARK-19898
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In SPARK-19637, `to_json` was added to FunctionRegistry. As for `from_json`, 
> we have not added it yet because it is not clear how users should specify the 
> schema as a SQL string (see https://github.com/apache/spark/pull/16981 for the 
> related discussion).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19967) Add from_json APIs to SQL

2017-03-15 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927485#comment-15927485
 ] 

Takeshi Yamamuro commented on SPARK-19967:
--

oh, looks great! Thanks!

> Add from_json APIs to SQL
> -
>
> Key: SPARK-19967
> URL: https://issues.apache.org/jira/browse/SPARK-19967
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> The method "from_json" is a useful method in turning a string column into a 
> nested StructType with a user specified schema. The schema should be 
> specified in the DDL format for declaring the table schema. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19967) Add from_json APIs to SQL

2017-03-15 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927478#comment-15927478
 ] 

Xiao Li commented on SPARK-19967:
-

See the PR https://github.com/apache/spark/pull/17171. It was merged very 
recently.

> Add from_json APIs to SQL
> -
>
> Key: SPARK-19967
> URL: https://issues.apache.org/jira/browse/SPARK-19967
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> The method "from_json" is a useful method in turning a string column into a 
> nested StructType with a user specified schema. The schema should be 
> specified in the DDL format for declaring the table schema. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19967) Add from_json APIs to SQL

2017-03-15 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927476#comment-15927476
 ] 

Takeshi Yamamuro commented on SPARK-19967:
--

Finally, what's the DDL format for the schema? In my previous PR, I just used 
the format parsed by DataType.fromJson():
https://github.com/apache/spark/compare/master...maropu:SPARK-19637-2

> Add from_json APIs to SQL
> -
>
> Key: SPARK-19967
> URL: https://issues.apache.org/jira/browse/SPARK-19967
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> The method "from_json" is a useful method in turning a string column into a 
> nested StructType with a user specified schema. The schema should be 
> specified in the DDL format for declaring the table schema. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19967) Add from_json APIs to SQL

2017-03-15 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927466#comment-15927466
 ] 

Takeshi Yamamuro commented on SPARK-19967:
--

Yea, sure! I'll make a PR in a day! Thanks.
BTW, I already filed a JIRA for this a few days ago: 
https://issues.apache.org/jira/browse/SPARK-19898

> Add from_json APIs to SQL
> -
>
> Key: SPARK-19967
> URL: https://issues.apache.org/jira/browse/SPARK-19967
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> The method "from_json" is a useful method in turning a string column into a 
> nested StructType with a user specified schema. The schema should be 
> specified in the DDL format for declaring the table schema. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19967) Add from_json APIs to SQL

2017-03-15 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927462#comment-15927462
 ] 

Xiao Li commented on SPARK-19967:
-

cc  [~maropu]

> Add from_json APIs to SQL
> -
>
> Key: SPARK-19967
> URL: https://issues.apache.org/jira/browse/SPARK-19967
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> The method "from_json" is a useful method in turning a string column into a 
> nested StructType with a user specified schema. The schema should be 
> specified in the DDL format for declaring the table schema. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19967) Add from_json APIs to SQL

2017-03-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19967:

Description: The method "from_json" is useful for turning a string column into 
a nested StructType with a user-specified schema. The schema should be 
specified in the DDL format for declaring the table schema. 

> Add from_json APIs to SQL
> -
>
> Key: SPARK-19967
> URL: https://issues.apache.org/jira/browse/SPARK-19967
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> The method "from_json" is a useful method in turning a string column into a 
> nested StructType with a user specified schema. The schema should be 
> specified in the DDL format for declaring the table schema. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19967) Add from_json APIs to SQL

2017-03-15 Thread Xiao Li (JIRA)
Xiao Li created SPARK-19967:
---

 Summary: Add from_json APIs to SQL
 Key: SPARK-19967
 URL: https://issues.apache.org/jira/browse/SPARK-19967
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.1.0
Reporter: Xiao Li






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19830) Add parseTableSchema API to ParserInterface

2017-03-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19830.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17171
[https://github.com/apache/spark/pull/17171]

> Add parseTableSchema API to ParserInterface
> ---
>
> Key: SPARK-19830
> URL: https://issues.apache.org/jira/browse/SPARK-19830
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> Specifying the table schema in DDL format is needed in different scenarios, 
> for example, specifying the schema of the SQL function {{from_json}} and 
> specifying customized JDBC data types. In the submitted PRs, the idea was to 
> ask users to specify the table schema in the JSON format, which is not 
> user-friendly. 
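For illustration, a hedged sketch of the two ways of writing the same schema 
(the parseTableSchema entry point is taken from the pull request linked above; 
treat the exact call site as an assumption):

{code}
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.DataType

// DDL-style schema string: concise and familiar from CREATE TABLE statements.
val ddlSchema = CatalystSqlParser.parseTableSchema("a INT, b STRING, c MAP<STRING, DOUBLE>")

// JSON-style schema string: verbose and awkward for users to hand-write.
val jsonSchema = DataType.fromJson(
  """{"type":"struct","fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]}""")
{code}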



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19965) DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output

2017-03-15 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927431#comment-15927431
 ] 

Liwei Lin commented on SPARK-19965:
---

Hi [~zsxwing], are you working on a patch? Mind if I work on this in case you 
haven't started your work?

> DataFrame batch reader may fail to infer partitions when reading 
> FileStreamSink's output
> 
>
> Key: SPARK-19965
> URL: https://issues.apache.org/jira/browse/SPARK-19965
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>
> Reproducer
> {code}
>   test("partitioned writing and batch reading with 'basePath'") {
> val inputData = MemoryStream[Int]
> val ds = inputData.toDS()
> val outputDir = Utils.createTempDir(namePrefix = 
> "stream.output").getCanonicalPath
> val checkpointDir = Utils.createTempDir(namePrefix = 
> "stream.checkpoint").getCanonicalPath
> var query: StreamingQuery = null
> try {
>   query =
> ds.map(i => (i, i * 1000))
>   .toDF("id", "value")
>   .writeStream
>   .partitionBy("id")
>   .option("checkpointLocation", checkpointDir)
>   .format("parquet")
>   .start(outputDir)
>   inputData.addData(1, 2, 3)
>   failAfter(streamingTimeout) {
> query.processAllAvailable()
>   }
>   spark.read.option("basePath", outputDir).parquet(outputDir + 
> "/*").show()
> } finally {
>   if (query != null) {
> query.stop()
>   }
> }
>   }
> {code}
> Stack trace
> {code}
> [info] - partitioned writing and batch reading with 'basePath' *** FAILED *** 
> (3 seconds, 928 milliseconds)
> [info]   java.lang.AssertionError: assertion failed: Conflicting directory 
> structures detected. Suspicious paths:
> [info]***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637
> [info]
> ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637/_spark_metadata
> [info] 
> [info] If provided paths are partition directories, please set "basePath" in 
> the options of the data source to specify the root directory of the table. If 
> there are multiple root directories, please load them separately and then 
> union them.
> [info]   at scala.Predef$.assert(Predef.scala:170)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:133)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:98)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:156)
> [info]   at 
> org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:54)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:55)
> [info]   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:133)
> [info]   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
> [info]   at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:160)
> [info]   at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:536)
> [info]   at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:520)
> [info]   at 
> org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply$mcV$sp(FileStreamSinkSuite.scala:292)
> [info]   at 
> org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
> [info]   at 
> org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18591) Replace hash-based aggregates with sort-based ones if inputs already sorted

2017-03-15 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927409#comment-15927409
 ] 

Takeshi Yamamuro commented on SPARK-18591:
--

yea, sure. Have we already filed a JIRA for the planner improvement?

> Replace hash-based aggregates with sort-based ones if inputs already sorted
> ---
>
> Key: SPARK-18591
> URL: https://issues.apache.org/jira/browse/SPARK-18591
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Takeshi Yamamuro
>
> Spark currently uses sort-based aggregates only in limited conditions: the 
> cases where Spark cannot use partial aggregates and hash-based ones.
> However, if the input ordering already satisfies the requirements of 
> sort-based aggregates, it seems sort-based ones are faster than hash-based ones.
> {code}
> ./bin/spark-shell --conf spark.sql.shuffle.partitions=1
> val df = spark.range(1000).selectExpr("id AS key", "id % 10 AS 
> value").sort($"key").cache
> def timer[R](block: => R): R = {
>   val t0 = System.nanoTime()
>   val result = block
>   val t1 = System.nanoTime()
>   println("Elapsed time: " + ((t1 - t0 + 0.0) / 10.0)+ "s")
>   result
> }
> timer {
>   df.groupBy("key").count().count
> }
> // codegen'd hash aggregate
> Elapsed time: 7.116962977s
> // non-codegen'd sort aggregate
> Elapsed time: 3.088816662s
> {code}
> If codegen'd sort-based aggregates are supported in SPARK-16844, this seems 
> to make the performance gap bigger;
> {code}
> - codegen'd sort aggregate
> Elapsed time: 1.645234684s
> {code} 
> Therefore, it'd be better to use sort-based ones in this case.
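For illustration, one way to see which aggregate the planner currently picks 
for the workload above (a hedged spark-shell sketch; the operator names are 
whatever explain() prints in the Spark version at hand):

{code}
// Sort and cache by the grouping key, then inspect the physical plan of the
// aggregation: today a HashAggregate pair is chosen even though the input
// ordering would already satisfy a SortAggregate.
val df = spark.range(1000).selectExpr("id AS key", "id % 10 AS value")
  .sort($"key").cache()
df.groupBy("key").count().explain()
{code}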



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-15 Thread Shubham Chopra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927396#comment-15927396
 ] 

Shubham Chopra commented on SPARK-19803:


I am looking into this and will try to submit a fix in a day or so. Mostly 
trying to isolate the race condition and simplify the test cases. 

> Flaky BlockManagerProactiveReplicationSuite tests
> -
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Shubham Chopra
>  Labels: flaky-test
> Fix For: 2.2.0
>
>
> The tests added for BlockManagerProactiveReplicationSuite have made the 
> Jenkins build flaky. Please refer to the build for more details - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18591) Replace hash-based aggregates with sort-based ones if inputs already sorted

2017-03-15 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927388#comment-15927388
 ] 

Wenchen Fan commented on SPARK-18591:
-

Can we hold it for a while? I think we need to improve the planner first. We 
should either move back to bottom-up planning, or add a physical optimization 
phase. We need to figure out a better way to do planning.

> Replace hash-based aggregates with sort-based ones if inputs already sorted
> ---
>
> Key: SPARK-18591
> URL: https://issues.apache.org/jira/browse/SPARK-18591
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Takeshi Yamamuro
>
> Spark currently uses sort-based aggregates only in limited conditions: the 
> cases where Spark cannot use partial aggregates and hash-based ones.
> However, if the input ordering already satisfies the requirements of 
> sort-based aggregates, it seems sort-based ones are faster than hash-based ones.
> {code}
> ./bin/spark-shell --conf spark.sql.shuffle.partitions=1
> val df = spark.range(1000).selectExpr("id AS key", "id % 10 AS 
> value").sort($"key").cache
> def timer[R](block: => R): R = {
>   val t0 = System.nanoTime()
>   val result = block
>   val t1 = System.nanoTime()
>   println("Elapsed time: " + ((t1 - t0 + 0.0) / 10.0)+ "s")
>   result
> }
> timer {
>   df.groupBy("key").count().count
> }
> // codegen'd hash aggregate
> Elapsed time: 7.116962977s
> // non-codegen'd sort aggregate
> Elapsed time: 3.088816662s
> {code}
> If codegen'd sort-based aggregates are supported in SPARK-16844, this seems 
> to make the performance gap bigger;
> {code}
> - codegen'd sort aggregate
> Elapsed time: 1.645234684s
> {code} 
> Therefore, it'd be better to use sort-based ones in this case.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7420) Flaky test: o.a.s.streaming.JobGeneratorSuite "Do not clear received block data too soon"

2017-03-15 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927385#comment-15927385
 ] 

Imran Rashid commented on SPARK-7420:
-

This just [failed 
again|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74629/testReport/org.apache.spark.streaming.scheduler/JobGeneratorSuite/SPARK_6222__Do_not_clear_received_block_data_too_soon/]


{noformat}
sbt.ForkMain$ForkError: 
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
eventually never returned normally. Attempted 107 times over 10.24840867699 
seconds. Last failure message: getBlocksOfBatch(batchTime).nonEmpty was false.
at 
org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:336)
at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
at 
org.apache.spark.streaming.scheduler.JobGeneratorSuite$$anonfun$1$$anonfun$apply$mcV$sp$1.org$apache$spark$streaming$scheduler$JobGeneratorSuite$$anonfun$$anonfun$$waitForBlocksToBeAllocatedToBatch$1(JobGeneratorSuite.scala:105)
at 
org.apache.spark.streaming.scheduler.JobGeneratorSuite$$anonfun$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$1.apply$mcVI$sp(JobGeneratorSuite.scala:114)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at 
org.apache.spark.streaming.scheduler.JobGeneratorSuite$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(JobGeneratorSuite.scala:111)
at 
org.apache.spark.streaming.scheduler.JobGeneratorSuite$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(JobGeneratorSuite.scala:66)
at 
org.apache.spark.streaming.TestSuiteBase$class.withStreamingContext(TestSuiteBase.scala:279)
at ...
{noformat}

I'll also attach the relevant snippet from {{streaming/unit-test.log}}

> Flaky test: o.a.s.streaming.JobGeneratorSuite "Do not clear received block 
> data too soon"
> -
>
> Key: SPARK-7420
> URL: https://issues.apache.org/jira/browse/SPARK-7420
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Andrew Or
>Assignee: Tathagata Das
>Priority: Critical
>  Labels: flaky-test
> Attachments: trimmed-unit-tests-74629.log
>
>
> {code}
> The code passed to eventually never returned normally. Attempted 18 times 
> over 10.13803606001 seconds. Last failure message: 
> receiverTracker.hasUnallocatedBlocks was false.
> {code}
> It seems to be failing only in maven.
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-Maven-pre-YARN/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/458/
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/459/
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/2173/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7420) Flaky test: o.a.s.streaming.JobGeneratorSuite "Do not clear received block data too soon"

2017-03-15 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-7420:

Attachment: trimmed-unit-tests-74629.log

> Flaky test: o.a.s.streaming.JobGeneratorSuite "Do not clear received block 
> data too soon"
> -
>
> Key: SPARK-7420
> URL: https://issues.apache.org/jira/browse/SPARK-7420
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Andrew Or
>Assignee: Tathagata Das
>Priority: Critical
>  Labels: flaky-test
> Attachments: trimmed-unit-tests-74629.log
>
>
> {code}
> The code passed to eventually never returned normally. Attempted 18 times 
> over 10.13803606001 seconds. Last failure message: 
> receiverTracker.hasUnallocatedBlocks was false.
> {code}
> It seems to be failing only in maven.
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-Maven-pre-YARN/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/458/
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/459/
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/2173/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-7420) Flaky test: o.a.s.streaming.JobGeneratorSuite "Do not clear received block data too soon"

2017-03-15 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reopened SPARK-7420:
-
  Assignee: (was: Tathagata Das)

> Flaky test: o.a.s.streaming.JobGeneratorSuite "Do not clear received block 
> data too soon"
> -
>
> Key: SPARK-7420
> URL: https://issues.apache.org/jira/browse/SPARK-7420
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Andrew Or
>Priority: Critical
>  Labels: flaky-test
> Attachments: trimmed-unit-tests-74629.log
>
>
> {code}
> The code passed to eventually never returned normally. Attempted 18 times 
> over 10.13803606001 seconds. Last failure message: 
> receiverTracker.hasUnallocatedBlocks was false.
> {code}
> It seems to be failing only in maven.
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-Maven-pre-YARN/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/458/
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/459/
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/2173/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19966) SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate

2017-03-15 Thread 吴志龙 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

吴志龙 updated SPARK-19966:

Environment: 
redhot 
spark on yarn client


  was:
redhot spark on yarn client



> SparkListenerBus has already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate
> ---
>
> Key: SPARK-19966
> URL: https://issues.apache.org/jira/browse/SPARK-19966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: redhot 
> spark on yarn client
>Reporter: 吴志龙
>
> 2017-03-16 10:17:47|GET_HIVE_DATA|delete table |drop table 
> dp_tmp.tmp_20783_20170316;
> ERROR|2017-03-16 10:17:44|LiveListenerBus|SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(3,WrappedArray())
> >FAILED|GET_HIVE_DATA|[[[0]]], has exception



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19966) SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate

2017-03-15 Thread 吴志龙 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

吴志龙 updated SPARK-19966:

Environment: 
redhot spark on yarn client


  was:
hadoop 2.6.0
jdk 1.7
spark 2.1 
spark on yarn client



> SparkListenerBus has already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate
> ---
>
> Key: SPARK-19966
> URL: https://issues.apache.org/jira/browse/SPARK-19966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: redhot spark on yarn client
>Reporter: 吴志龙
>
> 2017-03-16 10:17:47|GET_HIVE_DATA|delete table |drop table 
> dp_tmp.tmp_20783_20170316;
> ERROR|2017-03-16 10:17:44|LiveListenerBus|SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(3,WrappedArray())
> >FAILED|GET_HIVE_DATA|[[[0]]], has exception



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19966) SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate

2017-03-15 Thread 吴志龙 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

吴志龙 updated SPARK-19966:

Description: 
2017-03-16 10:17:47|GET_HIVE_DATA|delete table |drop table 
dp_tmp.tmp_20783_20170316;
ERROR|2017-03-16 10:17:44|LiveListenerBus|SparkListenerBus has already stopped! 
Dropping event SparkListenerExecutorMetricsUpdate(3,WrappedArray())
>FAILED|GET_HIVE_DATA|[[[0]]], has exception

  was:
2017-03-16 10:17:47|GET_HIVE_DATA|delete temporary table|drop table dp_tmp.tmp_20783_20170316;
ERROR|2017-03-16 10:17:44|LiveListenerBus|SparkListenerBus has already stopped! 
Dropping event SparkListenerExecutorMetricsUpdate(3,WrappedArray())
>FAILED|GET_HIVE_DATA|[[[0]]] rows, has exception


> SparkListenerBus has already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate
> ---
>
> Key: SPARK-19966
> URL: https://issues.apache.org/jira/browse/SPARK-19966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: hadoop 2.6.0
> jdk 1.7
> spark 2.1 
> spark on yarn client
>Reporter: 吴志龙
>
> 2017-03-16 10:17:47|GET_HIVE_DATA|delete table |drop table 
> dp_tmp.tmp_20783_20170316;
> ERROR|2017-03-16 10:17:44|LiveListenerBus|SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(3,WrappedArray())
> >FAILED|GET_HIVE_DATA|[[[0]]], has exception



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19966) SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate

2017-03-15 Thread 吴志龙 (JIRA)
吴志龙 created SPARK-19966:
---

 Summary: SparkListenerBus has already stopped! Dropping event 
SparkListenerExecutorMetricsUpdate
 Key: SPARK-19966
 URL: https://issues.apache.org/jira/browse/SPARK-19966
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
 Environment: hadoop 2.6.0
jdk 1.7
spark 2.1 
spark on yarn client

Reporter: 吴志龙


2017-03-16 10:17:47|GET_HIVE_DATA|delete temporary table|drop table dp_tmp.tmp_20783_20170316;
ERROR|2017-03-16 10:17:44|LiveListenerBus|SparkListenerBus has already stopped! 
Dropping event SparkListenerExecutorMetricsUpdate(3,WrappedArray())
>FAILED|GET_HIVE_DATA|[[[0]]] rows, has exception



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19965) DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output

2017-03-15 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927333#comment-15927333
 ] 

Shixiong Zhu commented on SPARK-19965:
--

This is because inferring partitions doesn't ignore the "_spark_metadata" 
folder.
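For illustration, a hedged workaround until partition inference skips the 
sink's metadata directory (outputDir is the streaming sink's output path from 
the reproducer quoted below):

{code}
// Reading the sink's root directory lets Spark pick up the _spark_metadata log,
// so the metadata folder is not mistaken for a partition directory.
spark.read.parquet(outputDir).show()

// The failing pattern globs every child directory, which pulls _spark_metadata
// into partition inference and trips the "Conflicting directory structures" assertion:
//   spark.read.option("basePath", outputDir).parquet(outputDir + "/*").show()
{code}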

> DataFrame batch reader may fail to infer partitions when reading 
> FileStreamSink's output
> 
>
> Key: SPARK-19965
> URL: https://issues.apache.org/jira/browse/SPARK-19965
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>
> Reproducer
> {code}
>   test("partitioned writing and batch reading with 'basePath'") {
> val inputData = MemoryStream[Int]
> val ds = inputData.toDS()
> val outputDir = Utils.createTempDir(namePrefix = 
> "stream.output").getCanonicalPath
> val checkpointDir = Utils.createTempDir(namePrefix = 
> "stream.checkpoint").getCanonicalPath
> var query: StreamingQuery = null
> try {
>   query =
> ds.map(i => (i, i * 1000))
>   .toDF("id", "value")
>   .writeStream
>   .partitionBy("id")
>   .option("checkpointLocation", checkpointDir)
>   .format("parquet")
>   .start(outputDir)
>   inputData.addData(1, 2, 3)
>   failAfter(streamingTimeout) {
> query.processAllAvailable()
>   }
>   spark.read.option("basePath", outputDir).parquet(outputDir + 
> "/*").show()
> } finally {
>   if (query != null) {
> query.stop()
>   }
> }
>   }
> {code}
> Stack trace
> {code}
> [info] - partitioned writing and batch reading with 'basePath' *** FAILED *** 
> (3 seconds, 928 milliseconds)
> [info]   java.lang.AssertionError: assertion failed: Conflicting directory 
> structures detected. Suspicious paths:
> [info]***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637
> [info]
> ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637/_spark_metadata
> [info] 
> [info] If provided paths are partition directories, please set "basePath" in 
> the options of the data source to specify the root directory of the table. If 
> there are multiple root directories, please load them separately and then 
> union them.
> [info]   at scala.Predef$.assert(Predef.scala:170)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:133)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:98)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:156)
> [info]   at 
> org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:54)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:55)
> [info]   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:133)
> [info]   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
> [info]   at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:160)
> [info]   at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:536)
> [info]   at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:520)
> [info]   at 
> org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply$mcV$sp(FileStreamSinkSuite.scala:292)
> [info]   at 
> org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
> [info]   at 
> org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19965) DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output

2017-03-15 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-19965:


 Summary: DataFrame batch reader may fail to infer partitions when 
reading FileStreamSink's output
 Key: SPARK-19965
 URL: https://issues.apache.org/jira/browse/SPARK-19965
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Shixiong Zhu


Reproducer
{code}
  test("partitioned writing and batch reading with 'basePath'") {
val inputData = MemoryStream[Int]
val ds = inputData.toDS()

val outputDir = Utils.createTempDir(namePrefix = 
"stream.output").getCanonicalPath
val checkpointDir = Utils.createTempDir(namePrefix = 
"stream.checkpoint").getCanonicalPath

var query: StreamingQuery = null

try {
  query =
ds.map(i => (i, i * 1000))
  .toDF("id", "value")
  .writeStream
  .partitionBy("id")
  .option("checkpointLocation", checkpointDir)
  .format("parquet")
  .start(outputDir)

  inputData.addData(1, 2, 3)
  failAfter(streamingTimeout) {
query.processAllAvailable()
  }

  spark.read.option("basePath", outputDir).parquet(outputDir + "/*").show()
} finally {
  if (query != null) {
query.stop()
  }
}
  }
{code}

Stack trace
{code}
[info] - partitioned writing and batch reading with 'basePath' *** FAILED *** 
(3 seconds, 928 milliseconds)
[info]   java.lang.AssertionError: assertion failed: Conflicting directory 
structures detected. Suspicious paths:
[info]  ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637
[info]  ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637/_spark_metadata
[info] 
[info] If provided paths are partition directories, please set "basePath" in 
the options of the data source to specify the root directory of the table. If 
there are multiple root directories, please load them separately and then union 
them.
[info]   at scala.Predef$.assert(Predef.scala:170)
[info]   at 
org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:133)
[info]   at 
org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:98)
[info]   at 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:156)
[info]   at 
org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:54)
[info]   at 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:55)
[info]   at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:133)
[info]   at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
[info]   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:160)
[info]   at 
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:536)
[info]   at 
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:520)
[info]   at 
org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply$mcV$sp(FileStreamSinkSuite.scala:292)
[info]   at 
org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
[info]   at 
org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-15 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927263#comment-15927263
 ] 

Kay Ousterhout commented on SPARK-19803:


This failed again today: 

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74621/testReport/org.apache.spark.storage/BlockManagerProactiveReplicationSuite/proactive_block_replication___3_replicas___2_block_manager_deletions/

> Flaky BlockManagerProactiveReplicationSuite tests
> -
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Shubham Chopra
>  Labels: flaky-test
> Fix For: 2.2.0
>
>
> The tests added for BlockManagerProactiveReplicationSuite have made the 
> Jenkins build flaky. Please refer to the build for more details - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19751) Create Data frame API fails with a self referencing bean

2017-03-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19751.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17188
[https://github.com/apache/spark/pull/17188]

> Create Data frame API fails with a self referencing bean
> 
>
> Key: SPARK-19751
> URL: https://issues.apache.org/jira/browse/SPARK-19751
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Avinash Venkateshaiah
>Priority: Minor
> Fix For: 2.2.0
>
>
> The createDataset API throws a stack overflow exception when we try to create 
> a Dataset using a bean encoder. The bean is self-referencing.
> BEAN:
> public class HierObj implements Serializable {
> String name;
> List<HierObj> children;
> public String getName() {
> return name;
> }
> public void setName(String name) {
> this.name = name;
> }
> public List<HierObj> getChildren() {
> return children;
> }
> public void setChildren(List<HierObj> children) {
> this.children = children;
> }
> }
> // create an object
> HierObj hierObj = new HierObj();
> hierObj.setName("parent");
> List<HierObj> children = new ArrayList<HierObj>();
> HierObj child1 = new HierObj();
> child1.setName("child1");
> HierObj child2 = new HierObj();
> child2.setName("child2");
> children.add(child1);
> children.add(child2);
> hierObj.setChildren(children);
> // create a dataset
> Dataset<HierObj> ds = sparkSession().createDataset(Arrays.asList(hierObj), 
> Encoders.bean(HierObj.class));



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19751) Create Data frame API fails with a self referencing bean

2017-03-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19751:
---

Assignee: Takeshi Yamamuro

> Create Data frame API fails with a self referencing bean
> 
>
> Key: SPARK-19751
> URL: https://issues.apache.org/jira/browse/SPARK-19751
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Avinash Venkateshaiah
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.2.0
>
>
> The createDataset API throws a stack overflow exception when we try to create 
> a Dataset using a bean encoder. The bean is self-referencing.
> BEAN:
> public class HierObj implements Serializable {
> String name;
> List<HierObj> children;
> public String getName() {
> return name;
> }
> public void setName(String name) {
> this.name = name;
> }
> public List<HierObj> getChildren() {
> return children;
> }
> public void setChildren(List<HierObj> children) {
> this.children = children;
> }
> }
> // create an object
> HierObj hierObj = new HierObj();
> hierObj.setName("parent");
> List<HierObj> children = new ArrayList<HierObj>();
> HierObj child1 = new HierObj();
> child1.setName("child1");
> HierObj child2 = new HierObj();
> child2.setName("child2");
> children.add(child1);
> children.add(child2);
> hierObj.setChildren(children);
> // create a dataset
> Dataset<HierObj> ds = sparkSession().createDataset(Arrays.asList(hierObj), 
> Encoders.bean(HierObj.class));



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19961) unify a exception erro msg for dropdatabase

2017-03-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19961.
-
   Resolution: Fixed
 Assignee: Song Jun
Fix Version/s: 2.2.0

> unify a exception erro msg for dropdatabase
> ---
>
> Key: SPARK-19961
> URL: https://issues.apache.org/jira/browse/SPARK-19961
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Song Jun
>Assignee: Song Jun
>Priority: Trivial
> Fix For: 2.2.0
>
>
> Unify the exception error message for dropDatabase when the database still 
> has some tables.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19948) Document that saveAsTable uses catalog as source of truth for table existence.

2017-03-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19948:
---

Assignee: Juliusz Sompolski

> Document that saveAsTable uses catalog as source of truth for table existence.
> --
>
> Key: SPARK-19948
> URL: https://issues.apache.org/jira/browse/SPARK-19948
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
> Fix For: 2.2.0
>
>
> It is quirky behaviour that saveAsTable to, e.g., a JDBC source with a 
> SaveMode other than Overwrite will nevertheless overwrite the table in the 
> external source, if that table was not a catalog table.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19948) Document that saveAsTable uses catalog as source of truth for table existence.

2017-03-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19948.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17289
[https://github.com/apache/spark/pull/17289]

> Document that saveAsTable uses catalog as source of truth for table existence.
> --
>
> Key: SPARK-19948
> URL: https://issues.apache.org/jira/browse/SPARK-19948
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Juliusz Sompolski
> Fix For: 2.2.0
>
>
> It is quirky behaviour that saveAsTable to, e.g., a JDBC source with a 
> SaveMode other than Overwrite will nevertheless overwrite the table in the 
> external source, if that table was not a catalog table.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19931) InMemoryTableScanExec should rewrite output partitioning and ordering when aliasing output attributes

2017-03-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19931:
---

Assignee: Liang-Chi Hsieh

> InMemoryTableScanExec should rewrite output partitioning and ordering when 
> aliasing output attributes
> -
>
> Key: SPARK-19931
> URL: https://issues.apache.org/jira/browse/SPARK-19931
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.2.0
>
>
> Now InMemoryTableScanExec simply takes the outputPartitioning and 
> outputOrdering from the associated InMemoryRelation's 
> child.outputPartitioning and outputOrdering.
> However, InMemoryTableScanExec can alias the output attributes. In this case, 
> its outputPartitioning and outputOrdering are not correct and its parent 
> operators can't correctly determine its data distribution.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19931) InMemoryTableScanExec should rewrite output partitioning and ordering when aliasing output attributes

2017-03-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19931.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17175
[https://github.com/apache/spark/pull/17175]

> InMemoryTableScanExec should rewrite output partitioning and ordering when 
> aliasing output attributes
> -
>
> Key: SPARK-19931
> URL: https://issues.apache.org/jira/browse/SPARK-19931
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Liang-Chi Hsieh
> Fix For: 2.2.0
>
>
> Now InMemoryTableScanExec simply takes the outputPartitioning and 
> outputOrdering from the associated InMemoryRelation's 
> child.outputPartitioning and outputOrdering.
> However, InMemoryTableScanExec can alias the output attributes. In this case, 
> its outputPartitioning and outputOrdering are not correct and its parent 
> operators can't correctly determine its data distribution.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)

2017-03-15 Thread Yash Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927240#comment-15927240
 ] 

Yash Sharma commented on SPARK-7736:


This does not seem Fixed. The application still completes with SUCCESS status 
even when an exception is thrown from the application.
Spark version 2.0.2.

> Exception not failing Python applications (in yarn cluster mode)
> 
>
> Key: SPARK-7736
> URL: https://issues.apache.org/jira/browse/SPARK-7736
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
> Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04
>Reporter: Shay Rojansky
>Assignee: Marcelo Vanzin
> Fix For: 1.5.1, 1.6.0
>
>
> It seems that exceptions thrown in Python spark apps after the SparkContext 
> is instantiated don't cause the application to fail, at least in Yarn: the 
> application is marked as SUCCEEDED.
> Note that any exception right before the SparkContext correctly places the 
> application in FAILED state.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18066) Add Pool usage policies test coverage for FIFO & FAIR Schedulers

2017-03-15 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-18066.

   Resolution: Fixed
 Assignee: Eren Avsarogullari
Fix Version/s: 2.2.0

> Add Pool usage policies test coverage for FIFO & FAIR Schedulers
> 
>
> Key: SPARK-18066
> URL: https://issues.apache.org/jira/browse/SPARK-18066
> Project: Spark
>  Issue Type: Test
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Eren Avsarogullari
>Assignee: Eren Avsarogullari
>Priority: Minor
> Fix For: 2.2.0
>
>
> The following Pool usage cases need to have unit test coverage:
> - FIFO Scheduler just uses the *root pool*, so even if the *spark.scheduler.pool* 
> property is set, the related pool is not created and *TaskSetManagers* are added 
> to the root pool.
> - FAIR Scheduler uses the default pool when the *spark.scheduler.pool* property is 
> not set. This can happen when the Properties object is null or empty (*new 
> Properties()*) or points to the default pool (*spark.scheduler.pool*=_default_).
> - FAIR Scheduler creates a new pool with default values when the 
> *spark.scheduler.pool* property points to a _non-existent_ pool. This can 
> happen when the scheduler allocation file is not set or does not contain the 
> related pool. (A configuration sketch follows below.)
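
The cases above reduce to how the {{spark.scheduler.pool}} local property 
interacts with the scheduling mode; a configuration sketch (pool name and 
allocation-file path are placeholders, not from this ticket):

{code}
# Sketch only: pool name and file path are placeholders.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("pool-usage-sketch")
        .set("spark.scheduler.mode", "FAIR")  # or "FIFO"
        .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml"))
sc = SparkContext(conf=conf)

# FAIR: jobs submitted from this thread go to "pool1"; if "pool1" is not in the
# allocation file, a pool with default values is created for it.
# FIFO: the property is effectively ignored and TaskSetManagers go to the root pool.
sc.setLocalProperty("spark.scheduler.pool", "pool1")
sc.parallelize(range(100)).count()
{code}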



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition

2017-03-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927122#comment-15927122
 ] 

Hyukjin Kwon commented on SPARK-19872:
--

Yup, test was added.

> UnicodeDecodeError in Pyspark on sc.textFile read with repartition
> --
>
> Key: SPARK-19872
> URL: https://issues.apache.org/jira/browse/SPARK-19872
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Mac and EC2
>Reporter: Brian Bruggeman
>Assignee: Hyukjin Kwon
> Fix For: 2.1.1, 2.2.0
>
>
> I'm receiving the following traceback:
> {code}
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 140, in _load_from_socket
> for item in serializer.load_stream(rf):
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 539, in load_stream
> yield self.loads(stream)
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 534, in loads
> return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: 
> invalid start byte
> {code}
> I created a textfile (text.txt) with standard linux newlines:
> {code}
> a
> b
> d
> e
> f
> g
> h
> i
> j
> k
> l
> {code}
> I then ran pyspark:
> {code}
> $ pyspark
> Python 2.7.13 (default, Dec 18 2016, 07:03:39)
> [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>   /_/
> Using Python version 2.7.13 (default, Dec 18 2016 07:03:39)
> SparkSession available as 'spark'.
> >>> sc.textFile('test.txt').collect()
> [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
> >>> sc.textFile('test.txt', use_unicode=False).collect()
> ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
> >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect()
> ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', 
> '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 140, in _load_from_socket
> for item in serializer.load_stream(rf):
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 539, in load_stream
> yield self.loads(stream)
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 534, in loads
> return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: 
> invalid start byte
> {code}
> This really looks like a bug in the `serializers.py` code.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.

2017-03-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927111#comment-15927111
 ] 

Hyukjin Kwon commented on SPARK-19954:
--

Honestly, we don't usually set {{Blocker}} 

{quote}
Priority. Set to Major or below; higher priorities are generally reserved for 
committers to set
{quote}

I don't want to repeat what is written in the guidelines multiple times in the 
future.


> Joining to a unioned DataFrame does not produce expected result.
> 
>
> Key: SPARK-19954
> URL: https://issues.apache.org/jira/browse/SPARK-19954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Arun Allamsetty
>Priority: Blocker
>
> I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is 
> that when we try to join two DataFrames, one of which is a result of a union 
> operation, the result of the join results in data as if the table was joined 
> only to the first table in the union. This issue is not present in Spark 
> 2.0.0 or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it.
> {noformat}
> import spark.implicits._
> import org.apache.spark.sql.functions.lit
> case class A(id: Long, colA: Boolean)
> case class B(id: Long, colB: Int)
> case class C(id: Long, colC: Double)
> case class X(id: Long, name: String)
> val aData = A(1, true) :: Nil
> val bData = B(2, 10) :: Nil
> val cData = C(3, 9.73D) :: Nil
> val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil
> val aDf = spark.createDataset(aData).toDF
> val bDf = spark.createDataset(bData).toDF
> val cDf = spark.createDataset(cData).toDF
> val xDf = spark.createDataset(xData).toDF
> val unionDf =
>   aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), 
> lit(null).as("colC")).union(
>   bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", 
> lit(null).as("colC"))).union(
>   cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), 
> lit(null).as("colB"), $"colC"))
> val result = xDf.join(unionDf, unionDf("name") === xDf("name") && 
> unionDf("id") === xDf("id"))
> result.show
> {noformat}
> The result being
> {noformat}
> +---++---+++++
> | id|name| id|name|colA|colB|colC|
> +---++---+++++
> |  1|   a|  1|   a|true|null|null|
> +---++---+++++
> {noformat}
> Force computing {{unionDf}} using {{count}} does not help change the result 
> of the join. However, writing the data to disk and reading it back does give 
> the correct result. But it is definitely not ideal. Interestingly caching the 
> {{unionDf}} also gives the correct result.
> {noformat}
> +---++---+++++
> | id|name| id|name|colA|colB|colC|
> +---++---+++++
> |  1|   a|  1|   a|true|null|null|
> |  2|   b|  2|   b|null|  10|null|
> |  3|   c|  3|   c|null|null|9.73|
> +---++---+++++
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.

2017-03-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927103#comment-15927103
 ] 

Hyukjin Kwon commented on SPARK-19954:
--

I was following "Contributing to JIRA Maintenance" 
http://spark.apache.org/contributing.html

{quote}
For issues that can’t be reproduced against master as reported, resolve as 
Cannot Reproduce
{quote}

Now I can identify it. I am resolving this as {{Duplicate}}, as written:

{quote}
If the issue is the same as or a subset of another issue, resolved as Duplicate
{quote}


> Joining to a unioned DataFrame does not produce expected result.
> 
>
> Key: SPARK-19954
> URL: https://issues.apache.org/jira/browse/SPARK-19954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Arun Allamsetty
>Priority: Blocker
>
> I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is 
> that when we try to join two DataFrames, one of which is a result of a union 
> operation, the result of the join results in data as if the table was joined 
> only to the first table in the union. This issue is not present in Spark 
> 2.0.0 or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it.
> {noformat}
> import spark.implicits._
> import org.apache.spark.sql.functions.lit
> case class A(id: Long, colA: Boolean)
> case class B(id: Long, colB: Int)
> case class C(id: Long, colC: Double)
> case class X(id: Long, name: String)
> val aData = A(1, true) :: Nil
> val bData = B(2, 10) :: Nil
> val cData = C(3, 9.73D) :: Nil
> val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil
> val aDf = spark.createDataset(aData).toDF
> val bDf = spark.createDataset(bData).toDF
> val cDf = spark.createDataset(cData).toDF
> val xDf = spark.createDataset(xData).toDF
> val unionDf =
>   aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), 
> lit(null).as("colC")).union(
>   bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", 
> lit(null).as("colC"))).union(
>   cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), 
> lit(null).as("colB"), $"colC"))
> val result = xDf.join(unionDf, unionDf("name") === xDf("name") && 
> unionDf("id") === xDf("id"))
> result.show
> {noformat}
> The result being
> {noformat}
> +---++---+++++
> | id|name| id|name|colA|colB|colC|
> +---++---+++++
> |  1|   a|  1|   a|true|null|null|
> +---++---+++++
> {noformat}
> Force computing {{unionDf}} using {{count}} does not help change the result 
> of the join. However, writing the data to disk and reading it back does give 
> the correct result. But it is definitely not ideal. Interestingly caching the 
> {{unionDf}} also gives the correct result.
> {noformat}
> +---++---+++++
> | id|name| id|name|colA|colB|colC|
> +---++---+++++
> |  1|   a|  1|   a|true|null|null|
> |  2|   b|  2|   b|null|  10|null|
> |  3|   c|  3|   c|null|null|9.73|
> +---++---+++++
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.

2017-03-15 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-19954:
--

> Joining to a unioned DataFrame does not produce expected result.
> 
>
> Key: SPARK-19954
> URL: https://issues.apache.org/jira/browse/SPARK-19954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Arun Allamsetty
>Priority: Blocker
>
> I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is 
> that when we try to join two DataFrames, one of which is a result of a union 
> operation, the result of the join results in data as if the table was joined 
> only to the first table in the union. This issue is not present in Spark 
> 2.0.0 or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it.
> {noformat}
> import spark.implicits._
> import org.apache.spark.sql.functions.lit
> case class A(id: Long, colA: Boolean)
> case class B(id: Long, colB: Int)
> case class C(id: Long, colC: Double)
> case class X(id: Long, name: String)
> val aData = A(1, true) :: Nil
> val bData = B(2, 10) :: Nil
> val cData = C(3, 9.73D) :: Nil
> val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil
> val aDf = spark.createDataset(aData).toDF
> val bDf = spark.createDataset(bData).toDF
> val cDf = spark.createDataset(cData).toDF
> val xDf = spark.createDataset(xData).toDF
> val unionDf =
>   aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), 
> lit(null).as("colC")).union(
>   bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", 
> lit(null).as("colC"))).union(
>   cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), 
> lit(null).as("colB"), $"colC"))
> val result = xDf.join(unionDf, unionDf("name") === xDf("name") && 
> unionDf("id") === xDf("id"))
> result.show
> {noformat}
> The result being
> {noformat}
> +---++---+++++
> | id|name| id|name|colA|colB|colC|
> +---++---+++++
> |  1|   a|  1|   a|true|null|null|
> +---++---+++++
> {noformat}
> Force computing {{unionDf}} using {{count}} does not help change the result 
> of the join. However, writing the data to disk and reading it back does give 
> the correct result. But it is definitely not ideal. Interestingly caching the 
> {{unionDf}} also gives the correct result.
> {noformat}
> +---++---+++++
> | id|name| id|name|colA|colB|colC|
> +---++---+++++
> |  1|   a|  1|   a|true|null|null|
> |  2|   b|  2|   b|null|  10|null|
> |  3|   c|  3|   c|null|null|9.73|
> +---++---+++++
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.

2017-03-15 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19954.
--
Resolution: Duplicate

> Joining to a unioned DataFrame does not produce expected result.
> 
>
> Key: SPARK-19954
> URL: https://issues.apache.org/jira/browse/SPARK-19954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Arun Allamsetty
>Priority: Blocker
>
> I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is 
> that when we try to join two DataFrames, one of which is a result of a union 
> operation, the result of the join results in data as if the table was joined 
> only to the first table in the union. This issue is not present in Spark 
> 2.0.0 or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it.
> {noformat}
> import spark.implicits._
> import org.apache.spark.sql.functions.lit
> case class A(id: Long, colA: Boolean)
> case class B(id: Long, colB: Int)
> case class C(id: Long, colC: Double)
> case class X(id: Long, name: String)
> val aData = A(1, true) :: Nil
> val bData = B(2, 10) :: Nil
> val cData = C(3, 9.73D) :: Nil
> val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil
> val aDf = spark.createDataset(aData).toDF
> val bDf = spark.createDataset(bData).toDF
> val cDf = spark.createDataset(cData).toDF
> val xDf = spark.createDataset(xData).toDF
> val unionDf =
>   aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), 
> lit(null).as("colC")).union(
>   bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", 
> lit(null).as("colC"))).union(
>   cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), 
> lit(null).as("colB"), $"colC"))
> val result = xDf.join(unionDf, unionDf("name") === xDf("name") && 
> unionDf("id") === xDf("id"))
> result.show
> {noformat}
> The result being
> {noformat}
> +---++---+++++
> | id|name| id|name|colA|colB|colC|
> +---++---+++++
> |  1|   a|  1|   a|true|null|null|
> +---++---+++++
> {noformat}
> Force computing {{unionDf}} using {{count}} does not help change the result 
> of the join. However, writing the data to disk and reading it back does give 
> the correct result. But it is definitely not ideal. Interestingly caching the 
> {{unionDf}} also gives the correct result.
> {noformat}
> +---++---+++++
> | id|name| id|name|colA|colB|colC|
> +---++---+++++
> |  1|   a|  1|   a|true|null|null|
> |  2|   b|  2|   b|null|  10|null|
> |  3|   c|  3|   c|null|null|9.73|
> +---++---+++++
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13369) Number of consecutive fetch failures for a stage before the job is aborted should be configurable

2017-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927085#comment-15927085
 ] 

Apache Spark commented on SPARK-13369:
--

User 'sitalkedia' has created a pull request for this issue:
https://github.com/apache/spark/pull/17307

>  Number of consecutive fetch failures for a stage before the job is aborted 
> should be configurable
> --
>
> Key: SPARK-13369
> URL: https://issues.apache.org/jira/browse/SPARK-13369
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Sital Kedia
>Priority: Minor
>
> Currently it is hardcoded in the code. We need to make it configurable because, 
> for long-running jobs, the chances of fetch failures due to machine reboots are 
> high, and we need a configuration parameter to bump up that number. 
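
As an illustration of what "configurable" would look like from the application 
side (the property name below is an assumption made for this sketch, not 
something settled in this ticket):

{code}
# Sketch only: "spark.stage.maxConsecutiveAttempts" is an assumed property name.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("long-running-job")
        .set("spark.stage.maxConsecutiveAttempts", "8"))  # bump above the hardcoded default
sc = SparkContext(conf=conf)
{code}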



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.

2017-03-15 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927071#comment-15927071
 ] 

Herman van Hovell commented on SPARK-19954:
---

This is reproducible on Spark 2.1. It is caused by a bug in the 
{{ConstantFolding}} optimizer rule, which has been fixed in SPARK-18300.

I think we will be doing a 2.1.x maintenance release soon, and that will 
contain this fix.

> Joining to a unioned DataFrame does not produce expected result.
> 
>
> Key: SPARK-19954
> URL: https://issues.apache.org/jira/browse/SPARK-19954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Arun Allamsetty
>Priority: Blocker
>
> I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is 
> that when we try to join two DataFrames, one of which is a result of a union 
> operation, the result of the join results in data as if the table was joined 
> only to the first table in the union. This issue is not present in Spark 
> 2.0.0 or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it.
> {noformat}
> import spark.implicits._
> import org.apache.spark.sql.functions.lit
> case class A(id: Long, colA: Boolean)
> case class B(id: Long, colB: Int)
> case class C(id: Long, colC: Double)
> case class X(id: Long, name: String)
> val aData = A(1, true) :: Nil
> val bData = B(2, 10) :: Nil
> val cData = C(3, 9.73D) :: Nil
> val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil
> val aDf = spark.createDataset(aData).toDF
> val bDf = spark.createDataset(bData).toDF
> val cDf = spark.createDataset(cData).toDF
> val xDf = spark.createDataset(xData).toDF
> val unionDf =
>   aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), 
> lit(null).as("colC")).union(
>   bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", 
> lit(null).as("colC"))).union(
>   cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), 
> lit(null).as("colB"), $"colC"))
> val result = xDf.join(unionDf, unionDf("name") === xDf("name") && 
> unionDf("id") === xDf("id"))
> result.show
> {noformat}
> The result being
> {noformat}
> +---++---+++++
> | id|name| id|name|colA|colB|colC|
> +---++---+++++
> |  1|   a|  1|   a|true|null|null|
> +---++---+++++
> {noformat}
> Force computing {{unionDf}} using {{count}} does not help change the result 
> of the join. However, writing the data to disk and reading it back does give 
> the correct result. But it is definitely not ideal. Interestingly caching the 
> {{unionDf}} also gives the correct result.
> {noformat}
> +---++---+++++
> | id|name| id|name|colA|colB|colC|
> +---++---+++++
> |  1|   a|  1|   a|true|null|null|
> |  2|   b|  2|   b|null|  10|null|
> |  3|   c|  3|   c|null|null|9.73|
> +---++---+++++
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19960) Move `SparkHadoopWriter` to `internal/io/`

2017-03-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-19960.
-
   Resolution: Fixed
 Assignee: Jiang Xingbo
Fix Version/s: 2.2.0

> Move `SparkHadoopWriter` to `internal/io/`
> --
>
> Key: SPARK-19960
> URL: https://issues.apache.org/jira/browse/SPARK-19960
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
> Fix For: 2.2.0
>
>
> We should move `SparkHadoopWriter` to `internal/io/`, which will make it 
> easier to consolidate `SparkHadoopWriter` and `SparkHadoopMapReduceWriter`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition

2017-03-15 Thread Brian Bruggeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926992#comment-15926992
 ] 

Brian Bruggeman commented on SPARK-19872:
-

Wondering if a test will be added to prevent future regressions.

> UnicodeDecodeError in Pyspark on sc.textFile read with repartition
> --
>
> Key: SPARK-19872
> URL: https://issues.apache.org/jira/browse/SPARK-19872
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Mac and EC2
>Reporter: Brian Bruggeman
>Assignee: Hyukjin Kwon
> Fix For: 2.1.1, 2.2.0
>
>
> I'm receiving the following traceback:
> {code}
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 140, in _load_from_socket
> for item in serializer.load_stream(rf):
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 539, in load_stream
> yield self.loads(stream)
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 534, in loads
> return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: 
> invalid start byte
> {code}
> I created a textfile (text.txt) with standard linux newlines:
> {code}
> a
> b
> d
> e
> f
> g
> h
> i
> j
> k
> l
> {code}
> I then ran pyspark:
> {code}
> $ pyspark
> Python 2.7.13 (default, Dec 18 2016, 07:03:39)
> [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>   /_/
> Using Python version 2.7.13 (default, Dec 18 2016 07:03:39)
> SparkSession available as 'spark'.
> >>> sc.textFile('test.txt').collect()
> [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
> >>> sc.textFile('test.txt', use_unicode=False).collect()
> ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
> >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect()
> ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', 
> '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 140, in _load_from_socket
> for item in serializer.load_stream(rf):
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 539, in load_stream
> yield self.loads(stream)
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 534, in loads
> return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: 
> invalid start byte
> {code}
> This really looks like a bug in the `serializers.py` code.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.

2017-03-15 Thread Arun Allamsetty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926882#comment-15926882
 ] 

Arun Allamsetty commented on SPARK-19954:
-

[~maropu] and [~hyukjin.kwon], were you able to reproduce the issue on the 
released Spark 2.1 though?

> Joining to a unioned DataFrame does not produce expected result.
> 
>
> Key: SPARK-19954
> URL: https://issues.apache.org/jira/browse/SPARK-19954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Arun Allamsetty
>Priority: Blocker
>
> I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is 
> that when we try to join two DataFrames, one of which is a result of a union 
> operation, the result of the join results in data as if the table was joined 
> only to the first table in the union. This issue is not present in Spark 
> 2.0.0 or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it.
> {noformat}
> import spark.implicits._
> import org.apache.spark.sql.functions.lit
> case class A(id: Long, colA: Boolean)
> case class B(id: Long, colB: Int)
> case class C(id: Long, colC: Double)
> case class X(id: Long, name: String)
> val aData = A(1, true) :: Nil
> val bData = B(2, 10) :: Nil
> val cData = C(3, 9.73D) :: Nil
> val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil
> val aDf = spark.createDataset(aData).toDF
> val bDf = spark.createDataset(bData).toDF
> val cDf = spark.createDataset(cData).toDF
> val xDf = spark.createDataset(xData).toDF
> val unionDf =
>   aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), 
> lit(null).as("colC")).union(
>   bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", 
> lit(null).as("colC"))).union(
>   cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), 
> lit(null).as("colB"), $"colC"))
> val result = xDf.join(unionDf, unionDf("name") === xDf("name") && 
> unionDf("id") === xDf("id"))
> result.show
> {noformat}
> The result being
> {noformat}
> +---++---+++++
> | id|name| id|name|colA|colB|colC|
> +---++---+++++
> |  1|   a|  1|   a|true|null|null|
> +---++---+++++
> {noformat}
> Force computing {{unionDf}} using {{count}} does not help change the result 
> of the join. However, writing the data to disk and reading it back does give 
> the correct result. But it is definitely not ideal. Interestingly caching the 
> {{unionDf}} also gives the correct result.
> {noformat}
> +---++---+++++
> | id|name| id|name|colA|colB|colC|
> +---++---+++++
> |  1|   a|  1|   a|true|null|null|
> |  2|   b|  2|   b|null|  10|null|
> |  3|   c|  3|   c|null|null|9.73|
> +---++---+++++
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19964) SparkSubmitSuite fails due to Timeout

2017-03-15 Thread Eren Avsarogullari (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-19964:
---
Attachment: SparkSubmitSuite_Stacktrace

> SparkSubmitSuite fails due to Timeout
> -
>
> Key: SPARK-19964
> URL: https://issues.apache.org/jira/browse/SPARK-19964
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Tests
>Affects Versions: 2.2.0
>Reporter: Eren Avsarogullari
>  Labels: flaky-test
> Attachments: SparkSubmitSuite_Stacktrace
>
>
> The following test case has failed due to a TestFailedDueToTimeoutException:
> *Test Suite:* SparkSubmitSuite
> *Test Case:* includes jars passed in through --packages
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/
> *Stacktrace is also attached.*



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19964) SparkSubmitSuite fails due to Timeout

2017-03-15 Thread Eren Avsarogullari (JIRA)
Eren Avsarogullari created SPARK-19964:
--

 Summary: SparkSubmitSuite fails due to Timeout
 Key: SPARK-19964
 URL: https://issues.apache.org/jira/browse/SPARK-19964
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Tests
Affects Versions: 2.2.0
Reporter: Eren Avsarogullari


The following test case has failed due to a TestFailedDueToTimeoutException:
*Test Suite:* SparkSubmitSuite
*Test Case:* includes jars passed in through --packages
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/
*Stacktrace is also attached.*



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13450) SortMergeJoin will OOM when join rows have lot of same keys

2017-03-15 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-13450.
---
   Resolution: Fixed
 Assignee: Tejas Patil
Fix Version/s: 2.2.0

> SortMergeJoin will OOM when join rows have lot of same keys
> ---
>
> Key: SPARK-13450
> URL: https://issues.apache.org/jira/browse/SPARK-13450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.2, 2.1.0
>Reporter: Hong Shen
>Assignee: Tejas Patil
> Fix For: 2.2.0
>
> Attachments: heap-dump-analysis.png
>
>
>   When I run a SQL query with a join, a task throws java.lang.OutOfMemoryError 
> and the query fails. I have set spark.executor.memory to 4096m.
>   SortMergeJoin uses an ArrayBuffer[InternalRow] to store bufferedMatches; if 
> the join rows share a lot of the same keys, it will throw OutOfMemoryError.
> {code}
>   /** Buffered rows from the buffered side of the join. This is empty if 
> there are no matches. */
>   private[this] val bufferedMatches: ArrayBuffer[InternalRow] = new 
> ArrayBuffer[InternalRow]
> {code}
>   Here is the stackTrace:
> {code}
> org.xerial.snappy.SnappyNative.arrayCopy(Native Method)
> org.xerial.snappy.Snappy.arrayCopy(Snappy.java:84)
> org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190)
> org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163)
> java.io.DataInputStream.readFully(DataInputStream.java:195)
> java.io.DataInputStream.readLong(DataInputStream.java:416)
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71)
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79)
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136)
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123)
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:84)
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoin.scala:300)
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.bufferMatchingRows(SortMergeJoin.scala:329)
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoin.scala:229)
> org.apache.spark.sql.execution.joins.SortMergeJoin$$anonfun$doExecute$1$$anon$1.advanceNext(SortMergeJoin.scala:105)
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741)
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:301)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:301)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> org.apache.spark.scheduler.Task.run(Task.scala:89)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:215)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:744)
> {code}
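
A sketch of the skewed-join shape described above (row counts are arbitrary and 
not from this ticket; with enough rows sharing a single key, the buffered 
matches for that key alone can exhaust executor memory):

{code}
# Skewed sort-merge join sketch; sizes are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("smj-skew-sketch").getOrCreate()
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)  # force a sort-merge join

n = 1000000  # scale up until executor memory is exhausted
left = spark.range(0, n).withColumn("k", F.lit(1))    # every row shares join key 1
right = spark.range(0, n).withColumn("k", F.lit(1))

# All matches for key 1 on the buffered side are held in bufferedMatches at once.
left.join(right, "k").count()
{code}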



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19963) create view from select Fails when nullif() is used

2017-03-15 Thread Jay Danielsen (JIRA)
Jay Danielsen created SPARK-19963:
-

 Summary: create view from select Fails when nullif() is used
 Key: SPARK-19963
 URL: https://issues.apache.org/jira/browse/SPARK-19963
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Jay Danielsen
Priority: Minor


Test case: any valid query using nullif.

SELECT nullif(mycol,0) from mytable;

CREATE VIEW fails when nullif is used in the SELECT.

CREATE VIEW my_view as
SELECT nullif(mycol,0) from mytable;

Error: java.lang.RuntimeException: Failed to analyze the canonicalized SQL: ...

I can refactor with a CASE statement and create the view successfully.

CREATE VIEW my_view as
SELECT CASE WHEN mycol = 0 THEN NULL ELSE mycol END mycol from mytable;
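
A PySpark sketch of the failure and the CASE workaround (assumes an existing 
SparkSession {{spark}}; the table here is created only for illustration, the 
ticket itself just gives the queries):

{code}
# Illustration: build a small table named like the one in the report.
spark.range(5).selectExpr("id AS mycol").write.saveAsTable("mytable")

spark.sql("SELECT nullif(mycol, 0) FROM mytable").show()   # plain SELECT works

# On 2.1.0 this fails with "Failed to analyze the canonicalized SQL":
# spark.sql("CREATE VIEW my_view AS SELECT nullif(mycol, 0) FROM mytable")

# Workaround: rewrite nullif with the equivalent CASE expression.
spark.sql("""
  CREATE VIEW my_view AS
  SELECT CASE WHEN mycol = 0 THEN NULL ELSE mycol END AS mycol FROM mytable
""")
{code}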



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19962) add DictVectorizor for DataFrame

2017-03-15 Thread yu peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926723#comment-15926723
 ] 

yu peng edited comment on SPARK-19962 at 3/15/17 6:42 PM:
--

Yeah, with one indexer performing like n mutually exclusive "indexers", I can 
get a sparse version of the output features fairly easily.

Let's say I want to train a classifier for gender based on age, country and 
hobbies. With DictVectorizor, it's going to be:

vec = DictVectorizor(sparse=True, inputColumns=['age', 'country', 'hobbies'], 
                     outputColumn='features')

df = vec.fit_transform(df)
df = Binarizer(threshold=1.0, inputCol="_", 
               outputCol="label").fit_transform(StringIndexer(inputColumn='gender', 
               outputCol='_').fit_transform(df['gender']))
rf = RandomForestClassifier(labelCol="label", featuresCol="features", 
                            numTrees=10)
rf.fit(df)




was (Author: yupbank):
yeah, with one indexer that performing like mutually exclusive  n "indexers". 
i can have a sparse version of the output  features fairly easy.

let's say if i want to train a classifer for gender based on age, country, 
hobbies. 
with DictVectorizor, it's going to be 

vec = DictVectorizor(sparse=True)

df['fetures'] = vec.fit_transform(df[['age', 'country', 'hobbies']])
df['label'] = 
OnehotEncoder().fit_transform(StringIndexer().fit_transform(df['gender'])
rf = RandomForestClassifier(labelCol="label", featuresCol="fetures", 
numTrees=10)
rf.fit(df)



> add DictVectorizor for DataFrame
> 
>
> Key: SPARK-19962
> URL: https://issues.apache.org/jira/browse/SPARK-19962
> Project: Spark
>  Issue Type: Wish
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: yu peng
>  Labels: features
>
> it's really useful to have something like 
> sklearn.feature_extraction.DictVectorizor
> Since out features lives in json/data frame like format and 
> classifier/regressors only take vector input. so there is a gap between them.
> something like 
> ```
> df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', 
> hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us',  
> hobbies=['sing']), ])
> df.show()
> |age|gender|country|hobbies|
> |1|male|cn|[sing, dance]|
> |3|female|us|[sing]|
> import DictVectorizor
> vec = DictVectorizor()
> matrix = vec.fit_transform(df)
> matrix.show()
> |features|
> |[1, 0, 1, 0, 1, 1, 1]|
> |[3, 1, 0, 1, 0, 1, 1]|
> vec.show()
> |feature_name| feature_dimension|
> |age|0|
> |gender=female|1|
> |gender=male|2|
> |country=us|3|
> |country=cn|4|
> |hobbies=sing|5|
> |hobbies=dance|6|
> ```



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19962) add DictVectorizor for DataFrame

2017-03-15 Thread yu peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926723#comment-15926723
 ] 

yu peng commented on SPARK-19962:
-

Yeah, with one indexer performing like n mutually exclusive "indexers", I can 
get a sparse version of the output features fairly easily.

Let's say I want to train a classifier for gender based on age, country and 
hobbies. With DictVectorizor, it's going to be:

vec = DictVectorizor(sparse=True)

df['features'] = vec.fit_transform(df[['age', 'country', 'hobbies']])
df['label'] = OnehotEncoder().fit_transform(StringIndexer().fit_transform(df['gender']))
rf = RandomForestClassifier(labelCol="label", featuresCol="features", 
                            numTrees=10)
rf.fit(df)



> add DictVectorizor for DataFrame
> 
>
> Key: SPARK-19962
> URL: https://issues.apache.org/jira/browse/SPARK-19962
> Project: Spark
>  Issue Type: Wish
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: yu peng
>  Labels: features
>
> it's really useful to have something like 
> sklearn.feature_extraction.DictVectorizor
> Since out features lives in json/data frame like format and 
> classifier/regressors only take vector input. so there is a gap between them.
> something like 
> ```
> df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', 
> hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us',  
> hobbies=['sing']), ])
> df.show()
> |age|gender|country|hobbies|
> |1|male|cn|[sing, dance]|
> |3|female|us|[sing]|
> import DictVectorizor
> vec = DictVectorizor()
> matrix = vec.fit_transform(df)
> matrix.show()
> |features|
> |[1, 0, 1, 0, 1, 1, 1]|
> |[3, 1, 0, 1, 0, 1, 1]|
> vec.show()
> |feature_name| feature_dimension|
> |age|0|
> |gender=female|1|
> |gender=male|2|
> |country=us|3|
> |country=cn|4|
> |hobbies=sing|5|
> |hobbies=dance|6|
> ```



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19962) add DictVectorizor for DataFrame

2017-03-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926680#comment-15926680
 ] 

Sean Owen commented on SPARK-19962:
---

You would have n indexers and n models for n different columns; you need n 
mappings no matter what. I'm missing the distinction here.
Why would you need the encodings of n different columns to be mutually 
exclusive -- if that's the point of the example?
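
For context, the per-column route being described looks roughly like this in 
pyspark.ml (column names follow the example in the description; the array-typed 
{{hobbies}} column is the part this pipeline does not cover directly, which is 
the gap the proposed DictVectorizor is aiming at):

{code}
# Existing per-column approach (sketch); `df` is the DataFrame from the description.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

stages = []
for c in ["gender", "country"]:
    stages.append(StringIndexer(inputCol=c, outputCol=c + "_idx"))
    stages.append(OneHotEncoder(inputCol=c + "_idx", outputCol=c + "_vec"))
stages.append(VectorAssembler(inputCols=["age", "gender_vec", "country_vec"],
                              outputCol="features"))

model = Pipeline(stages=stages).fit(df)
model.transform(df).select("features").show(truncate=False)
{code}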

> add DictVectorizor for DataFrame
> 
>
> Key: SPARK-19962
> URL: https://issues.apache.org/jira/browse/SPARK-19962
> Project: Spark
>  Issue Type: Wish
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: yu peng
>  Labels: features
>
> it's really useful to have something like 
> sklearn.feature_extraction.DictVectorizor
> Since out features lives in json/data frame like format and 
> classifier/regressors only take vector input. so there is a gap between them.
> something like 
> ```
> df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', 
> hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us',  
> hobbies=['sing']), ])
> df.show()
> |age|gender|country|hobbies|
> |1|male|cn|[sing, dance]|
> |3|female|us|[sing]|
> import DictVectorizor
> vec = DictVectorizor()
> matrix = vec.fit_transform(df)
> matrix.show()
> |features|
> |[1, 0, 1, 0, 1, 1, 1]|
> |[3, 1, 0, 1, 0, 1, 1]|
> vec.show()
> |feature_name| feature_dimension|
> |age|0|
> |gender=female|1|
> |gender=male|2|
> |country=us|3|
> |country=cn|4|
> |hobbies=sing|5|
> |hobbies=dance|6|
> ```



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition

2017-03-15 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-19872.

   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> UnicodeDecodeError in Pyspark on sc.textFile read with repartition
> --
>
> Key: SPARK-19872
> URL: https://issues.apache.org/jira/browse/SPARK-19872
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Mac and EC2
>Reporter: Brian Bruggeman
>Assignee: Hyukjin Kwon
> Fix For: 2.1.1, 2.2.0
>
>
> I'm receiving the following traceback:
> {code}
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 140, in _load_from_socket
> for item in serializer.load_stream(rf):
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 539, in load_stream
> yield self.loads(stream)
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 534, in loads
> return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: 
> invalid start byte
> {code}
> I created a textfile (text.txt) with standard linux newlines:
> {code}
> a
> b
> d
> e
> f
> g
> h
> i
> j
> k
> l
> {code}
> I think ran pyspark:
> {code}
> $ pyspark
> Python 2.7.13 (default, Dec 18 2016, 07:03:39)
> [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>   /_/
> Using Python version 2.7.13 (default, Dec 18 2016 07:03:39)
> SparkSession available as 'spark'.
> >>> sc.textFile('test.txt').collect()
> [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
> >>> sc.textFile('test.txt', use_unicode=False).collect()
> ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
> >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect()
> ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', 
> '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 140, in _load_from_socket
> for item in serializer.load_stream(rf):
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 539, in load_stream
> yield self.loads(stream)
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 534, in loads
> return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: 
> invalid start byte
> {code}
> This really looks like a bug in the `serializers.py` code.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition

2017-03-15 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-19872:
--

Assignee: Hyukjin Kwon

> UnicodeDecodeError in Pyspark on sc.textFile read with repartition
> --
>
> Key: SPARK-19872
> URL: https://issues.apache.org/jira/browse/SPARK-19872
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Mac and EC2
>Reporter: Brian Bruggeman
>Assignee: Hyukjin Kwon
> Fix For: 2.1.1, 2.2.0
>
>
> I'm receiving the following traceback:
> {code}
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 140, in _load_from_socket
> for item in serializer.load_stream(rf):
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 539, in load_stream
> yield self.loads(stream)
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 534, in loads
> return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: 
> invalid start byte
> {code}
> I created a textfile (text.txt) with standard linux newlines:
> {code}
> a
> b
> d
> e
> f
> g
> h
> i
> j
> k
> l
> {code}
> I then ran pyspark:
> {code}
> $ pyspark
> Python 2.7.13 (default, Dec 18 2016, 07:03:39)
> [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>   /_/
> Using Python version 2.7.13 (default, Dec 18 2016 07:03:39)
> SparkSession available as 'spark'.
> >>> sc.textFile('test.txt').collect()
> [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
> >>> sc.textFile('test.txt', use_unicode=False).collect()
> ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
> >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect()
> ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', 
> '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 140, in _load_from_socket
> for item in serializer.load_stream(rf):
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 539, in load_stream
> yield self.loads(stream)
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 534, in loads
> return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: 
> invalid start byte
> {code}
> This really looks like a bug in the `serializers.py` code.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19962) add DictVectorizor for DataFrame

2017-03-15 Thread yu peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926559#comment-15926559
 ] 

yu peng commented on SPARK-19962:
-

>Well, it's maintained by the StringIndexerModel for you.
No, I mean the structure across different columns. Say I want to map the 
third column of `[1, 0, 1, 0, 1, 1, 1]` back to `gender=male`; 
StringIndexerModel only maintains the mapping for one input column.

Yeah, any categorical value should be convertible to a string, but the idea is 
to maintain the dimension mapping across different StringIndexers.

> add DictVectorizor for DataFrame
> 
>
> Key: SPARK-19962
> URL: https://issues.apache.org/jira/browse/SPARK-19962
> Project: Spark
>  Issue Type: Wish
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: yu peng
>  Labels: features
>
> it's really useful to have something like 
> sklearn.feature_extraction.DictVectorizor
> Since out features lives in json/data frame like format and 
> classifier/regressors only take vector input. so there is a gap between them.
> something like 
> ```
> df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', 
> hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us',  
> hobbies=['sing']), ])
> df.show()
> |age|gender|country|hobbies|
> |1|male|cn|[sing, dance]|
> |3|female|us|[sing]|
> import DictVectorizor
> vec = DictVectorizor()
> matrix = vec.fit_transform(df)
> matrix.show()
> |features|
> |[1, 0, 1, 0, 1, 1, 1]|
> |[3, 1, 0, 1, 0, 1, 1]|
> vec.show()
> |feature_name| feature_dimension|
> |age|0|
> |gender=female|1|
> |gender=male|2|
> |country=us|3|
> |country=cn|4|
> |hobbies=sing|5|
> |hobbies=dance|6|
> ```



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19962) add DictVectorizor for DataFrame

2017-03-15 Thread yu peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yu peng updated SPARK-19962:

Description: 
It would be really useful to have something like 
sklearn.feature_extraction.DictVectorizor.

Since our features live in a JSON/data-frame-like format and 
classifiers/regressors only take vector input, there is a gap between them.
Something like:
```
df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', 
hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us',  
hobbies=['sing']), ])
df.show()

|age|gender|country|hobbies|
|1|male|cn|[sing, dance]|
|3|female|us|[sing]|

import DictVectorizor
vec = DictVectorizor()
matrix = vec.fit_transform(df)
matrix.show()

|features|
|[1, 0, 1, 0, 1, 1, 1]|
|[3, 1, 0, 1, 0, 1, 1]|

vec.show()
|feature_name| feature_dimension|
|age|0|
|gender=female|1|
|gender=male|2|
|country=us|3|
|country=cn|4|
|hobbies=sing|5|
|hobbies=dance|6|
```


  was:
it's really useful to have something like 
sklearn.feature_extraction.DictVectorizor

Since out features lives in json/data frame like format and 
classifier/regressors only take vector input. so there is a gap between them.
something like 
```
df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', 
hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us',  
hobbies=['sing']), ])
import DictVectorizor
vec = DictVectorizor()
matrix = vec.fit_transform(df)
matrix.show()

|features|
|[1, 0, 1, 0, 1, 1, 1]|
|[3, 1, 0, 1, 0, 1, 1]|

vec.show()
|feature_name| feature_dimension|
|age|0|
|gender=female|1|
|gender=male|2|
|country=us|3|
|country=cn|4|
|hobbies=sing|5|
|hobbies=dance|6|
```



> add DictVectorizor for DataFrame
> 
>
> Key: SPARK-19962
> URL: https://issues.apache.org/jira/browse/SPARK-19962
> Project: Spark
>  Issue Type: Wish
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: yu peng
>  Labels: features
>
> it's really useful to have something like 
> sklearn.feature_extraction.DictVectorizor
> Since out features lives in json/data frame like format and 
> classifier/regressors only take vector input. so there is a gap between them.
> something like 
> ```
> df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', 
> hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us',  
> hobbies=['sing']), ])
> df.show()
> |age|gender|country|hobbies|
> |1|male|cn|[sing, dance]|
> |3|female|us|[sing]|
> import DictVectorizor
> vec = DictVectorizor()
> matrix = vec.fit_transform(df)
> matrix.show()
> |features|
> |[1, 0, 1, 0, 1, 1, 1]|
> |[3, 1, 0, 1, 0, 1, 1]|
> vec.show()
> |feature_name| feature_dimension|
> |age|0|
> |gender=female|1|
> |gender=male|2|
> |country=us|3|
> |country=cn|4|
> |hobbies=sing|5|
> |hobbies=dance|6|
> ```



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19962) add DictVectorizor for DataFrame

2017-03-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926543#comment-15926543
 ] 

Sean Owen commented on SPARK-19962:
---

Well, it's maintained by the StringIndexerModel for you.
What's your non-integer, non-string use case that can't be converted to a 
string but is categorical?

> add DictVectorizor for DataFrame
> 
>
> Key: SPARK-19962
> URL: https://issues.apache.org/jira/browse/SPARK-19962
> Project: Spark
>  Issue Type: Wish
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: yu peng
>  Labels: features
>
> it's really useful to have something like 
> sklearn.feature_extraction.DictVectorizor
> Since out features lives in json/data frame like format and 
> classifier/regressors only take vector input. so there is a gap between them.
> something like 
> ```
> df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', 
> hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us',  
> hobbies=['sing']), ])
> import DictVectorizor
> vec = DictVectorizor()
> matrix = vec.fit_transform(df)
> matrix.show()
> |features|
> |[1, 0, 1, 0, 1, 1, 1]|
> |[3, 1, 0, 1, 0, 1, 1]|
> vec.show()
> |feature_name| feature_dimension|
> |age|0|
> |gender=female|1|
> |gender=male|2|
> |country=us|3|
> |country=cn|4|
> |hobbies=sing|5|
> |hobbies=dance|6|
> ```



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19962) add DictVectorizor for DataFrame

2017-03-15 Thread yu peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926537#comment-15926537
 ] 

yu peng commented on SPARK-19962:
-

Yeah, StringIndexer only works on a single column, and only on string types.

The relationship between columns and feature names needs to be maintained by some additional data structure.

^ I have updated the description again to add some list-valued features.

> add DictVectorizor for DataFrame
> 
>
> Key: SPARK-19962
> URL: https://issues.apache.org/jira/browse/SPARK-19962
> Project: Spark
>  Issue Type: Wish
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: yu peng
>  Labels: features
>
> it's really useful to have something like 
> sklearn.feature_extraction.DictVectorizor
> Since out features lives in json/data frame like format and 
> classifier/regressors only take vector input. so there is a gap between them.
> something like 
> ```
> df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', 
> hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us',  
> hobbies=['sing']), ])
> import DictVectorizor
> vec = DictVectorizor()
> matrix = vec.fit_transform(df)
> matrix.show()
> |features|
> |[1, 0, 1, 0, 1, 1, 1]|
> |[3, 1, 0, 1, 0, 1, 1]|
> vec.show()
> |feature_name| feature_dimension|
> |age|0|
> |gender=female|1|
> |gender=male|2|
> |country=us|3|
> |country=cn|4|
> |hobbies=sing|5|
> |hobbies=dance|6|
> ```



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19962) add DictVectorizor for DataFrame

2017-03-15 Thread yu peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yu peng updated SPARK-19962:

Description: 
It's really useful to have something like sklearn.feature_extraction.DictVectorizor.

Our features live in a JSON/DataFrame-like format while classifiers/regressors only take vector input, so there is a gap between them.
Something like:
```
df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', 
hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us',  
hobbies=['sing']), ])
import DictVectorizor
vec = DictVectorizor()
matrix = vec.fit_transform(df)
matrix.show()

|features|
|[1, 0, 1, 0, 1, 1, 1]|
|[3, 1, 0, 1, 0, 1, 1]|

vec.show()
|feature_name| feature_dimension|
|age|0|
|gender=female|1|
|gender=male|2|
|country=us|3|
|country=cn|4|
|hobbies=sing|5|
|hobbies=dance|6|
```


  was:
It's really useful to have something like sklearn.feature_extraction.DictVectorizor.

Our features live in a JSON/DataFrame-like format while classifiers/regressors only take vector input, so there is a gap between them.
Something like:
```
df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn'),Row(age=3, 
gender='female', country='us'), ])
import DictVectorizor
vec = DictVectorizor()
matrix = vec.fit_transform(df)
matrix.show()

|features|
|[1, 0, 1, 0, 1]|
|[3, 1, 0, 1, 0]|

vec.show()
|feature_name| feature_dimension|
|age|0|
|gender=female|1|
|gender=male|2|
|country=us|3|
|country=cn|4|
```



> add DictVectorizor for DataFrame
> 
>
> Key: SPARK-19962
> URL: https://issues.apache.org/jira/browse/SPARK-19962
> Project: Spark
>  Issue Type: Wish
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: yu peng
>  Labels: features
>
> it's really useful to have something like 
> sklearn.feature_extraction.DictVectorizor
> Since out features lives in json/data frame like format and 
> classifier/regressors only take vector input. so there is a gap between them.
> something like 
> ```
> df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', 
> hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us',  
> hobbies=['sing']), ])
> import DictVectorizor
> vec = DictVectorizor()
> matrix = vec.fit_transform(df)
> matrix.show()
> |features|
> |[1, 0, 1, 0, 1, 1, 1]|
> |[3, 1, 0, 1, 0, 1, 1]|
> vec.show()
> |feature_name| feature_dimension|
> |age|0|
> |gender=female|1|
> |gender=male|2|
> |country=us|3|
> |country=cn|4|
> |hobbies=sing|5|
> |hobbies=dance|6|
> ```



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19962) add DictVectorizor for DataFrame

2017-03-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926518#comment-15926518
 ] 

Sean Owen commented on SPARK-19962:
---

That sounds like what 
https://spark.apache.org/docs/latest/ml-features.html#stringindexer does

> add DictVectorizor for DataFrame
> 
>
> Key: SPARK-19962
> URL: https://issues.apache.org/jira/browse/SPARK-19962
> Project: Spark
>  Issue Type: Wish
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: yu peng
>  Labels: features
>
> it's really useful to have something like 
> sklearn.feature_extraction.DictVectorizor
> Since out features lives in json/data frame like format and 
> classifier/regressors only take vector input. so there is a gap between them.
> something like 
> ```
> df = sqlCtx.createDataFrame([Row(age=1, gender='male', 
> country='cn'),Row(age=3, gender='female', country='us'), ])
> import DictVectorizor
> vec = DictVectorizor()
> matrix = vec.fit_transform(df)
> matrix.show()
> |features|
> |[1, 0, 1, 0, 1]|
> |[3, 1, 0, 1, 0]|
> vec.show()
> |feature_name| feature_dimension|
> |age|0|
> |gender=female|1|
> |gender=male|2|
> |country=us|3|
> |country=cn|4|
> ```



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19962) add DictVectorizor for DataFrame

2017-03-15 Thread yu peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926458#comment-15926458
 ] 

yu peng commented on SPARK-19962:
-

^ I have updated the description.

> add DictVectorizor for DataFrame
> 
>
> Key: SPARK-19962
> URL: https://issues.apache.org/jira/browse/SPARK-19962
> Project: Spark
>  Issue Type: Wish
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: yu peng
>  Labels: features
>
> it's really useful to have something like 
> sklearn.feature_extraction.DictVectorizor
> Since out features lives in json/data frame like format and 
> classifier/regressors only take vector input. so there is a gap between them.
> something like 
> ```
> df = sqlCtx.createDataFrame([Row(age=1, gender='male', 
> country='cn'),Row(age=3, gender='female', country='us'), ])
> import DictVectorizor
> vec = DictVectorizor()
> matrix = vec.fit_transform(df)
> matrix.show()
> |features|
> |[1, 0, 1, 0, 1]|
> |[3, 1, 0, 1, 0]|
> vec.show()
> |feature_name| feature_dimension|
> |age|0|
> |gender=female|1|
> |gender=male|2|
> |country=us|3|
> |country=cn|4|
> ```



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19962) add DictVectorizor for DataFrame

2017-03-15 Thread yu peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yu peng updated SPARK-19962:

Description: 
It's really useful to have something like sklearn.feature_extraction.DictVectorizor.

Our features live in a JSON/DataFrame-like format while classifiers/regressors only take vector input, so there is a gap between them.
Something like:
```
df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn'),Row(age=3, 
gender='female', country='us'), ])
import DictVectorizor
vec = DictVectorizor()
matrix = vec.fit_transform(df)
matrix.show()

|features|
|[1, 0, 1, 0, 1]|
|[3, 1, 0, 1, 0]|

vec.show()
|feature_name| feature_dimension|
|age|0|
|gender=female|1|
|gender=male|2|
|country=us|3|
|country=cn|4|
```


  was:
It's really useful to have something like sklearn.feature_extraction.DictVectorizor.

Our features live in a JSON/DataFrame-like format while classifiers/regressors only take vector input, so there is a gap between them.


> add DictVectorizor for DataFrame
> 
>
> Key: SPARK-19962
> URL: https://issues.apache.org/jira/browse/SPARK-19962
> Project: Spark
>  Issue Type: Wish
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: yu peng
>  Labels: features
>
> it's really useful to have something like 
> sklearn.feature_extraction.DictVectorizor
> Since out features lives in json/data frame like format and 
> classifier/regressors only take vector input. so there is a gap between them.
> something like 
> ```
> df = sqlCtx.createDataFrame([Row(age=1, gender='male', 
> country='cn'),Row(age=3, gender='female', country='us'), ])
> import DictVectorizor
> vec = DictVectorizor()
> matrix = vec.fit_transform(df)
> matrix.show()
> |features|
> |[1, 0, 1, 0, 1]|
> |[3, 1, 0, 1, 0]|
> vec.show()
> |feature_name| feature_dimension|
> |age|0|
> |gender=female|1|
> |gender=male|2|
> |country=us|3|
> |country=cn|4|
> ```



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.

2017-03-15 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926403#comment-15926403
 ] 

Imran Rashid edited comment on SPARK-12297 at 3/15/17 3:53 PM:
---

To expand on the original description-- Spark copied Hive's behavior for 
parquet, but this was inconsistent with Impala (which is the original source of 
putting a timestamp as an int96 in parquet, I believe), and inconsistent with 
other file formats.  This made timestamps in parquet act more like timestamps 
with time zones, while in other file formats timestamps have no time zone; they are a "floating time".

The easiest way to see this issue is to write out a table with timestamps in 
multiple different formats from one timezone, then try to read them back in 
another timezone. E.g., here I write out a few timestamps to parquet and
textfile hive tables, and also just as a json file, all in the 
"America/Los_Angeles" timezone:

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val tblPrefix = args(0)
val schema = new StructType().add("ts", TimestampType)
val rows = sc.parallelize(Seq(
  "2015-12-31 23:50:59.123",
  "2015-12-31 22:49:59.123",
  "2016-01-01 00:39:59.123",
  "2016-01-01 01:29:59.123"
).map { x => Row(java.sql.Timestamp.valueOf(x)) })
val rawData = spark.createDataFrame(rows, schema).toDF()

rawData.show()

Seq("parquet", "textfile").foreach { format =>
  val tblName = s"${tblPrefix}_$format"
  spark.sql(s"DROP TABLE IF EXISTS $tblName")
  spark.sql(
raw"""CREATE TABLE $tblName (
  |  ts timestamp
  | )
  | STORED AS $format
 """.stripMargin)
  rawData.write.insertInto(tblName)
}

rawData.write.json(s"${tblPrefix}_json")
{code}

Then I start a spark-shell in "America/New_York" timezone, and read the data 
back from each table:

{code}
scala> spark.sql("select * from la_parquet").collect().foreach{println}
[2016-01-01 02:50:59.123]
[2016-01-01 01:49:59.123]
[2016-01-01 03:39:59.123]
[2016-01-01 04:29:59.123]

scala> spark.sql("select * from la_textfile").collect().foreach{println}
[2015-12-31 23:50:59.123]
[2015-12-31 22:49:59.123]
[2016-01-01 00:39:59.123]
[2016-01-01 01:29:59.123]

scala> spark.read.json("la_json").collect().foreach{println}
[2015-12-31 23:50:59.123]
[2015-12-31 22:49:59.123]
[2016-01-01 00:39:59.123]
[2016-01-01 01:29:59.123]

scala> spark.read.json("la_json").join(spark.sql("select * from la_textfile"), 
"ts").show()
++
|  ts|
++
|2015-12-31 23:50:...|
|2015-12-31 22:49:...|
|2016-01-01 00:39:...|
|2016-01-01 01:29:...|
++

scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), 
"ts").show()
+---+
| ts|
+---+
+---+
{code}

The textfile and json based data shows the same times, and can be joined 
against each other, while the times from the parquet data have changed (and 
obviously joins fail).

This is a big problem for any organization that may try to read the same data 
(say in S3) with clusters in multiple timezones.  It can also be a nasty 
surprise as an organization tries to migrate file formats. Finally, it's a source of incompatibility between Hive, Impala, and Spark.

HIVE-12767 aims to fix this by introducing a table property which indicates the 
"storage timezone" for the table.  Spark should add the same to ensure 
consistency between file formats and with Hive & Impala.

[~rdblue] unless you have any objections I'll update the bug description to 
reflect this.


was (Author: irashid):
To expand on the original description-- Spark copied Hive's behavior for 
parquet, but this was inconsistent with Impala (which is the original source of 
putting a timestamp as an int96 in parquet, I believe), and inconsistent with 
other file formats.  This made timestamps in parquet act more like timestamps 
with timezones, while in other file formats, timestamps have no time zone, they 
are a "floating time".

The easiest way to see this issue is to write out a table with timestamps in 
multiple different formats from one timezone, then try to read them back in 
another timezone.  Eg., here I write out a few timestamps to parquet and 
textfile hive tables, and also just as a json file, all in the 
"America/Los_Angeles" timezone:

{code}

{code}

Then I start a spark-shell in "America/New_York" timezone, and read the data 
back from each table:

{code}
scala> spark.sql("select * from la_parquet").collect().foreach{println}
[2016-01-01 02:50:59.123]
[2016-01-01 01:49:59.123]
[2016-01-01 03:39:59.123]
[2016-01-01 04:29:59.123]

scala> spark.sql("select * from la_textfile").collect().foreach{println}
[2015-12-31 23:50:59.123]
[2015-12-31 22:49:59.123]
[2016-01-01 00:39:59.123]
[2016-01-01 01:29:59.123]

scala> spark.read.json("la_json").collect().foreach{println}
[2015-12-31 23:50:59.123]
[2015-12-31 22:49:59.123]
[2016-01-01 

[jira] [Closed] (SPARK-19957) Inconsist KMeans initialization mode behavior between ML and MLlib

2017-03-15 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang closed SPARK-19957.
--
Resolution: Not A Problem

> Inconsist KMeans initialization mode behavior between ML and MLlib
> --
>
> Key: SPARK-19957
> URL: https://issues.apache.org/jira/browse/SPARK-19957
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: yuhao yang
>Priority: Minor
>
> when users set the initialization mode to "random", KMeans in ML and MLlib 
> has inconsistent behavior for multiple runs:
> MLlib will basically use new Random for each run.
> ML Kmeans however will use the default random seed, which is 
> {code}this.getClass.getName.hashCode.toLong{code}, and keep using the same 
> number among multiple fitting.
> I would expect the "random" initialization mode to be literally random. 
> There're different solutions with different scope of impact. Adjusting the 
> hasSeed trait may have a broader impact(but maybe worth discussion). We can 
> always just set random default seed in KMeans. 
> Appreciate your feedback.
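
As a user-side sketch of getting literally random behaviour today (not the library-side change proposed above, which is about the default seed inside KMeans itself): pass a fresh seed explicitly on every fit. The k value and the `dataset` variable below are placeholders.

{code}
import org.apache.spark.ml.clustering.KMeans

// Passing a new seed for every fit makes "random" init vary across runs,
// instead of reusing the fixed class-name-based default seed.
val kmeans = new KMeans()
  .setK(10)                               // placeholder k
  .setInitMode("random")
  .setSeed(scala.util.Random.nextLong())  // fresh seed per run
val model = kmeans.fit(dataset)           // dataset: DataFrame with a "features" column
{code}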



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.

2017-03-15 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926403#comment-15926403
 ] 

Imran Rashid commented on SPARK-12297:
--

To expand on the original description-- Spark copied Hive's behavior for 
parquet, but this was inconsistent with Impala (which is the original source of 
putting a timestamp as an int96 in parquet, I believe), and inconsistent with 
other file formats.  This made timestamps in parquet act more like timestamps 
with time zones, while in other file formats timestamps have no time zone; they are a "floating time".

The easiest way to see this issue is to write out a table with timestamps in 
multiple different formats from one timezone, then try to read them back in 
another timezone. E.g., here I write out a few timestamps to parquet and
textfile hive tables, and also just as a json file, all in the 
"America/Los_Angeles" timezone:

{code}

{code}

Then I start a spark-shell in "America/New_York" timezone, and read the data 
back from each table:

{code}
scala> spark.sql("select * from la_parquet").collect().foreach{println}
[2016-01-01 02:50:59.123]
[2016-01-01 01:49:59.123]
[2016-01-01 03:39:59.123]
[2016-01-01 04:29:59.123]

scala> spark.sql("select * from la_textfile").collect().foreach{println}
[2015-12-31 23:50:59.123]
[2015-12-31 22:49:59.123]
[2016-01-01 00:39:59.123]
[2016-01-01 01:29:59.123]

scala> spark.read.json("la_json").collect().foreach{println}
[2015-12-31 23:50:59.123]
[2015-12-31 22:49:59.123]
[2016-01-01 00:39:59.123]
[2016-01-01 01:29:59.123]

scala> spark.read.json("la_json").join(spark.sql("select * from la_textfile"), 
"ts").show()
++
|  ts|
++
|2015-12-31 23:50:...|
|2015-12-31 22:49:...|
|2016-01-01 00:39:...|
|2016-01-01 01:29:...|
++

scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), 
"ts").show()
+---+
| ts|
+---+
+---+
{code}

The textfile and json based data shows the same times, and can be joined 
against each other, while the times from the parquet data have changed (and 
obviously joins fail).

This is a big problem for any organization that may try to read the same data 
(say in S3) with clusters in multiple timezones.  It can also be a nasty 
surprise as an organization tries to migrate file formats. Finally, it's a source of incompatibility between Hive, Impala, and Spark.

HIVE-12767 aims to fix this by introducing a table property which indicates the 
"storage timezone" for the table.  Spark should add the same to ensure 
consistency between file formats and with Hive & Impala.

[~rdblue] unless you have any objections I'll update the bug description to 
reflect this.
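
In case it helps anyone reproducing this: one way to get a second shell in a different timezone without changing the OS settings is to override the JVM default timezone. The flags below are just the generic extraJavaOptions settings, mentioned here only as a convenience, not anything specific to this ticket.

{code}
./bin/spark-shell \
  --conf "spark.driver.extraJavaOptions=-Duser.timezone=America/New_York" \
  --conf "spark.executor.extraJavaOptions=-Duser.timezone=America/New_York"
{code}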

> Add work-around for Parquet/Hive int96 timestamp bug.
> -
>
> Key: SPARK-12297
> URL: https://issues.apache.org/jira/browse/SPARK-12297
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Ryan Blue
>
> Hive has a bug where timestamps in Parquet data are incorrectly adjusted as 
> though they were in the SQL session time zone to UTC. This is incorrect 
> behavior because timestamp values are SQL timestamp without time zone and 
> should not be internally changed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19957) Inconsist KMeans initialization mode behavior between ML and MLlib

2017-03-15 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926404#comment-15926404
 ] 

yuhao yang commented on SPARK-19957:


Thanks for the response.

> Inconsist KMeans initialization mode behavior between ML and MLlib
> --
>
> Key: SPARK-19957
> URL: https://issues.apache.org/jira/browse/SPARK-19957
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: yuhao yang
>Priority: Minor
>
> when users set the initialization mode to "random", KMeans in ML and MLlib 
> has inconsistent behavior for multiple runs:
> MLlib will basically use new Random for each run.
> ML Kmeans however will use the default random seed, which is 
> {code}this.getClass.getName.hashCode.toLong{code}, and keep using the same 
> number among multiple fitting.
> I would expect the "random" initialization mode to be literally random. 
> There're different solutions with different scope of impact. Adjusting the 
> hasSeed trait may have a broader impact(but maybe worth discussion). We can 
> always just set random default seed in KMeans. 
> Appreciate your feedback.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19962) add DictVectorizor for DataFrame

2017-03-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926393#comment-15926393
 ] 

Sean Owen commented on SPARK-19962:
---

Can you describe what this does, and give an example of input/output? I'm not 
sure what this describes or whether it already exists in Spark.

> add DictVectorizor for DataFrame
> 
>
> Key: SPARK-19962
> URL: https://issues.apache.org/jira/browse/SPARK-19962
> Project: Spark
>  Issue Type: Wish
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: yu peng
>  Labels: features
>
> it's really useful to have something like 
> sklearn.feature_extraction.DictVectorizor
> Since out features lives in json/data frame like format and 
> classifier/regressors only take vector input. so there is a gap between them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19962) add DictVectorizor for DataFrame

2017-03-15 Thread yu peng (JIRA)
yu peng created SPARK-19962:
---

 Summary: add DictVectorizor for DataFrame
 Key: SPARK-19962
 URL: https://issues.apache.org/jira/browse/SPARK-19962
 Project: Spark
  Issue Type: Wish
  Components: ML
Affects Versions: 2.1.0
Reporter: yu peng


It's really useful to have something like sklearn.feature_extraction.DictVectorizor.

Our features live in a JSON/DataFrame-like format while classifiers/regressors only take vector input, so there is a gap between them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19961) unify a exception erro msg for dropdatabase

2017-03-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19961:
--
Affects Version/s: (was: 2.2.0)
   2.1.0
 Priority: Trivial  (was: Minor)
   Issue Type: Improvement  (was: Bug)

> unify a exception erro msg for dropdatabase
> ---
>
> Key: SPARK-19961
> URL: https://issues.apache.org/jira/browse/SPARK-19961
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Song Jun
>Priority: Trivial
>
> unify a exception erro msg for dropdatabase when the database still have some 
> tables.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19961) unify a exception erro msg for dropdatabase

2017-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19961:


Assignee: (was: Apache Spark)

> unify a exception erro msg for dropdatabase
> ---
>
> Key: SPARK-19961
> URL: https://issues.apache.org/jira/browse/SPARK-19961
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Priority: Minor
>
> unify a exception erro msg for dropdatabase when the database still have some 
> tables.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19961) unify a exception erro msg for dropdatabase

2017-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926304#comment-15926304
 ] 

Apache Spark commented on SPARK-19961:
--

User 'windpiger' has created a pull request for this issue:
https://github.com/apache/spark/pull/17305

> unify a exception erro msg for dropdatabase
> ---
>
> Key: SPARK-19961
> URL: https://issues.apache.org/jira/browse/SPARK-19961
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Priority: Minor
>
> unify a exception erro msg for dropdatabase when the database still have some 
> tables.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19961) unify a exception erro msg for dropdatabase

2017-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19961:


Assignee: Apache Spark

> unify a exception erro msg for dropdatabase
> ---
>
> Key: SPARK-19961
> URL: https://issues.apache.org/jira/browse/SPARK-19961
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Assignee: Apache Spark
>Priority: Minor
>
> unify a exception erro msg for dropdatabase when the database still have some 
> tables.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19961) unify a exception erro msg for dropdatabase

2017-03-15 Thread Song Jun (JIRA)
Song Jun created SPARK-19961:


 Summary: unify a exception erro msg for dropdatabase
 Key: SPARK-19961
 URL: https://issues.apache.org/jira/browse/SPARK-19961
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun
Priority: Minor


Unify the exception error message for DROP DATABASE when the database still has some tables.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19944) Move SQLConf from sql/core to sql/catalyst

2017-03-15 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-19944:
--
Fix Version/s: 2.1.1

> Move SQLConf from sql/core to sql/catalyst
> --
>
> Key: SPARK-19944
> URL: https://issues.apache.org/jira/browse/SPARK-19944
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.1, 2.2.0
>
>
> It is pretty weird to have SQLConf only in sql/core and then we have to 
> duplicate config options that impact optimizer/analyzer in sql/catalyst using 
> CatalystConf. This ticket moves SQLConf into sql/catalyst.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19960) Move `SparkHadoopWriter` to `internal/io/`

2017-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19960:


Assignee: (was: Apache Spark)

> Move `SparkHadoopWriter` to `internal/io/`
> --
>
> Key: SPARK-19960
> URL: https://issues.apache.org/jira/browse/SPARK-19960
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Jiang Xingbo
>
> We should move `SparkHadoopWriter` to `internal/io/`, that will make it 
> easier to consolidate `SparkHadoopWriter` and `SparkHadoopMapReduceWriter`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19960) Move `SparkHadoopWriter` to `internal/io/`

2017-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926014#comment-15926014
 ] 

Apache Spark commented on SPARK-19960:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/17304

> Move `SparkHadoopWriter` to `internal/io/`
> --
>
> Key: SPARK-19960
> URL: https://issues.apache.org/jira/browse/SPARK-19960
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Jiang Xingbo
>
> We should move `SparkHadoopWriter` to `internal/io/`, that will make it 
> easier to consolidate `SparkHadoopWriter` and `SparkHadoopMapReduceWriter`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19960) Move `SparkHadoopWriter` to `internal/io/`

2017-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19960:


Assignee: Apache Spark

> Move `SparkHadoopWriter` to `internal/io/`
> --
>
> Key: SPARK-19960
> URL: https://issues.apache.org/jira/browse/SPARK-19960
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Jiang Xingbo
>Assignee: Apache Spark
>
> We should move `SparkHadoopWriter` to `internal/io/`, that will make it 
> easier to consolidate `SparkHadoopWriter` and `SparkHadoopMapReduceWriter`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19960) Move `SparkHadoopWriter` to `internal/io/`

2017-03-15 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-19960:


 Summary: Move `SparkHadoopWriter` to `internal/io/`
 Key: SPARK-19960
 URL: https://issues.apache.org/jira/browse/SPARK-19960
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Jiang Xingbo


We should move `SparkHadoopWriter` to `internal/io/`; that will make it easier to consolidate `SparkHadoopWriter` and `SparkHadoopMapReduceWriter`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19112) add codec for ZStandard

2017-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925933#comment-15925933
 ] 

Apache Spark commented on SPARK-19112:
--

User 'dongjinleekr' has created a pull request for this issue:
https://github.com/apache/spark/pull/17303

> add codec for ZStandard
> ---
>
> Key: SPARK-19112
> URL: https://issues.apache.org/jira/browse/SPARK-19112
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Thomas Graves
>Priority: Minor
>
> ZStandard: https://github.com/facebook/zstd and 
> http://facebook.github.io/zstd/ has been in use for a while now. v1.0 was 
> recently released. Hadoop 
> (https://issues.apache.org/jira/browse/HADOOP-13578) and others 
> (https://issues.apache.org/jira/browse/KAFKA-4514) are adopting it.
> Zstd seems to give great results => Gzip level Compression with Lz4 level CPU.
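
For the curious, a rough sketch of the shape such a codec could take, mirroring the existing LZ4/Snappy codecs and the zstd-jni streaming classes. The class name and the config key below are assumptions, not necessarily what the linked pull request actually does.

{code}
import java.io.{InputStream, OutputStream}
import com.github.luben.zstd.{ZstdInputStream, ZstdOutputStream}
import org.apache.spark.SparkConf
import org.apache.spark.io.CompressionCodec

// Hypothetical codec class; it would be plugged in via spark.io.compression.codec.
class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {

  // Assumed config key for the compression level (zstd supports 1..22).
  private val level = conf.getInt("spark.io.compression.zstd.level", 1)

  override def compressedOutputStream(s: OutputStream): OutputStream =
    new ZstdOutputStream(s, level)

  override def compressedInputStream(s: InputStream): InputStream =
    new ZstdInputStream(s)
}
{code}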



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19112) add codec for ZStandard

2017-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19112:


Assignee: Apache Spark

> add codec for ZStandard
> ---
>
> Key: SPARK-19112
> URL: https://issues.apache.org/jira/browse/SPARK-19112
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Thomas Graves
>Assignee: Apache Spark
>Priority: Minor
>
> ZStandard: https://github.com/facebook/zstd and 
> http://facebook.github.io/zstd/ has been in use for a while now. v1.0 was 
> recently released. Hadoop 
> (https://issues.apache.org/jira/browse/HADOOP-13578) and others 
> (https://issues.apache.org/jira/browse/KAFKA-4514) are adopting it.
> Zstd seems to give great results => Gzip level Compression with Lz4 level CPU.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19112) add codec for ZStandard

2017-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19112:


Assignee: (was: Apache Spark)

> add codec for ZStandard
> ---
>
> Key: SPARK-19112
> URL: https://issues.apache.org/jira/browse/SPARK-19112
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Thomas Graves
>Priority: Minor
>
> ZStandard: https://github.com/facebook/zstd and 
> http://facebook.github.io/zstd/ has been in use for a while now. v1.0 was 
> recently released. Hadoop 
> (https://issues.apache.org/jira/browse/HADOOP-13578) and others 
> (https://issues.apache.org/jira/browse/KAFKA-4514) are adopting it.
> Zstd seems to give great results => Gzip level Compression with Lz4 level CPU.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19959) df[java.lang.Long].collect throws NullPointerException if df includes null

2017-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19959:


Assignee: (was: Apache Spark)

> df[java.lang.Long].collect throws NullPointerException if df includes null
> --
>
> Key: SPARK-19959
> URL: https://issues.apache.org/jira/browse/SPARK-19959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> The following program throws {{NullPointerException}} during the execution of 
> Java code generated by the wholestage codegen.
> {code:java}
> sparkContext.parallelize(Seq[java.lang.Long](0L, null, 2L), 1).toDF.collect
> {code}
> {code}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:37)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:394)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19959) df[java.lang.Long].collect throws NullPointerException if df includes null

2017-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925873#comment-15925873
 ] 

Apache Spark commented on SPARK-19959:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/17302

> df[java.lang.Long].collect throws NullPointerException if df includes null
> --
>
> Key: SPARK-19959
> URL: https://issues.apache.org/jira/browse/SPARK-19959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> The following program throws {{NullPointerException}} during the execution of 
> Java code generated by the wholestage codegen.
> {code:java}
> sparkContext.parallelize(Seq[java.lang.Long](0L, null, 2L), 1).toDF.collect
> {code}
> {code}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:37)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:394)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19959) df[java.lang.Long].collect throws NullPointerException if df includes null

2017-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19959:


Assignee: Apache Spark

> df[java.lang.Long].collect throws NullPointerException if df includes null
> --
>
> Key: SPARK-19959
> URL: https://issues.apache.org/jira/browse/SPARK-19959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> The following program throws {{NullPointerException}} during the execution of 
> Java code generated by the wholestage codegen.
> {code:java}
> sparkContext.parallelize(Seq[java.lang.Long](0L, null, 2L), 1).toDF.collect
> {code}
> {code}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:37)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:394)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19946) DebugFilesystem.assertNoOpenStreams should report the open streams to help debugging

2017-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19946:


Assignee: (was: Apache Spark)

> DebugFilesystem.assertNoOpenStreams should report the open streams to help 
> debugging
> 
>
> Key: SPARK-19946
> URL: https://issues.apache.org/jira/browse/SPARK-19946
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Bogdan Raducanu
>Priority: Minor
>
> In DebugFilesystem.assertNoOpenStreams if there are open streams an exception 
> is thrown showing the number of open streams. This doesn't help much to debug 
> where the open streams were leaked.
> The exception should also report where the stream was leaked. This can be 
> done through a cause exception.
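
A small illustrative sketch of the idea (not the actual DebugFilesystem code): capture a Throwable when a stream is opened and attach it as the cause of the assertion failure, so the leak site shows up in the stack trace.

{code}
import scala.collection.concurrent.TrieMap

object OpenStreamTracker {
  // Maps each open stream to a Throwable recording where it was opened.
  private val openStreams = TrieMap[AnyRef, Throwable]()

  def onOpen(stream: AnyRef): Unit =
    openStreams.put(stream, new Throwable("stream opened here"))

  def onClose(stream: AnyRef): Unit =
    openStreams.remove(stream)

  def assertNoOpenStreams(): Unit = {
    val leaked = openStreams.values.toSeq
    if (leaked.nonEmpty) {
      // The first recorded open site is reported as the cause.
      throw new IllegalStateException(
        s"There are ${leaked.size} possibly leaked file streams.", leaked.head)
    }
  }
}
{code}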



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19946) DebugFilesystem.assertNoOpenStreams should report the open streams to help debugging

2017-03-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19946:


Assignee: Apache Spark

> DebugFilesystem.assertNoOpenStreams should report the open streams to help 
> debugging
> 
>
> Key: SPARK-19946
> URL: https://issues.apache.org/jira/browse/SPARK-19946
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Bogdan Raducanu
>Assignee: Apache Spark
>Priority: Minor
>
> In DebugFilesystem.assertNoOpenStreams if there are open streams an exception 
> is thrown showing the number of open streams. This doesn't help much to debug 
> where the open streams were leaked.
> The exception should also report where the stream was leaked. This can be 
> done through a cause exception.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19946) DebugFilesystem.assertNoOpenStreams should report the open streams to help debugging

2017-03-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925856#comment-15925856
 ] 

Apache Spark commented on SPARK-19946:
--

User 'bogdanrdc' has created a pull request for this issue:
https://github.com/apache/spark/pull/17292

> DebugFilesystem.assertNoOpenStreams should report the open streams to help 
> debugging
> 
>
> Key: SPARK-19946
> URL: https://issues.apache.org/jira/browse/SPARK-19946
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Bogdan Raducanu
>Priority: Minor
>
> In DebugFilesystem.assertNoOpenStreams if there are open streams an exception 
> is thrown showing the number of open streams. This doesn't help much to debug 
> where the open streams were leaked.
> The exception should also report where the stream was leaked. This can be 
> done through a cause exception.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18591) Replace hash-based aggregates with sort-based ones if inputs already sorted

2017-03-15 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925841#comment-15925841
 ] 

Herman van Hovell commented on SPARK-18591:
---

[~maropu] I don't think the current planning (of aggregates) is optimal - I 
think your previous work on merging partial aggregates highlights this. I think 
we should move back to bottom-up planning. This is also very relevant for the 
current CBO work.

cc [~cloud_fan] We discussed something like this a while back
cc [~ueshin] You have implemented the current top-down work. Is there an easy 
way to move back to bottom up?

> Replace hash-based aggregates with sort-based ones if inputs already sorted
> ---
>
> Key: SPARK-18591
> URL: https://issues.apache.org/jira/browse/SPARK-18591
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Takeshi Yamamuro
>
> Spark currently uses sort-based aggregates only in limited condition; the 
> cases where spark cannot use partial aggregates and hash-based ones.
> However, if input ordering has already satisfied the requirements of 
> sort-based aggregates, it seems sort-based ones are faster than the other.
> {code}
> ./bin/spark-shell --conf spark.sql.shuffle.partitions=1
> val df = spark.range(1000).selectExpr("id AS key", "id % 10 AS 
> value").sort($"key").cache
> def timer[R](block: => R): R = {
>   val t0 = System.nanoTime()
>   val result = block
>   val t1 = System.nanoTime()
>   println("Elapsed time: " + ((t1 - t0 + 0.0) / 10.0)+ "s")
>   result
> }
> timer {
>   df.groupBy("key").count().count
> }
> // codegen'd hash aggregate
> Elapsed time: 7.116962977s
> // non-codegen'd sort aggregarte
> Elapsed time: 3.088816662s
> {code}
> If codegen'd sort-based aggregates are supported in SPARK-16844, this seems 
> to make the performance gap bigger;
> {code}
> - codegen'd sort aggregate
> Elapsed time: 1.645234684s
> {code} 
> Therefore, it'd be better to use sort-based ones in this case.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19959) df[java.lang.Long].collect throws NullPointerException if df includes null

2017-03-15 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-19959:


 Summary: df[java.lang.Long].collect throws NullPointerException if 
df includes null
 Key: SPARK-19959
 URL: https://issues.apache.org/jira/browse/SPARK-19959
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Kazuaki Ishizaki


The following program throws {{NullPointerException}} during the execution of the Java code generated by whole-stage codegen.

{code:java}
sparkContext.parallelize(Seq[java.lang.Long](0L, null, 2L), 1).toDF.collect
{code}

{code}
Caused by: java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:37)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:394)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
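
As a possible workaround sketch until the generated code handles this (this is not the actual fix): build the DataFrame from Rows with an explicitly nullable schema instead of a Seq[java.lang.Long], which appears to avoid the failing code path in this reproduction.

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Explicitly nullable schema, so the null is handled instead of assumed absent.
val schema = StructType(Seq(StructField("value", LongType, nullable = true)))
val rows = sparkContext.parallelize(Seq(Row(0L), Row(null), Row(2L)), 1)
val df = spark.createDataFrame(rows, schema)
df.collect()
{code}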



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19889) Make TaskContext callbacks synchronized

2017-03-15 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-19889.
---
   Resolution: Fixed
 Assignee: Herman van Hovell
Fix Version/s: 2.2.0

> Make TaskContext callbacks synchronized
> ---
>
> Key: SPARK-19889
> URL: https://issues.apache.org/jira/browse/SPARK-19889
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.2.0
>
>
> In some cases you want to fork of some part of a task to a different thread. 
> In these cases it would be very useful if {{TaskContext}} callbacks are 
> synchronized and that we can synchronize on the task context.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored

2017-03-15 Thread Sergey Rubtsov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925775#comment-15925775
 ] 

Sergey Rubtsov commented on SPARK-19228:


Hi [~hyukjin.kwon], 
Updated pull request: 
https://github.com/apache/spark/pull/16735
Please take a look.
I couldn't run the CSVSuite tests locally on my Windows machine, so apologies for any possible test failures.

> inferSchema function processed csv date column as string and "dateFormat" 
> DataSource option is ignored
> --
>
> Key: SPARK-19228
> URL: https://issues.apache.org/jira/browse/SPARK-19228
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.1.0
>Reporter: Sergey Rubtsov
>  Labels: easyfix
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> I need to process user.csv like this:
> {code}
> id,project,started,ended
> sergey.rubtsov,project0,12/12/2012,10/10/2015
> {code}
> When I add date format options:
> {code}
> Dataset users = spark.read().format("csv").option("mode", 
> "PERMISSIVE").option("header", "true")
> .option("inferSchema", 
> "true").option("dateFormat", 
> "dd/MM/").load("src/main/resources/user.csv");
>   users.printSchema();
> {code}
> expected scheme should be 
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- project: string (nullable = true)
>  |-- started: date (nullable = true)
>  |-- ended: date (nullable = true)
> {code}
> but the actual result is: 
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- project: string (nullable = true)
>  |-- started: string (nullable = true)
>  |-- ended: string (nullable = true)
> {code}
> This mean that date processed as string and "dateFormat" option is ignored.
> If I add option 
> {code}
> .option("timestampFormat", "dd/MM/")
> {code}
> result is: 
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- project: string (nullable = true)
>  |-- started: timestamp (nullable = true)
>  |-- ended: timestamp (nullable = true)
> {code}
> I think the issue is somewhere in object CSVInferSchema, function inferField 
> (lines 80-97): a "tryParseDate" method needs to be added before/after 
> "tryParseTimestamp", or the date/timestamp processing logic needs to be changed.
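For illustration only, a sketch of the kind of "tryParseDate" step suggested above 
(the helper name and signature are assumptions, not the actual CSVInferSchema code):

{code}
// Hypothetical helper: try to parse the field as a date using the user-supplied
// dateFormat before falling back to StringType, mirroring tryParseTimestamp.
import java.text.SimpleDateFormat
import scala.util.Try
import org.apache.spark.sql.types.{DataType, DateType, StringType}

def tryParseDate(field: String, dateFormat: String): DataType = {
  val parser = new SimpleDateFormat(dateFormat)
  parser.setLenient(false)
  if (Try(parser.parse(field)).isSuccess) DateType else StringType
}

// Example: tryParseDate("12/12/2012", "dd/MM/yyyy") would return DateType.
{code}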



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.

2017-03-15 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19954.
--
Resolution: Cannot Reproduce

This seems to be fixed in the current master. That code produces the result below:

{code}
+---+----+---+----+----+----+----+
| id|name| id|name|colA|colB|colC|
+---+----+---+----+----+----+----+
|  1|   a|  1|   a|true|null|null|
|  2|   b|  2|   b|null|  10|null|
|  3|   c|  3|   c|null|null|9.73|
+---+----+---+----+----+----+----+
{code}

I am resolving this. It would be nice if someone could identify the JIRA that fixed 
this and backport it if needed.

> Joining to a unioned DataFrame does not produce expected result.
> 
>
> Key: SPARK-19954
> URL: https://issues.apache.org/jira/browse/SPARK-19954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Arun Allamsetty
>Priority: Blocker
>
> I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is 
> that when we join two DataFrames, one of which is the result of a union 
> operation, the join produces data as if the table had been joined only to the 
> first table in the union. This issue is not present in Spark 2.0.0, 2.0.1, or 
> 2.0.2; only in 2.1.0. Here's how to reproduce it.
> {noformat}
> import spark.implicits._
> import org.apache.spark.sql.functions.lit
> case class A(id: Long, colA: Boolean)
> case class B(id: Long, colB: Int)
> case class C(id: Long, colC: Double)
> case class X(id: Long, name: String)
> val aData = A(1, true) :: Nil
> val bData = B(2, 10) :: Nil
> val cData = C(3, 9.73D) :: Nil
> val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil
> val aDf = spark.createDataset(aData).toDF
> val bDf = spark.createDataset(bData).toDF
> val cDf = spark.createDataset(cData).toDF
> val xDf = spark.createDataset(xData).toDF
> val unionDf =
>   aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), 
> lit(null).as("colC")).union(
>   bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", 
> lit(null).as("colC"))).union(
>   cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), 
> lit(null).as("colB"), $"colC"))
> val result = xDf.join(unionDf, unionDf("name") === xDf("name") && 
> unionDf("id") === xDf("id"))
> result.show
> {noformat}
> The result being
> {noformat}
> +---+----+---+----+----+----+----+
> | id|name| id|name|colA|colB|colC|
> +---+----+---+----+----+----+----+
> |  1|   a|  1|   a|true|null|null|
> +---+----+---+----+----+----+----+
> {noformat}
> Forcing computation of {{unionDf}} with {{count}} does not change the result of 
> the join. However, writing the data to disk and reading it back does give the 
> correct result, although that is definitely not ideal. Interestingly, caching 
> {{unionDf}} also gives the correct result.
> {noformat}
> +---+----+---+----+----+----+----+
> | id|name| id|name|colA|colB|colC|
> +---+----+---+----+----+----+----+
> |  1|   a|  1|   a|true|null|null|
> |  2|   b|  2|   b|null|  10|null|
> |  3|   c|  3|   c|null|null|9.73|
> +---+----+---+----+----+----+----+
> {noformat}
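For reference, a minimal sketch of the caching workaround mentioned above, reusing 
the names from the reproduction:

{code}
// Workaround noted in the description: cache the unioned DataFrame before the
// join so the result contains the expected three rows on 2.1.0.
unionDf.cache()
val cachedResult = xDf.join(unionDf, unionDf("name") === xDf("name") && unionDf("id") === xDf("id"))
cachedResult.show()
{code}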



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.

2017-03-15 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925754#comment-15925754
 ] 

Takeshi Yamamuro commented on SPARK-19954:
--

It seems this issue has already been fixed in branch-2.1. So, I think it's okay to 
just wait for the next release, v2.1.1 (I'm not sure about the release plan, 
though).


> Joining to a unioned DataFrame does not produce expected result.
> 
>
> Key: SPARK-19954
> URL: https://issues.apache.org/jira/browse/SPARK-19954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Arun Allamsetty
>Priority: Blocker
>
> I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is 
> that when we join two DataFrames, one of which is the result of a union 
> operation, the join produces data as if the table had been joined only to the 
> first table in the union. This issue is not present in Spark 2.0.0, 2.0.1, or 
> 2.0.2; only in 2.1.0. Here's how to reproduce it.
> {noformat}
> import spark.implicits._
> import org.apache.spark.sql.functions.lit
> case class A(id: Long, colA: Boolean)
> case class B(id: Long, colB: Int)
> case class C(id: Long, colC: Double)
> case class X(id: Long, name: String)
> val aData = A(1, true) :: Nil
> val bData = B(2, 10) :: Nil
> val cData = C(3, 9.73D) :: Nil
> val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil
> val aDf = spark.createDataset(aData).toDF
> val bDf = spark.createDataset(bData).toDF
> val cDf = spark.createDataset(cData).toDF
> val xDf = spark.createDataset(xData).toDF
> val unionDf =
>   aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), 
> lit(null).as("colC")).union(
>   bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", 
> lit(null).as("colC"))).union(
>   cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), 
> lit(null).as("colB"), $"colC"))
> val result = xDf.join(unionDf, unionDf("name") === xDf("name") && 
> unionDf("id") === xDf("id"))
> result.show
> {noformat}
> The result being
> {noformat}
> +---+----+---+----+----+----+----+
> | id|name| id|name|colA|colB|colC|
> +---+----+---+----+----+----+----+
> |  1|   a|  1|   a|true|null|null|
> +---+----+---+----+----+----+----+
> {noformat}
> Forcing computation of {{unionDf}} with {{count}} does not change the result of 
> the join. However, writing the data to disk and reading it back does give the 
> correct result, although that is definitely not ideal. Interestingly, caching 
> {{unionDf}} also gives the correct result.
> {noformat}
> +---+----+---+----+----+----+----+
> | id|name| id|name|colA|colB|colC|
> +---+----+---+----+----+----+----+
> |  1|   a|  1|   a|true|null|null|
> |  2|   b|  2|   b|null|  10|null|
> |  3|   c|  3|   c|null|null|9.73|
> +---+----+---+----+----+----+----+
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19957) Inconsistent KMeans initialization mode behavior between ML and MLlib

2017-03-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925753#comment-15925753
 ] 

Sean Owen commented on SPARK-19957:
---

Yeah I think this might be "working as intended".

> Inconsistent KMeans initialization mode behavior between ML and MLlib
> --
>
> Key: SPARK-19957
> URL: https://issues.apache.org/jira/browse/SPARK-19957
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: yuhao yang
>Priority: Minor
>
> When users set the initialization mode to "random", KMeans in ML and MLlib 
> behave inconsistently across multiple runs:
> MLlib will basically use a new Random for each run.
> ML KMeans, however, will use the default random seed, which is 
> {code}this.getClass.getName.hashCode.toLong{code}, and will keep using the same 
> seed across multiple fits.
> I would expect the "random" initialization mode to be literally random. 
> There are different solutions with different scopes of impact. Adjusting the 
> HasSeed trait may have a broader impact (but is maybe worth discussing). We can 
> always just set a random default seed in KMeans.
> I'd appreciate your feedback.
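As a caller-side illustration (a minimal sketch assuming the standard {{ml.KMeans}} 
setters; not a proposed fix), run-to-run variation can be recovered by passing an 
explicit seed per fit:

{code}
import scala.util.Random
import org.apache.spark.ml.clustering.KMeans

// With init mode "random", vary the seed explicitly per fit; otherwise the fixed
// default seed derived from the class name is reused across fits.
val kmeans = new KMeans()
  .setK(3)
  .setInitMode("random")
  .setSeed(Random.nextLong())
// val model = kmeans.fit(dataset)  // dataset: a DataFrame with a "features" column
{code}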



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19958) Support ZStandard Compression

2017-03-15 Thread Lee Dongjin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lee Dongjin resolved SPARK-19958.
-
Resolution: Duplicate

See: https://issues.apache.org/jira/browse/SPARK-19112

> Support ZStandard Compression
> -
>
> Key: SPARK-19958
> URL: https://issues.apache.org/jira/browse/SPARK-19958
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Lee Dongjin
>
> Hadoop[^1] & HBase[^2] started to support ZStandard Compression from their 
> recent releases. Supporting this compression codec also requires adding a new 
> configuration for default compression level, for example, 
> 'spark.io.compression.zstandard.level.'
> [^1]: https://issues.apache.org/jira/browse/HADOOP-13578
> [^2]: https://issues.apache.org/jira/browse/HBASE-16710
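Purely as an illustration of the proposal (neither the codec value nor the level key 
below existed in Spark at the time; the key is the one suggested in the description):

{code}
import org.apache.spark.SparkConf

// Hypothetical configuration sketch for the proposed codec and level key.
val conf = new SparkConf()
  .set("spark.io.compression.codec", "zstd")         // assumed codec name
  .set("spark.io.compression.zstandard.level", "3")  // level key proposed above
{code}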



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19936) Page rank example takes long time to complete

2017-03-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19936.
---
Resolution: Not A Problem

Set spark.blockManager.port  ?
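For illustration, a minimal sketch of that suggestion (the port value is arbitrary):

{code}
import org.apache.spark.SparkConf

// Pin the block manager port to a known, reachable port so executors fetch remote
// blocks over it instead of an ephemeral one chosen at startup.
val conf = new SparkConf()
  .setAppName("SparkPageRank")
  .set("spark.blockManager.port", "38000")
{code}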

> Page rank example takes long time to complete
> -
>
> Key: SPARK-19936
> URL: https://issues.apache.org/jira/browse/SPARK-19936
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Mesos
>Affects Versions: 2.1.0
> Environment: CentOS 7, Mesos 1.1.0
>Reporter: Stan Teresen
> Attachments: 1.stderr, 2.stderr, pr.out
>
>
> Sometimes the PageRank example takes a very long time to finish on Mesos due to 
> exceptions while fetching remote blocks in RetryingBlockFetcher on the executor 
> side.
> As seen in the attached log files, it took ~30 minutes for the example to 
> complete in a 3-node AWS environment (1 Mesos master and 2 agents).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19895) Spark SQL could not output a correct result

2017-03-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19895.
---
Resolution: Not A Problem

> Spark SQL could not output a correct result
> ---
>
> Key: SPARK-19895
> URL: https://issues.apache.org/jira/browse/SPARK-19895
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Bin Wu
>Priority: Minor
>  Labels: beginner
>
> I'm rewriting the PageRank algorithm with Spark SQL. The following code outputs 
> a correct result for a large data set, but fails on the small data set provided 
> below.
> {code}
> from pyspark.sql.functions import *
> from pyspark.sql import SparkSession
> spark = SparkSession \
>     .builder \
>     .appName("Python Spark SQL basic example") \
>     .config("spark.some.config.option", "some-value") \
>     .getOrCreate()
> numOfIterations = 5
> lines = spark.read.text("pagerank_data.txt")
> a = lines.select(split(lines[0], ' '))
> links = a.select(a[0][0].alias('src'), a[0][1].alias('dst'))
> outdegrees = links.groupBy('src').count()
> ranks = outdegrees.select('src', lit(1).alias('rank'))
> for iteration in range(numOfIterations):
>     contribs = links.join(ranks, 'src').join(outdegrees, 'src') \
>         .select('dst', (ranks['rank'] / outdegrees['count']).alias('contrib'))
>     # ranks = contribs.groupBy('dst').sum('contrib') \
>     #     .select(column('dst').alias('src'), (column('sum(contrib)') * 0.85 + 0.15).alias('rank'))
>     ranks = contribs.withColumnRenamed('dst', 'dst').groupBy('dst').sum('contrib') \
>         .select(column('dst').alias('src'), (column('sum(contrib)') * 0.85 + 0.15).alias('rank'))
> ranks.orderBy(desc('rank')).show()
> {code}
> pagerank_data.txt:
> 1 2
> 1 3
> 1 4
> 2 1
> 3 1
> 4 1
> expected result:
> +---+--------------+
> |src|          rank|
> +---+--------------+
> |  1|2.326648124997|
> |  3|0.557783958333|
> |  2|0.557783958333|
> |  4|0.557783958333|
> +---+--------------+
> Wrong result (without "withColumnRenamed")
> +---+---------------+
> |src|           rank|
> +---+---------------+
> |  1| 2.326648124997|
> |  4|0.3142613194446|
> |  3|0.3142613194446|
> |  2|0.3142613194446|
> |  4|0.3142613194446|
> |  4|0.3142613194446|
> |  3|0.3142613194446|
> |  2|0.3142613194446|
> |  3|0.3142613194446|
> |  2|0.3142613194446|
> +---+---------------+
> For this small graph, it only outputs the correct rank for each node if I use 
> "withColumnRenamed". However, on the large data set, the line without 
> withColumnRenamed works correctly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


