[jira] [Commented] (SPARK-25109) spark python should retry reading another datanode if the first one fails to connect

2018-08-14 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580692#comment-16580692
 ] 

Hyukjin Kwon commented on SPARK-25109:
--

It would be helpful if we could narrow this problem down.

> spark python should retry reading another datanode if the first one fails to 
> connect
> 
>
> Key: SPARK-25109
> URL: https://issues.apache.org/jira/browse/SPARK-25109
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Yuanbo Liu
>Priority: Major
> Attachments: 
> WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png
>
>
> We use this code to read parquet files from HDFS:
> spark.read.parquet('xxx')
> and get the error below:
> !WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png!
>  
> What we can tell is that one of the replica blocks cannot be read for some 
> reason, but Spark Python doesn't try to read another replica that can be 
> read successfully, so the application fails after throwing an exception.
> When I use hadoop fs -text to read the file, I get the content correctly. It 
> would be great if Spark Python could retry reading another replica block 
> instead of failing.
>  






[jira] [Commented] (SPARK-25109) spark python should retry reading another datanode if the first one fails to connect

2018-08-14 Thread Yuanbo Liu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580689#comment-16580689
 ] 

Yuanbo Liu commented on SPARK-25109:


[~hyukjin.kwon] Thanks for your comments. Not sure about that; we didn't use the 
Scala API in our cluster.

> spark python should retry reading another datanode if the first one fails to 
> connect
> 
>
> Key: SPARK-25109
> URL: https://issues.apache.org/jira/browse/SPARK-25109
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Yuanbo Liu
>Priority: Major
> Attachments: 
> WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png
>
>
> We use this code to read parquet files from HDFS:
> spark.read.parquet('xxx')
> and get the error below:
> !WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png!
>  
> What we can tell is that one of the replica blocks cannot be read for some 
> reason, but Spark Python doesn't try to read another replica that can be 
> read successfully, so the application fails after throwing an exception.
> When I use hadoop fs -text to read the file, I get the content correctly. It 
> would be great if Spark Python could retry reading another replica block 
> instead of failing.
>  






[jira] [Commented] (SPARK-25120) EventLogListener may miss driver SparkListenerBlockManagerAdded event

2018-08-14 Thread deshanxiao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580670#comment-16580670
 ] 

deshanxiao commented on SPARK-25120:


Sure. I find that the "Executors" tab in the History Server sometimes misses the 
driver's info in the executor-id column, which is inconvenient when we analyze a 
driver problem. [~hyukjin.kwon]

> EventLogListener may miss driver SparkListenerBlockManagerAdded event 
> --
>
> Key: SPARK-25120
> URL: https://issues.apache.org/jira/browse/SPARK-25120
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: deshanxiao
>Priority: Major
>
> Sometimes, in the Spark History Server's "Executors" tab, the driver 
> information cannot be found because the SparkListenerBlockManagerAdded event 
> is lost.






[jira] [Commented] (SPARK-25120) EventLogListener may miss driver SparkListenerBlockManagerAdded event

2018-08-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580658#comment-16580658
 ] 

Apache Spark commented on SPARK-25120:
--

User 'deshanxiao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22109

> EventLogListener may miss driver SparkListenerBlockManagerAdded event 
> --
>
> Key: SPARK-25120
> URL: https://issues.apache.org/jira/browse/SPARK-25120
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: deshanxiao
>Priority: Major
>
> Sometimes, in the Spark History Server's "Executors" tab, the driver 
> information cannot be found because the SparkListenerBlockManagerAdded event 
> is lost.






[jira] [Assigned] (SPARK-25120) EventLogListener may miss driver SparkListenerBlockManagerAdded event

2018-08-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25120:


Assignee: (was: Apache Spark)

> EventLogListener may miss driver SparkListenerBlockManagerAdded event 
> --
>
> Key: SPARK-25120
> URL: https://issues.apache.org/jira/browse/SPARK-25120
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: deshanxiao
>Priority: Major
>
> Sometimes, in the Spark History Server's "Executors" tab, the driver 
> information cannot be found because the SparkListenerBlockManagerAdded event 
> is lost.






[jira] [Assigned] (SPARK-25120) EventLogListener may miss driver SparkListenerBlockManagerAdded event

2018-08-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25120:


Assignee: Apache Spark

> EventLogListener may miss driver SparkListenerBlockManagerAdded event 
> --
>
> Key: SPARK-25120
> URL: https://issues.apache.org/jira/browse/SPARK-25120
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: deshanxiao
>Assignee: Apache Spark
>Priority: Major
>
> Sometimes, in the Spark History Server's "Executors" tab, the driver 
> information cannot be found because the SparkListenerBlockManagerAdded event 
> is lost.






[jira] [Commented] (SPARK-25120) EventLogListener may miss driver SparkListenerBlockManagerAdded event

2018-08-14 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580648#comment-16580648
 ] 

Hyukjin Kwon commented on SPARK-25120:
--

Can you describe it in a bit more detail? When does it happen? Also, could you 
upload a screenshot and describe the expected output, please?

> EventLogListener may miss driver SparkListenerBlockManagerAdded event 
> --
>
> Key: SPARK-25120
> URL: https://issues.apache.org/jira/browse/SPARK-25120
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: deshanxiao
>Priority: Major
>
> Sometimes, in the Spark History Server's "Executors" tab, the driver 
> information cannot be found because the SparkListenerBlockManagerAdded event 
> is lost.






[jira] [Commented] (SPARK-25109) spark python should retry reading another datanode if the first one fails to connect

2018-08-14 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580647#comment-16580647
 ] 

Hyukjin Kwon commented on SPARK-25109:
--

Does the same thing happen with the Scala API too?

> spark python should retry reading another datanode if the first one fails to 
> connect
> 
>
> Key: SPARK-25109
> URL: https://issues.apache.org/jira/browse/SPARK-25109
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Yuanbo Liu
>Priority: Major
> Attachments: 
> WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png
>
>
> We use this code to read parquet files from HDFS:
> spark.read.parquet('xxx')
> and get the error below:
> !WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png!
>  
> What we can tell is that one of the replica blocks cannot be read for some 
> reason, but Spark Python doesn't try to read another replica that can be 
> read successfully, so the application fails after throwing an exception.
> When I use hadoop fs -text to read the file, I get the content correctly. It 
> would be great if Spark Python could retry reading another replica block 
> instead of failing.
>  






[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-08-14 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580644#comment-16580644
 ] 

Wenchen Fan commented on SPARK-24771:
-

I've sent the email; let's wait for the feedback.

> Upgrade AVRO version from 1.7.7 to 1.8
> --
>
> Key: SPARK-24771
> URL: https://issues.apache.org/jira/browse/SPARK-24771
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>  Labels: release-notes
> Fix For: 2.4.0
>
>







[jira] [Created] (SPARK-25120) EventLogListener may miss driver SparkListenerBlockManagerAdded event

2018-08-14 Thread deshanxiao (JIRA)
deshanxiao created SPARK-25120:
--

 Summary: EventLogListener may miss driver 
SparkListenerBlockManagerAdded event 
 Key: SPARK-25120
 URL: https://issues.apache.org/jira/browse/SPARK-25120
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: deshanxiao


Sometimes, in the Spark History Server's "Executors" tab, the driver 
information cannot be found because the SparkListenerBlockManagerAdded event 
is lost.






[jira] [Updated] (SPARK-25083) remove the type erasure hack in data source scan

2018-08-14 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-25083:

Target Version/s: 3.0.0

> remove the type erasure hack in data source scan
> 
>
> Key: SPARK-25083
> URL: https://issues.apache.org/jira/browse/SPARK-25083
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> It's hacky to pretend an `RDD[ColumnarBatch]` is an `RDD[InternalRow]`. We 
> should make the type explicit.






[jira] [Commented] (SPARK-25083) remove the type erasure hack in data source scan

2018-08-14 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580638#comment-16580638
 ] 

Wenchen Fan commented on SPARK-25083:
-

I've set the target version to 3.0. It's a code refactoring and has no impact on 
end users.

> remove the type erasure hack in data source scan
> 
>
> Key: SPARK-25083
> URL: https://issues.apache.org/jira/browse/SPARK-25083
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> It's hacky to pretend an `RDD[ColumnarBatch]` is an `RDD[InternalRow]`. We 
> should make the type explicit.






[jira] [Resolved] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-08-14 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-23874.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21939
[https://github.com/apache/spark/pull/21939]

> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has a common interface for reset to clean up complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  






[jira] [Resolved] (SPARK-25115) Eliminate extra memory copy done when a ByteBuf is used that is backed by > 1 ByteBuffer.

2018-08-14 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-25115.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22105
[https://github.com/apache/spark/pull/22105]

> Eliminate extra memory copy done when a ByteBuf is used that is backed by 
> > 1 ByteBuffer.
> -
>
> Key: SPARK-25115
> URL: https://issues.apache.org/jira/browse/SPARK-25115
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.3.1
>Reporter: Norman Maurer
>Assignee: Norman Maurer
>Priority: Major
> Fix For: 2.4.0
>
>
> Calling ByteBuf.nioBuffer(...) can be costly when the ByteBuf is backed by 
> more than 1 ByteBuffer. In this case it makes more sense to call 
> ByteBuf.nioBuffers(...) and iterate over the returned ByteBuffer[].






[jira] [Assigned] (SPARK-25115) Eliminate extra memory copy done when a ByteBuf is used that is backed by > 1 ByteBuffer.

2018-08-14 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-25115:
---

Assignee: Norman Maurer

> Eliminate extra memory copy done when a ByteBuf is used that is backed by 
> > 1 ByteBuffer.
> -
>
> Key: SPARK-25115
> URL: https://issues.apache.org/jira/browse/SPARK-25115
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.3.1
>Reporter: Norman Maurer
>Assignee: Norman Maurer
>Priority: Major
> Fix For: 2.4.0
>
>
> Calling ByteBuf.nioBuffer(...) can be costly when the ByteBuf is backed by 
> more than 1 ByteBuffer. In this case it makes more sense to call 
> ByteBuf.nioBuffers(...) and iterate over the returned ByteBuffer[].






[jira] [Resolved] (SPARK-25113) Add logging to CodeGenerator when any generated method's bytecode size goes above HugeMethodLimit

2018-08-14 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25113.
-
   Resolution: Fixed
 Assignee: Kris Mok
Fix Version/s: 2.4.0

> Add logging to CodeGenerator when any generated method's bytecode size goes 
> above HugeMethodLimit
> -
>
> Key: SPARK-25113
> URL: https://issues.apache.org/jira/browse/SPARK-25113
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Major
> Fix For: 2.4.0
>
>
> Add logging to help collect statistics on how often real-world usage sees the 
> {{CodeGenerator}} generating methods whose bytecode size goes above the 
> 8000-byte (HugeMethodLimit) threshold.






[jira] [Updated] (SPARK-25119) stages in wrong order within job page DAG chart

2018-08-14 Thread Yunjian Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yunjian Zhang updated SPARK-25119:
--
Attachment: Screen Shot 2018-08-14 at 3.35.34 PM.png

> stages in wrong order within job page DAG chart
> ---
>
> Key: SPARK-25119
> URL: https://issues.apache.org/jira/browse/SPARK-25119
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Yunjian Zhang
>Priority: Minor
> Attachments: Screen Shot 2018-08-14 at 3.35.34 PM.png
>
>
> Multiple stages for the same job are shown in the wrong order in the DAG 
> Visualization of the job page.
> e.g.
> stage27   stage19 stage20 stage24 stage21






[jira] [Updated] (SPARK-25119) stages in wrong order within job page DAG chart

2018-08-14 Thread Yunjian Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yunjian Zhang updated SPARK-25119:
--
Description: 
Multiple stages for the same job are shown in the wrong order in the DAG 
Visualization of the job page.

e.g.

stage27   stage19 stage20 stage24 stage21

  was:
multiple stages for the same job are shown in the wrong order in the job page.

e.g.

 


> stages in wrong order within job page DAG chart
> ---
>
> Key: SPARK-25119
> URL: https://issues.apache.org/jira/browse/SPARK-25119
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Yunjian Zhang
>Priority: Minor
> Attachments: Screen Shot 2018-08-14 at 3.35.34 PM.png
>
>
> Multiple stages for the same job are shown in the wrong order in the DAG 
> Visualization of the job page.
> e.g.
> stage27   stage19 stage20 stage24 stage21






[jira] [Created] (SPARK-25119) stages in wrong order within job page DAG chart

2018-08-14 Thread Yunjian Zhang (JIRA)
Yunjian Zhang created SPARK-25119:
-

 Summary: stages in wrong order within job page DAG chart
 Key: SPARK-25119
 URL: https://issues.apache.org/jira/browse/SPARK-25119
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.1
Reporter: Yunjian Zhang


Multiple stages for the same job are shown in the wrong order in the job page.

e.g.

 






[jira] [Commented] (SPARK-20384) supporting value classes over primitives in DataSets

2018-08-14 Thread Minh Thai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580498#comment-16580498
 ] 

Minh Thai commented on SPARK-20384:
---

_(from my comment in SPARK-17368)_
I think the main problem is that, even today, there is no way to create an encoder 
specifically for value classes. However, I think we can make a [universal 
trait|https://docs.scala-lang.org/overviews/core/value-classes.html] called 
{{OpaqueValue}}^1^ to be used as an upper type bound in the encoder. This means:
 - Any user-defined value class has to mix in {{OpaqueValue}}
 - An encoder can be created to target those value classes.

{code:java}
trait OpaqueValue extends Any
implicit def newValueClassEncoder[T <: Product with OpaqueValue : TypeTag] = ???

case class Id(value: Int) extends AnyVal with OpaqueValue
{code}
This doesn't clash with the existing encoder for case classes, since the type 
constraint is more specific:
{code:java}
implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] = 
Encoders.product[T]
{code}
_I'm experimenting with this on my fork and will make a PR if it works well._

_(1) the name is inspired by the [Opaque 
Type|https://docs.scala-lang.org/sips/opaque-types.html] feature of Scala 3_

> supporting value classes over primitives in DataSets
> 
>
> Key: SPARK-20384
> URL: https://issues.apache.org/jira/browse/SPARK-20384
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Davis
>Priority: Minor
>
> As a Spark user who uses value classes in Scala for modelling domain objects, 
> I would also like to make use of them in Datasets. 
> For example, I would like to use the {{User}} case class, which uses a 
> value class for its {{id}}, as the type for a Dataset:
> - the underlying primitive should be mapped to the value-class column
> - functions on the column (for example comparison) should only work if 
> defined on the value class, and should use those implementations
> - show() should pick up the toString method of the value class
> {code}
> case class Id(value: Long) extends AnyVal {
>   override def toString: String = value.toHexString
> }
> case class User(id: Id, name: String)
> val ds = spark.sparkContext
>   .parallelize(0L to 12L).map(i => (i, f"name-$i")).toDS()
>   .withColumnRenamed("_1", "id")
>   .withColumnRenamed("_2", "name")
> // mapping should work
> val usrs = ds.as[User]
> // show should use toString
> usrs.show()
> // comparison with long should throw exception, as not defined on Id
> usrs.col("id") > 0L
> {code}
> For example `.show()` should use the toString of the `Id` value class:
> {noformat}
> +---+---+
> | id|   name|
> +---+---+
> |  0| name-0|
> |  1| name-1|
> |  2| name-2|
> |  3| name-3|
> |  4| name-4|
> |  5| name-5|
> |  6| name-6|
> |  7| name-7|
> |  8| name-8|
> |  9| name-9|
> |  A|name-10|
> |  B|name-11|
> |  C|name-12|
> +---+---+
> {noformat}






[jira] [Commented] (SPARK-25092) Add RewriteExceptAll and RewriteIntersectAll in the list of nonExcludableRules

2018-08-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580464#comment-16580464
 ] 

Apache Spark commented on SPARK-25092:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/22108

> Add RewriteExceptAll and RewriteIntersectAll in the list of nonExcludableRules
> --
>
> Key: SPARK-25092
> URL: https://issues.apache.org/jira/browse/SPARK-25092
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 2.4.0
>
>
> Add RewriteExceptAll and RewriteIntersectAll to the list of 
> nonExcludableRules, as these rewrites are essential for the functioning of 
> the EXCEPT ALL and INTERSECT ALL feature.
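
For context, a minimal sketch (my own illustration, not part of this issue) of why these 
rewrites must not be excludable: users can disable optimizer rules via the 
spark.sql.optimizer.excludedRules setting, and if RewriteExceptAll could be switched off that 
way, EXCEPT ALL queries would no longer work. The fully-qualified rule name below is an 
assumption used only for illustration.

{code:java}
import org.apache.spark.sql.SparkSession;

public class ExcludedRulesSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();
    // Attempt to exclude the rewrite rule (class name assumed for illustration).
    // With the fix, the rule sits in nonExcludableRules, so this setting is
    // ignored for it and the query below still runs.
    spark.conf().set(
        "spark.sql.optimizer.excludedRules",
        "org.apache.spark.sql.catalyst.optimizer.RewriteExceptAll");
    spark.sql("SELECT 1 EXCEPT ALL SELECT 1").show();
    spark.stop();
  }
}
{code}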






[jira] [Updated] (SPARK-25118) Need a solution to persist Spark application console outputs when running in shell/yarn client mode

2018-08-14 Thread Ankur Gupta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Gupta updated SPARK-25118:

Description: 
We execute Spark applications in YARN client mode a lot of the time. When we do 
so, the Spark driver logs are printed to the console.

We need a solution to persist the console output for later use. This can be 
either for troubleshooting or for some other log analysis. 
Ideally, we would like to persist it along with the YARN logs (when the 
application is run in YARN client mode). Also, this has to be end-user 
agnostic, so that the logs are available later without requiring the end user 
to make any configuration changes.

  was:
We execute Spark applications in YARN client mode a lot of the time. When we do 
so, the Spark driver logs are printed to the console.

We need a solution to persist the console output for later use. This can be 
either for troubleshooting or for some other log analysis. 
Ideally, we would like to persist it along with the YARN logs (when the 
application is run in YARN client mode).


> Need a solution to persist Spark application console outputs when running in 
> shell/yarn client mode
> ---
>
> Key: SPARK-25118
> URL: https://issues.apache.org/jira/browse/SPARK-25118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: Ankur Gupta
>Priority: Major
>
> We execute Spark applications in YARN client mode a lot of the time. When we do 
> so, the Spark driver logs are printed to the console.
> We need a solution to persist the console output for later use. This can 
> be either for troubleshooting or for some other log analysis. 
> Ideally, we would like to persist it along with the YARN logs (when 
> the application is run in YARN client mode). Also, this has to be end-user 
> agnostic, so that the logs are available later without requiring 
> the end user to make any configuration changes.






[jira] [Created] (SPARK-25118) Need a solution to persist Spark application console outputs when running in shell/yarn client mode

2018-08-14 Thread Ankur Gupta (JIRA)
Ankur Gupta created SPARK-25118:
---

 Summary: Need a solution to persist Spark application console 
outputs when running in shell/yarn client mode
 Key: SPARK-25118
 URL: https://issues.apache.org/jira/browse/SPARK-25118
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 2.3.0, 2.2.0, 2.1.0, 2.0.0
Reporter: Ankur Gupta


We execute Spark applications in YARN client mode a lot of the time. When we do 
so, the Spark driver logs are printed to the console.

We need a solution to persist the console output for later use. This can be 
either for troubleshooting or for some other log analysis. 
Ideally, we would like to persist it along with the YARN logs (when the 
application is run in YARN client mode).






[jira] [Commented] (SPARK-16406) Reference resolution for large number of columns should be faster

2018-08-14 Thread antonkulaga (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580444#comment-16580444
 ] 

antonkulaga commented on SPARK-16406:
-

Are you going to backport it to 2.3.2 as well?

> Reference resolution for large number of columns should be faster
> -
>
> Key: SPARK-16406
> URL: https://issues.apache.org/jira/browse/SPARK-16406
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Major
> Fix For: 2.4.0
>
>
> Resolving a column in a LogicalPlan takes n / 2 steps on average (n being the 
> number of columns). This gets problematic as soon as you try to resolve a large 
> number of columns (m) on a large table: O(m * n / 2)






[jira] [Commented] (SPARK-23337) withWatermark raises an exception on struct objects

2018-08-14 Thread Aydin Kocas (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580440#comment-16580440
 ] 

Aydin Kocas commented on SPARK-23337:
-

[~marmbrus] Can you give a hint on how to do it with "withColumn" in Java?

 
Dataset<Row> jsonRow = spark.readStream()
.schema(...)
.json(..)
.withColumn("createTime", ?? )
.withWatermark("createTime", "10 minutes"); 

> withWatermark raises an exception on struct objects
> ---
>
> Key: SPARK-23337
> URL: https://issues.apache.org/jira/browse/SPARK-23337
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.1
> Environment: Linux Ubuntu, Spark on standalone mode
>Reporter: Aydin Kocas
>Priority: Major
>
> Hi,
>  
> when using a nested object (I mean a field within a struct, concretely 
> _source.createTime) from a JSON file as the parameter for the 
> withWatermark method, I get an exception (see below).
> Everything else works flawlessly with the nested object.
>  
> +*{color:#14892c}works:{color}*+ 
> {code:java}
> Dataset jsonRow = 
> spark.readStream().schema(getJSONSchema()).json(file).dropDuplicates("_id").withWatermark("myTime",
>  "10 seconds").toDF();{code}
>  
> json structure:
> {code:java}
> root
>  |-- _id: string (nullable = true)
>  |-- _index: string (nullable = true)
>  |-- _score: long (nullable = true)
>  |-- myTime: timestamp (nullable = true)
> ..{code}
> +*{color:#d04437}does not work - nested json{color}:*+
> {code:java}
> Dataset jsonRow = 
> spark.readStream().schema(getJSONSchema()).json(file).dropDuplicates("_id").withWatermark("_source.createTime",
>  "10 seconds").toDF();{code}
>  
> json structure:
>  
> {code:java}
> root
>  |-- _id: string (nullable = true)
>  |-- _index: string (nullable = true)
>  |-- _score: long (nullable = true)
>  |-- _source: struct (nullable = true)
>  | |-- createTime: timestamp (nullable = true)
> ..
>  
> Exception in thread "main" 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree:
> 'EventTimeWatermark '_source.createTime, interval 10 seconds
> +- Deduplicate [_id#0], true
>  +- StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@5dbbb292,json,List(),Some(StructType(StructField(_id,StringType,true),
>  StructField(_index,StringType,true), StructField(_score,LongType,true), 
> StructField(_source,StructType(StructField(additionalData,StringType,true), 
> StructField(client,StringType,true), 
> StructField(clientDomain,BooleanType,true), 
> StructField(clientVersion,StringType,true), 
> StructField(country,StringType,true), 
> StructField(countryName,StringType,true), 
> StructField(createTime,TimestampType,true), 
> StructField(externalIP,StringType,true), 
> StructField(hostname,StringType,true), 
> StructField(internalIP,StringType,true), 
> StructField(location,StringType,true), 
> StructField(locationDestination,StringType,true), 
> StructField(login,StringType,true), 
> StructField(originalRequestString,StringType,true), 
> StructField(password,StringType,true), 
> StructField(peerIdent,StringType,true), 
> StructField(peerType,StringType,true), 
> StructField(recievedTime,TimestampType,true), 
> StructField(sessionEnd,StringType,true), 
> StructField(sessionStart,StringType,true), 
> StructField(sourceEntryAS,StringType,true), 
> StructField(sourceEntryIp,StringType,true), 
> StructField(sourceEntryPort,StringType,true), 
> StructField(targetCountry,StringType,true), 
> StructField(targetCountryName,StringType,true), 
> StructField(targetEntryAS,StringType,true), 
> StructField(targetEntryIp,StringType,true), 
> StructField(targetEntryPort,StringType,true), 
> StructField(targetport,StringType,true), 
> StructField(username,StringType,true), 
> StructField(vulnid,StringType,true)),true), 
> StructField(_type,StringType,true))),List(),None,Map(path -> ./input/),None), 
> FileSource[./input/], [_id#0, _index#1, _score#2L, _source#3, _type#4]
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:300)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:268)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:854)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:796)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
>  at 
> 

[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools

2018-08-14 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580299#comment-16580299
 ] 

Imran Rashid commented on SPARK-24938:
--

{quote} So perhaps the fix here is not to use the default netty pool, but to 
use ctx.alloc().buffer() instead of .heapBuffer()? {quote}

Yeah, that's what I meant: that {{alloc().buffer()}} could be on-heap or 
off-heap, depending on how netty's allocator is configured (in Spark via 
spark..io.preferDirectBufs).
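
To make the distinction concrete, a minimal sketch (my own illustration, not the actual 
Spark code) of the two allocation calls being discussed, in a Netty handler:

{code:java}
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;

class HeaderAllocationSketch {
  ByteBuf allocateHeader(ChannelHandlerContext ctx, int headerLength) {
    // heapBuffer(...) always returns an on-heap buffer, even when the pooled
    // allocator is configured to prefer direct (off-heap) buffers:
    // ByteBuf header = ctx.alloc().heapBuffer(headerLength);

    // buffer(...) lets the allocator decide between on-heap and off-heap,
    // which is what a preferDirectBufs-style setting influences:
    return ctx.alloc().buffer(headerLength);
  }
}
{code}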

> Understand usage of netty's onheap memory use, even with offheap pools
> --
>
> Key: SPARK-24938
> URL: https://issues.apache.org/jira/browse/SPARK-24938
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> We've observed that netty uses a large amount of on-heap memory in its pools, in 
> addition to the expected off-heap memory, when I added some instrumentation 
> (using SPARK-24918 and https://github.com/squito/spark-memory). We should 
> figure out why it's using that memory, and whether it's really necessary.
> It might be just this one line:
> https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82
> which means that even with a small burst of messages, each arena will grow by 
> 16 MB, which could lead to a 128 MB spike of an almost entirely unused pool. 
> Switching to requesting a buffer from the default pool would probably fix 
> this.






[jira] [Updated] (SPARK-24838) Support uncorrelated IN/EXISTS subqueries for more operators

2018-08-14 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24838:

Target Version/s: 2.4.0

> Support uncorrelated IN/EXISTS subqueries for more operators 
> -
>
> Key: SPARK-24838
> URL: https://issues.apache.org/jira/browse/SPARK-24838
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Qifan Pu
>Priority: Major
>
> Currently, CheckAnalysis allows IN/EXISTS subqueries only in filter operators. 
> Running a query:
> {{select name in (select * from valid_names)}}
> {{from all_names}}
> returns the error:
> {code:java}
> Error in SQL statement: AnalysisException: IN/EXISTS predicate sub-queries 
> can only be used in a Filter
> {code}






[jira] [Commented] (SPARK-24838) Support uncorrelated IN/EXISTS subqueries for more operators

2018-08-14 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580298#comment-16580298
 ] 

Xiao Li commented on SPARK-24838:
-

[~maurits] Any update? 

Also cc [~liwen] [~hvanhovell]



> Support uncorrelated IN/EXISTS subqueries for more operators 
> -
>
> Key: SPARK-24838
> URL: https://issues.apache.org/jira/browse/SPARK-24838
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Qifan Pu
>Priority: Major
>
> Currently, CheckAnalysis allows IN/EXISTS subqueries only in filter operators. 
> Running a query:
> {{select name in (select * from valid_names)}}
> {{from all_names}}
> returns the error:
> {code:java}
> Error in SQL statement: AnalysisException: IN/EXISTS predicate sub-queries 
> can only be used in a Filter
> {code}






[jira] [Updated] (SPARK-24838) Support uncorrelated IN/EXISTS subqueries for more operators

2018-08-14 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24838:

Labels:   (was: spree)

> Support uncorrelated IN/EXISTS subqueries for more operators 
> -
>
> Key: SPARK-24838
> URL: https://issues.apache.org/jira/browse/SPARK-24838
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Qifan Pu
>Priority: Major
>
> Currently, CheckAnalysis allows IN/EXISTS subqueries only in filter operators. 
> Running a query:
> {{select name in (select * from valid_names)}}
> {{from all_names}}
> returns the error:
> {code:java}
> Error in SQL statement: AnalysisException: IN/EXISTS predicate sub-queries 
> can only be used in a Filter
> {code}






[jira] [Commented] (SPARK-25105) Importing all of pyspark.sql.functions should bring PandasUDFType in as well

2018-08-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580285#comment-16580285
 ] 

Apache Spark commented on SPARK-25105:
--

User 'kevinyu98' has created a pull request for this issue:
https://github.com/apache/spark/pull/22100

> Importing all of pyspark.sql.functions should bring PandasUDFType in as well
> 
>
> Key: SPARK-25105
> URL: https://issues.apache.org/jira/browse/SPARK-25105
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>
>  
> {code:java}
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> Traceback (most recent call last):
>  File "", line 1, in 
> NameError: name 'PandasUDFType' is not defined
>  
> {code}
> When explicitly imported it works fine:
> {code:java}
>  
> >>> from pyspark.sql.functions import PandasUDFType
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> {code}
>  
> We just need to make sure it's included in __all__.






[jira] [Assigned] (SPARK-25105) Importing all of pyspark.sql.functions should bring PandasUDFType in as well

2018-08-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25105:


Assignee: (was: Apache Spark)

> Importing all of pyspark.sql.functions should bring PandasUDFType in as well
> 
>
> Key: SPARK-25105
> URL: https://issues.apache.org/jira/browse/SPARK-25105
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>
>  
> {code:java}
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> Traceback (most recent call last):
>  File "", line 1, in 
> NameError: name 'PandasUDFType' is not defined
>  
> {code}
> When explicitly imported it works fine:
> {code:java}
>  
> >>> from pyspark.sql.functions import PandasUDFType
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> {code}
>  
> We just need to make sure it's included in __all__.






[jira] [Assigned] (SPARK-25105) Importing all of pyspark.sql.functions should bring PandasUDFType in as well

2018-08-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25105:


Assignee: Apache Spark

> Importing all of pyspark.sql.functions should bring PandasUDFType in as well
> 
>
> Key: SPARK-25105
> URL: https://issues.apache.org/jira/browse/SPARK-25105
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Trivial
>
>  
> {code:java}
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> Traceback (most recent call last):
>  File "", line 1, in 
> NameError: name 'PandasUDFType' is not defined
>  
> {code}
> When explicitly imported it works fine:
> {code:java}
>  
> >>> from pyspark.sql.functions import PandasUDFType
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> {code}
>  
> We just need to make sure it's included in __all__.






[jira] [Commented] (SPARK-21375) Add date and timestamp support to ArrowConverters for toPandas() collection

2018-08-14 Thread Eric Wohlstadter (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580275#comment-16580275
 ] 

Eric Wohlstadter commented on SPARK-21375:
--

[~bryanc]

Hi Bryan,

 I'm using the Spark-Arrow conversion support inside of a DataSourceV2  
{{SupportsColumnBatchScan}} DataReader. It uses {{ArrowStreamReader}} to read 
from the external data source, and converts the input from the stream to 
Spark's {{ArrowColumnVector}}.

I'm having trouble when the original input comes from a Hive TimeStamp (without 
a timezone). It looks like {{ArrowColumnVector}} requires 
{{TimeStampMicroTZVector}}.

So I need to fill in a time zone when creating the {{TimeStampMicroTZVector}} 
on the Writer-side of the arrow stream.

This creates some inconsistency when the two ends of the arrow stream are in 
different time zones. 

I'm wondering if I might be missing some other way of handling this correctly. 
Would you happen to know a better way to handle conversion of Timestamp 
(without time zone) using the Spark-Arrow conversion support?

 

/cc [~dongjoon] [~hyukjin.kwon]

> Add date and timestamp support to ArrowConverters for toPandas() collection
> ---
>
> Key: SPARK-21375
> URL: https://issues.apache.org/jira/browse/SPARK-21375
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.3.0
>
>
> Date and timestamp are not yet supported in DataFrame.toPandas() using 
> ArrowConverters.  These are common types for data analysis used in both Spark 
> and Pandas and should be supported.
> There is a discrepancy with the way that PySpark and Arrow store timestamps, 
> without timezone specified, internally.  PySpark takes a UTC timestamp that 
> is adjusted to local time and Arrow is in UTC time.  Hopefully there is a 
> clean way to resolve this.
> Spark internal storage spec:
> * *DateType* stored as days
> * *Timestamp* stored as microseconds 






[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools

2018-08-14 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580260#comment-16580260
 ] 

Marcelo Vanzin commented on SPARK-24938:


I think that unless there's a measurable performance issue, we should switch to 
using the allocator's default mode. Otherwise, since these headers are pretty 
small, we could just switch to unpooled buffers.

> Understand usage of netty's onheap memory use, even with offheap pools
> --
>
> Key: SPARK-24938
> URL: https://issues.apache.org/jira/browse/SPARK-24938
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> We've observed that netty uses a large amount of on-heap memory in its pools, in 
> addition to the expected off-heap memory, when I added some instrumentation 
> (using SPARK-24918 and https://github.com/squito/spark-memory). We should 
> figure out why it's using that memory, and whether it's really necessary.
> It might be just this one line:
> https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82
> which means that even with a small burst of messages, each arena will grow by 
> 16 MB, which could lead to a 128 MB spike of an almost entirely unused pool. 
> Switching to requesting a buffer from the default pool would probably fix 
> this.






[jira] [Commented] (SPARK-22236) CSV I/O: does not respect RFC 4180

2018-08-14 Thread Joe Pallas (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580229#comment-16580229
 ] 

Joe Pallas commented on SPARK-22236:


{quote}people with preexisting datasets exported by Spark would suffer 
(unexpected) data loss
{quote}
Yes, that is an important consideration. But is the implication that we can 
never change the default behavior to match the standard, *even in a major 
revision*, because of the impact on reading previously written data? I don't 
see how that would be sensible.

So, how can we minimize the risk associated with making this change?
 * Documentation is certainly one way, in both the API doc and the release 
notes.
 * We could add a {{spark2-compat}} option (equivalent to setting escape, but 
the name emphasizes the compatibility semantics).
 * We could flag a warning or error if we see {{\"}} in the input. Since that 
could appear in legal RFC4180 input, we might then want an option to suppress 
the warning/error when we know the input was not generated with the old 
settings.

I've no idea how difficult the implementation would be on that last one.

> CSV I/O: does not respect RFC 4180
> --
>
> Key: SPARK-22236
> URL: https://issues.apache.org/jira/browse/SPARK-22236
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.2.0
>Reporter: Ondrej Kokes
>Priority: Minor
>
> When reading or writing CSV files with Spark, double quotes are escaped with 
> a backslash by default. However, the appropriate behaviour as set out by RFC 
> 4180 (and adhered to by many software packages) is to escape using a second 
> double quote.
> This piece of Python code demonstrates the issue
> {code}
> import csv
> with open('testfile.csv', 'w') as f:
> cw = csv.writer(f)
> cw.writerow(['a 2.5" drive', 'another column'])
> cw.writerow(['a "quoted" string', '"quoted"'])
> cw.writerow([1,2])
> with open('testfile.csv') as f:
> print(f.read())
> # "a 2.5"" drive",another column
> # "a ""quoted"" string","""quoted"""
> # 1,2
> spark.read.csv('testfile.csv').collect()
> # [Row(_c0='"a 2.5"" drive"', _c1='another column'),
> #  Row(_c0='"a ""quoted"" string"', _c1='"""quoted"""'),
> #  Row(_c0='1', _c1='2')]
> # explicitly stating the escape character fixed the issue
> spark.read.option('escape', '"').csv('testfile.csv').collect()
> # [Row(_c0='a 2.5" drive', _c1='another column'),
> #  Row(_c0='a "quoted" string', _c1='"quoted"'),
> #  Row(_c0='1', _c1='2')]
> {code}
> The same applies to writes, where reading the file written by Spark may 
> result in garbage.
> {code}
> df = spark.read.option('escape', '"').csv('testfile.csv') # reading the file 
> correctly
> df.write.format("csv").save('testout.csv')
> with open('testout.csv/part-csv') as f:
> cr = csv.reader(f)
> print(next(cr))
> print(next(cr))
> # ['a 2.5\\ drive"', 'another column']
> # ['a \\quoted\\" string"', '\\quoted\\""']
> {code}
> The culprit is in 
> [CSVOptions.scala|https://github.com/apache/spark/blob/7d0a3ef4ced9684457ad6c5924c58b95249419e1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L91],
>  where the default escape character is overridden.
> While it's possible to work with CSV files in a "compatible" manner, it would 
> be useful if Spark had sensible defaults that conform to the above-mentioned 
> RFC (as well as W3C recommendations). I realise this would be a breaking 
> change and thus if accepted, it would probably need to result in a warning 
> first, before moving to a new default.






[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools

2018-08-14 Thread Nihar Sheth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580225#comment-16580225
 ] 

Nihar Sheth commented on SPARK-24938:
-

His comment for that change is "The header is a very small buffer, I thought it 
was overkill to try to get direct byte bufs for it." Not sure why that would be 
an issue, though; wouldn't the already-allocated off-heap memory be used?

> Understand usage of netty's onheap memory use, even with offheap pools
> --
>
> Key: SPARK-24938
> URL: https://issues.apache.org/jira/browse/SPARK-24938
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> We've observed that netty uses a large amount of on-heap memory in its pools, in 
> addition to the expected off-heap memory, when I added some instrumentation 
> (using SPARK-24918 and https://github.com/squito/spark-memory). We should 
> figure out why it's using that memory, and whether it's really necessary.
> It might be just this one line:
> https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82
> which means that even with a small burst of messages, each arena will grow by 
> 16 MB, which could lead to a 128 MB spike of an almost entirely unused pool. 
> Switching to requesting a buffer from the default pool would probably fix 
> this.






[jira] [Assigned] (SPARK-25043) spark-sql should print the appId and master on startup

2018-08-14 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-25043:
-

Assignee: Alessandro Bellina

> spark-sql should print the appId and master on startup
> --
>
> Key: SPARK-25043
> URL: https://issues.apache.org/jira/browse/SPARK-25043
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Alessandro Bellina
>Assignee: Alessandro Bellina
>Priority: Trivial
> Fix For: 2.4.0
>
>
> In spark-sql, if logging is turned down all the way, it's not possible to 
> find out which appId is running at the moment. This small change adds a print to 
> stdout containing the master type and the appId so that they are on screen.
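
For illustration, a minimal sketch of the kind of startup print the description asks for 
(assumed code, not the actual spark-sql patch):

{code:java}
import org.apache.spark.sql.SparkSession;

public class PrintAppInfo {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();
    // Print the master type and the application id regardless of the log level.
    System.out.println("Spark master: " + spark.sparkContext().master()
        + ", Application Id: " + spark.sparkContext().applicationId());
    spark.stop();
  }
}
{code}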






[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools

2018-08-14 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580213#comment-16580213
 ] 

Marcelo Vanzin commented on SPARK-24938:


BTW the change from buffer() to heapBuffer() was made in SPARK-4188, but I 
don't really expect Aaron to remember why at this point (if we can even reach 
him).

> Understand usage of netty's onheap memory use, even with offheap pools
> --
>
> Key: SPARK-24938
> URL: https://issues.apache.org/jira/browse/SPARK-24938
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> We've observed that netty uses a large amount of on-heap memory in its pools, in 
> addition to the expected off-heap memory, when I added some instrumentation 
> (using SPARK-24918 and https://github.com/squito/spark-memory). We should 
> figure out why it's using that memory, and whether it's really necessary.
> It might be just this one line:
> https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82
> which means that even with a small burst of messages, each arena will grow by 
> 16 MB, which could lead to a 128 MB spike of an almost entirely unused pool. 
> Switching to requesting a buffer from the default pool would probably fix 
> this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25043) spark-sql should print the appId and master on startup

2018-08-14 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-25043.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

> spark-sql should print the appId and master on startup
> --
>
> Key: SPARK-25043
> URL: https://issues.apache.org/jira/browse/SPARK-25043
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Alessandro Bellina
>Priority: Trivial
> Fix For: 2.4.0
>
>
> In spark-sql, if logging is turned down all the way, it's not possible to 
> find out what appId is running at the moment. This small change adds a print 
> to stdout containing the master type and the appId, so that information is 
> visible on screen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25117) Add EXCEPT ALL and INTERSECT ALL support in R.

2018-08-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580208#comment-16580208
 ] 

Apache Spark commented on SPARK-25117:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/22107

> Add EXCEPT ALL and INTERSECT ALL support in R.
> -
>
> Key: SPARK-25117
> URL: https://issues.apache.org/jira/browse/SPARK-25117
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25117) Add EXCEPT ALL and INTERSECT ALL support in R.

2018-08-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25117:


Assignee: (was: Apache Spark)

> Add EXCEPT ALL and INTERSECT ALL support in R.
> -
>
> Key: SPARK-25117
> URL: https://issues.apache.org/jira/browse/SPARK-25117
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25117) Add EXCEPT ALL and INTERSECT ALL support in R.

2018-08-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25117:


Assignee: Apache Spark

> Add EXCEPT ALL and INTERSECT ALL support in R.
> -
>
> Key: SPARK-25117
> URL: https://issues.apache.org/jira/browse/SPARK-25117
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25116) Fix the "exit code 1" error when terminating Kafka tests

2018-08-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25116:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Fix the "exit code 1" error when terminating Kafka tests
> 
>
> Key: SPARK-25116
> URL: https://issues.apache.org/jira/browse/SPARK-25116
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> SBT may report the following error when all Kafka tests are finished
> {code}
> sbt/sbt/0.13.17/test-interface-1.0.jar sbt.ForkMain 39359 failed with exit 
> code 1
> [error] (sql-kafka-0-10/test:test) sbt.TestsFailedException: Tests 
> unsuccessful
> {code}
> This is because we are leaking a Kafka cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25116) Fix the "exit code 1" error when terminating Kafka tests

2018-08-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580202#comment-16580202
 ] 

Apache Spark commented on SPARK-25116:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/22106

> Fix the "exit code 1" error when terminating Kafka tests
> 
>
> Key: SPARK-25116
> URL: https://issues.apache.org/jira/browse/SPARK-25116
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> SBT may report the following error when all Kafka tests are finished
> {code}
> sbt/sbt/0.13.17/test-interface-1.0.jar sbt.ForkMain 39359 failed with exit 
> code 1
> [error] (sql-kafka-0-10/test:test) sbt.TestsFailedException: Tests 
> unsuccessful
> {code}
> This is because we are leaking a Kafka cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25116) Fix the "exit code 1" error when terminating Kafka tests

2018-08-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25116:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Fix the "exit code 1" error when terminating Kafka tests
> 
>
> Key: SPARK-25116
> URL: https://issues.apache.org/jira/browse/SPARK-25116
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Major
>
> SBT may report the following error when all Kafka tests are finished
> {code}
> sbt/sbt/0.13.17/test-interface-1.0.jar sbt.ForkMain 39359 failed with exit 
> code 1
> [error] (sql-kafka-0-10/test:test) sbt.TestsFailedException: Tests 
> unsuccessful
> {code}
> This is because we are leaking a Kafka cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools

2018-08-14 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580195#comment-16580195
 ] 

Marcelo Vanzin commented on SPARK-24938:


The line you mention is this, right?

{code}
ByteBuf header = ctx.alloc().heapBuffer(headerLength);
{code}

My understanding of that line is that it's using the allocator used when 
building the client or server. So perhaps the fix here is not to use the 
default netty pool, but to use {{ctx.alloc().buffer()}} instead of 
{{.heapBuffer()}}? Seems that this way you'd be actually using the shared 
buffers when the allocator is configured for direct buffers, instead of 
initializing a heap pool just for the message encoder...
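
A rough sketch of that suggestion, assuming netty's ByteBufAllocator API (this 
is not the actual MessageEncoder change):
{code}
import io.netty.buffer.{ByteBuf, ByteBufAllocator}

// Ask the channel's configured allocator for whatever buffer type it prefers
// (direct when preferDirect is set), instead of forcing a heap arena into
// existence just for the small frame header.
def encodeHeader(alloc: ByteBufAllocator, headerLength: Int): ByteBuf = {
  // was (roughly): alloc.heapBuffer(headerLength)
  alloc.buffer(headerLength)
}
{code}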

> Understand usage of netty's onheap memory use, even with offheap pools
> --
>
> Key: SPARK-24938
> URL: https://issues.apache.org/jira/browse/SPARK-24938
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> After adding some instrumentation (using SPARK-24918 and 
> https://github.com/squito/spark-memory), we've observed that netty uses a 
> large amount of onheap memory in its pools, in addition to the expected 
> offheap memory. We should figure out why it's using that memory, and whether 
> it's really necessary.
> It might be just this one line:
> https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82
> which means that even with a small burst of messages, each arena will grow by 
> 16MB, which could lead to a 128 MB spike of an almost entirely unused pool. 
> Switching to requesting a buffer from the default pool would probably fix this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25088) Rest Server default & doc updates

2018-08-14 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25088.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Resolved by https://github.com/apache/spark/pull/22071

> Rest Server default & doc updates
> -
>
> Key: SPARK-25088
> URL: https://issues.apache.org/jira/browse/SPARK-25088
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> The rest server could use some updates on defaults & docs, both in standalone 
> and mesos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25117) Add EXCEPT ALL and INTERSECT ALL support in R.

2018-08-14 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-25117:


 Summary: Add EXCEPT ALL and INTERSECT ALL support in R.
 Key: SPARK-25117
 URL: https://issues.apache.org/jira/browse/SPARK-25117
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.1
Reporter: Dilip Biswal






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25116) Fix the "exit code 1" error when terminating Kafka tests

2018-08-14 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-25116:


 Summary: Fix the "exit code 1" error when terminating Kafka tests
 Key: SPARK-25116
 URL: https://issues.apache.org/jira/browse/SPARK-25116
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


SBT may report the following error when all Kafka tests are finished
{code}
sbt/sbt/0.13.17/test-interface-1.0.jar sbt.ForkMain 39359 failed with exit code 
1
[error] (sql-kafka-0-10/test:test) sbt.TestsFailedException: Tests unsuccessful
{code}

This is because we are leaking a Kafka cluster.
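
A minimal sketch of the cleanup pattern that avoids leaking the cluster past the 
end of the suite; TestKafkaCluster below is a stand-in, not Spark's actual test 
utility:
{code}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

// Stand-in for whatever embedded Kafka/ZooKeeper helper the suite uses.
trait TestKafkaCluster {
  def start(): Unit
  def shutdown(): Unit
}

abstract class KafkaSuiteBase(newCluster: () => TestKafkaCluster)
  extends FunSuite with BeforeAndAfterAll {

  protected var kafka: TestKafkaCluster = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    kafka = newCluster()
    kafka.start()
  }

  override def afterAll(): Unit = {
    try {
      // Always release brokers and ZK, even when tests fail, so no non-daemon
      // threads survive the suite and force sbt.ForkMain to exit with code 1.
      if (kafka != null) kafka.shutdown()
    } finally {
      super.afterAll()
    }
  }
}
{code}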



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE

2018-08-14 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25114:

Target Version/s: 2.3.2, 2.4.0  (was: 2.4.0)

> RecordBinaryComparator may return wrong result when subtraction between two 
> words is divisible by Integer.MAX_VALUE
> ---
>
> Key: SPARK-25114
> URL: https://issues.apache.org/jira/browse/SPARK-25114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Priority: Blocker
>  Labels: correctness
>
> It is possible for two objects to be unequal and yet for 
> RecordBinaryComparator to consider them equal, if the long values are 
> separated by Int.MaxValue.
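
A small, hypothetical illustration of the hazard being described (not Spark's 
actual comparator code): deriving a comparison result by subtracting two longs 
and then narrowing it can report two different words as equal.
{code}
// Two distinct longs whose difference is a multiple of 2^32.
val a = 0L
val b = 1L << 32

// Truncating the difference to an int loses all the distinguishing bits.
val unsafe = (a - b).toInt                // 0, i.e. "equal", although a != b

// Comparing without subtraction avoids the problem.
val safe = java.lang.Long.compare(a, b)   // -1
{code}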



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE

2018-08-14 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580153#comment-16580153
 ] 

Xiao Li commented on SPARK-25114:
-

[~jerryshao] Another blocker for 2.3

> RecordBinaryComparator may return wrong result when subtraction between two 
> words is divisible by Integer.MAX_VALUE
> ---
>
> Key: SPARK-25114
> URL: https://issues.apache.org/jira/browse/SPARK-25114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Priority: Blocker
>  Labels: correctness
>
> It is possible for two objects to be unequal and yet for 
> RecordBinaryComparator to consider them equal, if the long values are 
> separated by Int.MaxValue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools

2018-08-14 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580140#comment-16580140
 ] 

Imran Rashid commented on SPARK-24938:
--

Cool, sounds like the info we need for making this change, then.  [~zsxwing] 
[~vanzin] do you have thoughts on this?  Any reason why MessageEncoder is 
explicitly using onheap pools, rather than the configured default netty pool?

> Understand usage of netty's onheap memory use, even with offheap pools
> --
>
> Key: SPARK-24938
> URL: https://issues.apache.org/jira/browse/SPARK-24938
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> After adding some instrumentation (using SPARK-24918 and 
> https://github.com/squito/spark-memory), we've observed that netty uses a 
> large amount of onheap memory in its pools, in addition to the expected 
> offheap memory. We should figure out why it's using that memory, and whether 
> it's really necessary.
> It might be just this one line:
> https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82
> which means that even with a small burst of messages, each arena will grow by 
> 16MB, which could lead to a 128 MB spike of an almost entirely unused pool. 
> Switching to requesting a buffer from the default pool would probably fix this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25051) where clause on dataset gives AnalysisException

2018-08-14 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25051.
-
   Resolution: Fixed
 Assignee: Marco Gaido
Fix Version/s: 2.3.3

> where clause on dataset gives AnalysisException
> ---
>
> Key: SPARK-25051
> URL: https://issues.apache.org/jira/browse/SPARK-25051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: MIK
>Assignee: Marco Gaido
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.3.3
>
>
> *schemas :*
> df1
> => id ts
> df2
> => id name country
> *code:*
> val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull)
> *error*:
> org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing 
> from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in 
> operator !Filter isnull(id#0). Attribute(s) with the same name appear in the 
> operation: id. Please check if the right attribute(s) are used.;;
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
>     at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
>     at org.apache.spark.sql.Dataset.(Dataset.scala:172)
>     at org.apache.spark.sql.Dataset.(Dataset.scala:178)
>     at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
>     at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300)
>     at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458)
>     at org.apache.spark.sql.Dataset.where(Dataset.scala:1486)
> This works fine in spark 2.2.2
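
A hedged workaround sketch (not the fix itself), assuming the intent of the 
reported pattern is "rows of df1 that have no match in df2 on id":
{code}
import org.apache.spark.sql.functions.col

// An anti join expresses the same intent without referencing df2("id")
// after the join has pruned that attribute.
val unmatched = df1.join(df2, Seq("id"), "left_anti")

// Alternatively, filter on a column that only exists on the df2 side
// (this assumes `name` is never null in df2 itself).
val alsoUnmatched = df1.join(df2, Seq("id"), "left_outer").where(col("name").isNull)
{code}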



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25051) where clause on dataset gives AnalysisException

2018-08-14 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25051:

Fix Version/s: (was: 2.3.3)
   2.3.2

> where clause on dataset gives AnalysisException
> ---
>
> Key: SPARK-25051
> URL: https://issues.apache.org/jira/browse/SPARK-25051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: MIK
>Assignee: Marco Gaido
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.3.2
>
>
> *schemas :*
> df1
> => id ts
> df2
> => id name country
> *code:*
> val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull)
> *error*:
> org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing 
> from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in 
> operator !Filter isnull(id#0). Attribute(s) with the same name appear in the 
> operation: id. Please check if the right attribute(s) are used.;;
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
>     at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
>     at org.apache.spark.sql.Dataset.(Dataset.scala:172)
>     at org.apache.spark.sql.Dataset.(Dataset.scala:178)
>     at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
>     at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300)
>     at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458)
>     at org.apache.spark.sql.Dataset.where(Dataset.scala:1486)
> This works fine in spark 2.2.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools

2018-08-14 Thread Nihar Sheth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580113#comment-16580113
 ] 

Nihar Sheth commented on SPARK-24938:
-

Your expectation is correct: the offheap pools remained at 16 MB after adding 
this change. There does appear to be a tiny corresponding change in the number 
of offheap bytes used, so I agree that netty is only using the offheap buffers.

> Understand usage of netty's onheap memory use, even with offheap pools
> --
>
> Key: SPARK-24938
> URL: https://issues.apache.org/jira/browse/SPARK-24938
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> After adding some instrumentation (using SPARK-24918 and 
> https://github.com/squito/spark-memory), we've observed that netty uses a 
> large amount of onheap memory in its pools, in addition to the expected 
> offheap memory. We should figure out why it's using that memory, and whether 
> it's really necessary.
> It might be just this one line:
> https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82
> which means that even with a small burst of messages, each arena will grow by 
> 16MB, which could lead to a 128 MB spike of an almost entirely unused pool. 
> Switching to requesting a buffer from the default pool would probably fix this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25115) Eliminate extra memory copy done when a ByteBuf is used that is backed by > 1 ByteBuffer.

2018-08-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25115:


Assignee: Apache Spark

> Eliminate extra memory copy done when a ByteBuf is used that is backed by 
> > 1 ByteBuffer.
> -
>
> Key: SPARK-25115
> URL: https://issues.apache.org/jira/browse/SPARK-25115
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.3.1
>Reporter: Norman Maurer
>Assignee: Apache Spark
>Priority: Major
>
> Calling ByteBuf.nioBuffer(...) can be costly when the ByteBuf is backed by 
> more than one ByteBuffer. In this case it makes more sense to call 
> ByteBuf.nioBuffers(...) and iterate over the returned ByteBuffer[].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25115) Eliminate extra memory copy done when a ByteBuf is used that is backed by > 1 ByteBuffer.

2018-08-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580106#comment-16580106
 ] 

Apache Spark commented on SPARK-25115:
--

User 'normanmaurer' has created a pull request for this issue:
https://github.com/apache/spark/pull/22105

> Eliminate extra memory copy done when a ByteBuf is used that is backed by 
> > 1 ByteBuffer.
> -
>
> Key: SPARK-25115
> URL: https://issues.apache.org/jira/browse/SPARK-25115
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.3.1
>Reporter: Norman Maurer
>Priority: Major
>
> Calling ByteBuf.nioBuffer(...) can be costly when the ByteBuf is backed by 
> more than one ByteBuffer. In this case it makes more sense to call 
> ByteBuf.nioBuffers(...) and iterate over the returned ByteBuffer[].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25115) Eliminate extra memory copy done when a ByteBuf is used that is backed by > 1 ByteBuffer.

2018-08-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25115:


Assignee: (was: Apache Spark)

> Eliminate extra memory copy done when a ByteBuf is used that is backed by 
> > 1 ByteBuffer.
> -
>
> Key: SPARK-25115
> URL: https://issues.apache.org/jira/browse/SPARK-25115
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.3.1
>Reporter: Norman Maurer
>Priority: Major
>
> Calling ByteBuf.nioBuffer(...) can be costly when the ByteBuf is backed by 
> more than one ByteBuffer. In this case it makes more sense to call 
> ByteBuf.nioBuffers(...) and iterate over the returned ByteBuffer[].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25115) Eliminate extra memory copy done when a ByteBuf is used that is backed by > 1 ByteBuffer.

2018-08-14 Thread Norman Maurer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580087#comment-16580087
 ] 

Norman Maurer commented on SPARK-25115:
---

I opened the following PR to fix it:

https://github.com/apache/spark/pull/22105

> Eliminate extra memory copy done when a ByteBuf is used that is backed by 
> > 1 ByteBuffer.
> -
>
> Key: SPARK-25115
> URL: https://issues.apache.org/jira/browse/SPARK-25115
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.3.1
>Reporter: Norman Maurer
>Priority: Major
>
> Calling ByteBuf.nioBuffer(...) can be costly when the ByteBuf is backed by 
> more than one ByteBuffer. In this case it makes more sense to call 
> ByteBuf.nioBuffers(...) and iterate over the returned ByteBuffer[].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25115) Eliminate extra memory copy done when a ByteBuf is used that is backed by > 1 ByteBuffer.

2018-08-14 Thread Norman Maurer (JIRA)
Norman Maurer created SPARK-25115:
-

 Summary: Eliminate extra memory copy done when a ByteBuf is 
used that is backed by > 1 ByteBuffer.
 Key: SPARK-25115
 URL: https://issues.apache.org/jira/browse/SPARK-25115
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 2.3.1
Reporter: Norman Maurer


Calling ByteBuf.nioBuffer(...) can be costly when the ByteBuf is backed by more 
than one ByteBuffer. In this case it makes more sense to call 
ByteBuf.nioBuffers(...) and iterate over the returned ByteBuffer[].
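
A rough sketch of the idea, assuming netty's ByteBuf API (this is not the actual 
Spark patch):
{code}
import java.nio.channels.WritableByteChannel
import io.netty.buffer.ByteBuf

// For a composite ByteBuf, nioBuffer() may have to merge everything into one
// ByteBuffer (an extra copy), while nioBuffers() exposes the backing buffers
// as-is so they can be written out one by one.
def writeTo(buf: ByteBuf, channel: WritableByteChannel): Unit = {
  if (buf.nioBufferCount() == 1) {
    val bb = buf.nioBuffer()
    while (bb.hasRemaining) channel.write(bb)
  } else {
    buf.nioBuffers().foreach { bb =>
      while (bb.hasRemaining) channel.write(bb)
    }
  }
}
{code}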



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging

2018-08-14 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580064#comment-16580064
 ] 

Marcelo Vanzin commented on SPARK-24787:


In that case it might be good to only use hsync in "safer" contexts (i.e. not 
in event storms like task updates).
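
A hedged sketch of that idea, assuming Hadoop's FSDataOutputStream API (not the 
actual event logging code):
{code}
import org.apache.hadoop.fs.FSDataOutputStream

// Cheap hflush() for high-frequency events, the expensive hsync() only in
// low-frequency, "safer" contexts such as stage or job boundaries.
def flushEventLog(out: FSDataOutputStream, highFrequencyEvent: Boolean): Unit = {
  if (highFrequencyEvent) {
    out.hflush()  // push to the datanodes' buffers; fast, less durable
  } else {
    out.hsync()   // force to disk; durable, but too slow for event storms
  }
}
{code}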

> Events being dropped at an alarming rate due to hsync being slow for 
> eventLogging
> -
>
> Key: SPARK-24787
> URL: https://issues.apache.org/jira/browse/SPARK-24787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Sanket Reddy
>Priority: Minor
>
> [https://github.com/apache/spark/pull/16924/files] updates the length of the 
> inprogress files so that the history server stays responsive.
> However, we have a production job with 6 tasks per stage, and because hsync 
> is slow it starts dropping events, leaving the history server with wrong 
> stats.
> A viable solution is to sync less frequently, or to make the sync frequency 
> configurable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-08-14 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580062#comment-16580062
 ] 

Marcelo Vanzin commented on SPARK-24771:


Asking is a good start. But I have anecdotal evidence that there are quite a 
few people who use Avro/RDDs... not sure whether they're planning to move to 
SQL any time soon.

In any case, it would be good to know exactly what breaks so that we can have a 
proper release note, instead of just dumping the problem on the user's lap.

> Upgrade AVRO version from 1.7.7 to 1.8
> --
>
> Key: SPARK-24771
> URL: https://issues.apache.org/jira/browse/SPARK-24771
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>  Labels: release-notes
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22236) CSV I/O: does not respect RFC 4180

2018-08-14 Thread Ondrej Kokes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579974#comment-16579974
 ] 

Ondrej Kokes commented on SPARK-22236:
--

Multiline=true by default would cause some slowdown, but data quality would 
either increase or stay the same - it would never go down. So the discussion 
there is mostly about performance.

With escape changes, while we would see improvements in data quality on the 
input side, *people with preexisting datasets exported by Spark would suffer 
(unexpected) data loss,* because the escaping strategy would potentially differ 
from the time the data was written.

I think that's a somewhat more important aspect to consider.
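
For reference, a small sketch (in Scala; the report uses Python) of the explicit 
options that already make Spark's CSV reader and writer round-trip RFC 4180 
style quoting, without changing any defaults:
{code}
// Assumes a running SparkSession bound to `spark` (e.g. in spark-shell).
val df = spark.read
  .option("quote", "\"")
  .option("escape", "\"")   // treat "" as an escaped quote on input
  .csv("testfile.csv")

df.write
  .option("quote", "\"")
  .option("escape", "\"")   // write "" instead of \" on output
  .mode("overwrite")
  .csv("testout.csv")
{code}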

> CSV I/O: does not respect RFC 4180
> --
>
> Key: SPARK-22236
> URL: https://issues.apache.org/jira/browse/SPARK-22236
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.2.0
>Reporter: Ondrej Kokes
>Priority: Minor
>
> When reading or writing CSV files with Spark, double quotes are escaped with 
> a backslash by default. However, the appropriate behaviour as set out by RFC 
> 4180 (and adhered to by many software packages) is to escape using a second 
> double quote.
> This piece of Python code demonstrates the issue
> {code}
> import csv
> with open('testfile.csv', 'w') as f:
> cw = csv.writer(f)
> cw.writerow(['a 2.5" drive', 'another column'])
> cw.writerow(['a "quoted" string', '"quoted"'])
> cw.writerow([1,2])
> with open('testfile.csv') as f:
> print(f.read())
> # "a 2.5"" drive",another column
> # "a ""quoted"" string","""quoted"""
> # 1,2
> spark.read.csv('testfile.csv').collect()
> # [Row(_c0='"a 2.5"" drive"', _c1='another column'),
> #  Row(_c0='"a ""quoted"" string"', _c1='"""quoted"""'),
> #  Row(_c0='1', _c1='2')]
> # explicitly stating the escape character fixed the issue
> spark.read.option('escape', '"').csv('testfile.csv').collect()
> # [Row(_c0='a 2.5" drive', _c1='another column'),
> #  Row(_c0='a "quoted" string', _c1='"quoted"'),
> #  Row(_c0='1', _c1='2')]
> {code}
> The same applies to writes, where reading the file written by Spark may 
> result in garbage.
> {code}
> df = spark.read.option('escape', '"').csv('testfile.csv') # reading the file 
> correctly
> df.write.format("csv").save('testout.csv')
> with open('testout.csv/part-csv') as f:
> cr = csv.reader(f)
> print(next(cr))
> print(next(cr))
> # ['a 2.5\\ drive"', 'another column']
> # ['a \\quoted\\" string"', '\\quoted\\""']
> {code}
> The culprit is in 
> [CSVOptions.scala|https://github.com/apache/spark/blob/7d0a3ef4ced9684457ad6c5924c58b95249419e1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L91],
>  where the default escape character is overridden.
> While it's possible to work with CSV files in a "compatible" manner, it would 
> be useful if Spark had sensible defaults that conform to the above-mentioned 
> RFC (as well as W3C recommendations). I realise this would be a breaking 
> change and thus if accepted, it would probably need to result in a warning 
> first, before moving to a new default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25051) where clause on dataset gives AnalysisException

2018-08-14 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25051:

Component/s: (was: Spark Core)

> where clause on dataset gives AnalysisException
> ---
>
> Key: SPARK-25051
> URL: https://issues.apache.org/jira/browse/SPARK-25051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: MIK
>Priority: Blocker
>  Labels: correctness
>
> *schemas :*
> df1
> => id ts
> df2
> => id name country
> *code:*
> val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull)
> *error*:
> org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing 
> from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in 
> operator !Filter isnull(id#0). Attribute(s) with the same name appear in the 
> operation: id. Please check if the right attribute(s) are used.;;
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
>     at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
>     at org.apache.spark.sql.Dataset.(Dataset.scala:172)
>     at org.apache.spark.sql.Dataset.(Dataset.scala:178)
>     at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
>     at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300)
>     at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458)
>     at org.apache.spark.sql.Dataset.where(Dataset.scala:1486)
> This works fine in spark 2.2.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25051) where clause on dataset gives AnalysisException

2018-08-14 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579967#comment-16579967
 ] 

Xiao Li commented on SPARK-25051:
-

[~mgaido] This breaks the backport rule. We are unable to remove 
AnalysisBarrier from 2.3. AnalysisBarrier is a nightmare to many. Sorry for 
that. 

> where clause on dataset gives AnalysisException
> ---
>
> Key: SPARK-25051
> URL: https://issues.apache.org/jira/browse/SPARK-25051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: MIK
>Priority: Blocker
>  Labels: correctness
>
> *schemas :*
> df1
> => id ts
> df2
> => id name country
> *code:*
> val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull)
> *error*:
> org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing 
> from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in 
> operator !Filter isnull(id#0). Attribute(s) with the same name appear in the 
> operation: id. Please check if the right attribute(s) are used.;;
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
>     at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
>     at org.apache.spark.sql.Dataset.(Dataset.scala:172)
>     at org.apache.spark.sql.Dataset.(Dataset.scala:178)
>     at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
>     at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300)
>     at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458)
>     at org.apache.spark.sql.Dataset.where(Dataset.scala:1486)
> This works fine in spark 2.2.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24561) User-defined window functions with pandas udf (bounded window)

2018-08-14 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579965#comment-16579965
 ] 

Li Jin commented on SPARK-24561:


I am looking into this. Early investigation: 
https://docs.google.com/document/d/14EjeY5z4-NC27-SmIP9CsMPCANeTcvxN44a7SIJtZPc/edit

> User-defined window functions with pandas udf (bounded window)
> --
>
> Key: SPARK-24561
> URL: https://issues.apache.org/jira/browse/SPARK-24561
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Li Jin
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25083) remove the type erasure hack in data source scan

2018-08-14 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579961#comment-16579961
 ] 

Ryan Blue commented on SPARK-25083:
---

[~cloud_fan], what release is this targeting?

> remove the type erasure hack in data source scan
> 
>
> Key: SPARK-25083
> URL: https://issues.apache.org/jira/browse/SPARK-25083
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> It's hacky to pretend an `RDD[ColumnarBatch]` is an `RDD[InternalRow]`. We 
> should make the type explicit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24721) Failed to use PythonUDF with literal inputs in filter with data sources

2018-08-14 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-24721:
---
Component/s: SQL

> Failed to use PythonUDF with literal inputs in filter with data sources
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
> if label == 1.0:
>   return random.uniform(0.5, 1.0)
> else:
>   return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
> 
> if random.random() >= ratio:
>   return 1.0
> else:
>   return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> 

[jira] [Updated] (SPARK-24721) Failed to use PythonUDF with literal inputs in filter with data sources

2018-08-14 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-24721:
---
Issue Type: Bug  (was: Sub-task)
Parent: (was: SPARK-22216)

> Failed to use PythonUDF with literal inputs in filter with data sources
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
> if label == 1.0:
>   return random.uniform(0.5, 1.0)
> else:
>   return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
> 
> if random.random() >= ratio:
>   return 1.0
> else:
>   return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> 

[jira] [Comment Edited] (SPARK-24721) Failed to use PythonUDF with literal inputs in filter with data sources

2018-08-14 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579956#comment-16579956
 ] 

Li Jin edited comment on SPARK-24721 at 8/14/18 3:26 PM:
-

Updated Jira title to reflect the actual issue


was (Author: icexelloss):
Updates Jira title to reflect the actual issue

> Failed to use PythonUDF with literal inputs in filter with data sources
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
> if label == 1.0:
>   return random.uniform(0.5, 1.0)
> else:
>   return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
> 
> if random.random() >= ratio:
>   return 1.0
> else:
>   return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> 

[jira] [Commented] (SPARK-24721) Failed to use PythonUDF with literal inputs in filter with data sources

2018-08-14 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579956#comment-16579956
 ] 

Li Jin commented on SPARK-24721:


Updates Jira title to reflect the actual issue

> Failed to use PythonUDF with literal inputs in filter with data sources
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
> if label == 1.0:
>   return random.uniform(0.5, 1.0)
> else:
>   return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
> 
> if random.random() >= ratio:
>   return 1.0
> else:
>   return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> 

[jira] [Updated] (SPARK-24721) Failed to use PythonUDF with literal inputs in filter with data sources

2018-08-14 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-24721:
---
Summary: Failed to use PythonUDF with literal inputs in filter with data 
sources  (was: Failed to call PythonUDF whose input is the output of another 
PythonUDF)

> Failed to use PythonUDF with literal inputs in filter with data sources
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
> if label == 1.0:
>   return random.uniform(0.5, 1.0)
> else:
>   return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
> 
> if random.random() >= ratio:
>   return 1.0
> else:
>   return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> 

[jira] [Commented] (SPARK-24941) Add RDDBarrier.coalesce() function

2018-08-14 Thread Jiang Xingbo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579930#comment-16579930
 ] 

Jiang Xingbo commented on SPARK-24941:
--

Shall we add something like `spark.default.parallelism`? It may not be a fixed 
number but rather a fraction, so that any barrier stage launches fewer tasks 
than fraction * totalCores?

> Add RDDBarrier.coalesce() function
> --
>
> Key: SPARK-24941
> URL: https://issues.apache.org/jira/browse/SPARK-24941
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> https://github.com/apache/spark/pull/21758#discussion_r204917245
> The number of partitions from the input data can be unexpectedly large, e.g. 
> if you do
> {code}
> sc.textFile(...).barrier().mapPartitions()
> {code}
> The number of input partitions is based on the HDFS input splits. We shall 
> provide a way in RDDBarrier to enable users to specify the number of tasks in 
> a barrier stage, maybe something like RDDBarrier.coalesce(numPartitions: Int).
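
Until such an API exists, a rough workaround is to cap the partition count 
before entering barrier mode. A minimal sketch (the input path and the target 
of 64 partitions are illustrative values, not recommendations):

{code}
// Sketch of a workaround until RDDBarrier.coalesce() exists: cap the number
// of tasks by coalescing before calling barrier().
val result = sc.textFile("hdfs:///path/to/input")  // partition count driven by HDFS splits
  .coalesce(64)                                    // cap the barrier stage at 64 tasks
  .barrier()                                       // enter barrier execution mode
  .mapPartitions { iter =>
    // all 64 tasks of this stage are scheduled together
    iter.map(_.length)
  }
  .collect()
{code}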



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24721) Failed to call PythonUDF whose input is the output of another PythonUDF

2018-08-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579883#comment-16579883
 ] 

Apache Spark commented on SPARK-24721:
--

User 'icexelloss' has created a pull request for this issue:
https://github.com/apache/spark/pull/22104

> Failed to call PythonUDF whose input is the output of another PythonUDF
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
> if label == 1.0:
>   return random.uniform(0.5, 1.0)
> else:
>   return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
> 
> if random.random() >= ratio:
>   return 1.0
> else:
>   return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> 

[jira] [Assigned] (SPARK-24721) Failed to call PythonUDF whose input is the output of another PythonUDF

2018-08-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24721:


Assignee: (was: Apache Spark)

> Failed to call PythonUDF whose input is the output of another PythonUDF
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
> if label == 1.0:
>   return random.uniform(0.5, 1.0)
> else:
>   return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
> 
> if random.random() >= ratio:
>   return 1.0
> else:
>   return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> 

[jira] [Assigned] (SPARK-24721) Failed to call PythonUDF whose input is the output of another PythonUDF

2018-08-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24721:


Assignee: Apache Spark

> Failed to call PythonUDF whose input is the output of another PythonUDF
> ---
>
> Key: SPARK-24721
> URL: https://issues.apache.org/jira/browse/SPARK-24721
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> {code}
> import random
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def random_probability(label):
> if label == 1.0:
>   return random.uniform(0.5, 1.0)
> else:
>   return random.uniform(0.0, 0.4999)
> def randomize_label(ratio):
> 
> if random.random() >= ratio:
>   return 1.0
> else:
>   return 0.0
> random_probability = udf(random_probability, DoubleType())
> randomize_label = udf(randomize_label, DoubleType())
> spark.range(10).write.mode("overwrite").format('csv').save("/tmp/tab3")
> babydf = spark.read.csv("/tmp/tab3")
> data_modified_label = babydf.withColumn(
>   'random_label', randomize_label(lit(1 - 0.1))
> )
> data_modified_random = data_modified_label.withColumn(
>   'random_probability', 
>   random_probability(col('random_label'))
> )
> data_modified_label.filter(col('random_label') == 0).show()
> {code}
> The above code will generate the following exception:
> {code}
> Py4JJavaError: An error occurred while calling o446.showString.
> : java.lang.RuntimeException: Invalid PythonUDF randomize_label(0.9), 
> requires attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:166)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:165)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:165)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:116)
>   at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:112)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:325)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:307)
>   at 
> 

[jira] [Assigned] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE

2018-08-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25114:


Assignee: (was: Apache Spark)

> RecordBinaryComparator may return wrong result when subtraction between two 
> words is divisible by Integer.MAX_VALUE
> ---
>
> Key: SPARK-25114
> URL: https://issues.apache.org/jira/browse/SPARK-25114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Priority: Blocker
>  Labels: correctness
>
> It is possible for two objects to be unequal and yet we consider them as 
> equal within RecordBinaryComparator, if the long values are separated by 
> Int.MaxValue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE

2018-08-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579880#comment-16579880
 ] 

Apache Spark commented on SPARK-25114:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/22101

> RecordBinaryComparator may return wrong result when subtraction between two 
> words is divisible by Integer.MAX_VALUE
> ---
>
> Key: SPARK-25114
> URL: https://issues.apache.org/jira/browse/SPARK-25114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Priority: Blocker
>  Labels: correctness
>
> It is possible for two objects to be unequal and yet we consider them as 
> equal within RecordBinaryComparator, if the long values are separated by 
> Int.MaxValue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE

2018-08-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25114:


Assignee: Apache Spark

> RecordBinaryComparator may return wrong result when subtraction between two 
> words is divisible by Integer.MAX_VALUE
> ---
>
> Key: SPARK-25114
> URL: https://issues.apache.org/jira/browse/SPARK-25114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Assignee: Apache Spark
>Priority: Blocker
>  Labels: correctness
>
> It is possible for two objects to be unequal and yet we consider them as 
> equal within RecordBinaryComparator, if the long values are separated by 
> Int.MaxValue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE

2018-08-14 Thread Jiang Xingbo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579879#comment-16579879
 ] 

Jiang Xingbo commented on SPARK-25114:
--

I created https://github.com/apache/spark/pull/22101 for this.

> RecordBinaryComparator may return wrong result when subtraction between two 
> words is divisible by Integer.MAX_VALUE
> ---
>
> Key: SPARK-25114
> URL: https://issues.apache.org/jira/browse/SPARK-25114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Priority: Blocker
>  Labels: correctness
>
> It is possible for two objects to be unequal and yet we consider them as 
> equal within RecordBinaryComparator, if the long values are separated by 
> Int.MaxValue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE

2018-08-14 Thread Jiang Xingbo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiang Xingbo updated SPARK-25114:
-
Labels: correctness  (was: )

> RecordBinaryComparator may return wrong result when subtraction between two 
> words is divisible by Integer.MAX_VALUE
> ---
>
> Key: SPARK-25114
> URL: https://issues.apache.org/jira/browse/SPARK-25114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Priority: Blocker
>  Labels: correctness
>
> It is possible for two objects to be unequal and yet we consider them as 
> equal within RecordBinaryComparator, if the long values are separated by 
> Int.MaxValue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE

2018-08-14 Thread Jiang Xingbo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiang Xingbo updated SPARK-25114:
-
Priority: Blocker  (was: Major)

> RecordBinaryComparator may return wrong result when subtraction between two 
> words is divisible by Integer.MAX_VALUE
> ---
>
> Key: SPARK-25114
> URL: https://issues.apache.org/jira/browse/SPARK-25114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Priority: Blocker
>  Labels: correctness
>
> It is possible for two objects to be unequal and yet we consider them as 
> equal within RecordBinaryComparator, if the long values are separated by 
> Int.MaxValue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE

2018-08-14 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-25114:


 Summary: RecordBinaryComparator may return wrong result when 
subtraction between two words is divisible by Integer.MAX_VALUE
 Key: SPARK-25114
 URL: https://issues.apache.org/jira/browse/SPARK-25114
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Jiang Xingbo


It is possible for two objects to be unequal and yet we consider them as equal 
within RecordBinaryComparator, if the long values are separated by Int.MaxValue.
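
For illustration, a standalone sketch of the failure mode in the title, assuming 
a comparator that reduces the long difference modulo Integer.MAX_VALUE before 
truncating to an int; the actual RecordBinaryComparator code may differ in detail:

{code}
// Two distinct words whose difference is an exact multiple of Int.MaxValue
// compare as "equal" under a subtraction/modulo based comparator.
def suspectCompare(v1: Long, v2: Long): Int = ((v1 - v2) % Int.MaxValue).toInt
def explicitCompare(v1: Long, v2: Long): Int = java.lang.Long.compare(v1, v2)

val a = 42L
val b = a + Int.MaxValue

assert(a != b)
assert(suspectCompare(a, b) == 0)    // wrongly reports equality
assert(explicitCompare(a, b) < 0)    // explicit comparison keeps the ordering
{code}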



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging

2018-08-14 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579854#comment-16579854
 ] 

Thomas Graves commented on SPARK-24787:
---

Yes, it was caused by hsync: hsync has to go to the namenode in addition to the 
datanode, whereas hflush is a datanode-only operation. We saw a huge increase in 
dropped events with this on large jobs; we reverted the change, went back to only 
hflush, and the dropping stopped.

I talked to one of our HDFS experts and he said hsync is expensive. It might 
depend on how loaded your HDFS cluster is.
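
For reference, a minimal sketch of the two calls being compared, using Hadoop's 
FSDataOutputStream API directly (the path and payload are illustrative; this is 
not the Spark event-logging code itself):

{code}
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val out = fs.create(new Path("/tmp/eventlog-sketch.inprogress"))

out.write("{\"Event\":\"...\"}\n".getBytes(StandardCharsets.UTF_8))
out.hflush()  // cheaper: push buffered bytes to the datanode pipeline
out.hsync()   // more expensive: also force the data to disk on the datanodes
out.close()
{code}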

> Events being dropped at an alarming rate due to hsync being slow for 
> eventLogging
> -
>
> Key: SPARK-24787
> URL: https://issues.apache.org/jira/browse/SPARK-24787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Sanket Reddy
>Priority: Minor
>
> [https://github.com/apache/spark/pull/16924/files] updates the length of the 
> in-progress files, allowing the history server to stay responsive.
> However, we have a production job with 6 tasks per stage, and because hsync 
> is slow it starts dropping events, leaving the history server with wrong 
> stats.
> A viable solution is to not sync so frequently, or to make the sync frequency 
> configurable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24918) Executor Plugin API

2018-08-14 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579851#comment-16579851
 ] 

Thomas Graves commented on SPARK-24918:
---

Personally I like the explicit config better (spark.executor.plugins). Opt-out, 
in my opinion, is easier for users to mess up: someone includes a jar from some 
other group and doesn't realize it registers a plugin via ServiceLoader.

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: SPIP, memory-analysis
>
> It would be nice if we could specify an arbitrary class to run within each 
> executor for debugging and instrumentation.  It's hard to do this currently 
> because:
> a) you have no idea when executors will come and go with DynamicAllocation, 
> so you don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your spark 
> app itself to run a special task to "install" the plugin, which is often 
> tough in production cases when those maintaining regularly running 
> applications might not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a 
> debugging context to understand memory use, just by re-running an application 
> with extra command line arguments (as opposed to rebuilding spark).
> I think one tricky part here is just deciding the API, and how it's versioned. 
>  Does it just get created when the executor starts, and that's it?  Or does it 
> get more specific events, like task start, task end, etc?  Would we ever add 
> more events?  It should definitely be a {{DeveloperApi}}, so breaking 
> compatibility would be allowed ... but still should be avoided.  We could 
> create a base class that has no-op implementations, or explicitly version 
> everything.
> Note that this is not needed in the driver as we already have SparkListeners 
> (even if you don't care about the SparkListenerEvents and just want to 
> inspect objects in the JVM, it's still good enough).
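
To make the shape of the proposal concrete, a rough sketch of a plugin base with 
no-op defaults; the trait name and the per-task hooks are illustrative rather 
than a finalized API, and the config key is the one suggested earlier in this 
thread:

{code}
// Rough sketch only: names, hooks, and the config key are illustrative.
// A base trait with no-op defaults lets new events be added later without
// breaking existing plugins.
trait ExecutorPluginSketch {
  def init(): Unit = {}                      // called once when the executor starts
  def onTaskStart(taskId: Long): Unit = {}   // hypothetical per-task hooks
  def onTaskEnd(taskId: Long): Unit = {}
  def shutdown(): Unit = {}                  // called when the executor shuts down
}

// Example debugging plugin: dump heap usage when the executor comes up.
class MemoryMonitorPlugin extends ExecutorPluginSketch {
  override def init(): Unit = {
    val rt = Runtime.getRuntime
    println(s"executor heap used=${rt.totalMemory - rt.freeMemory} max=${rt.maxMemory}")
  }
}

// Enabled explicitly, per the config-based proposal:
//   --conf spark.executor.plugins=com.example.MemoryMonitorPlugin
{code}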



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25051) where clause on dataset gives AnalysisException

2018-08-14 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-25051:
--
Priority: Blocker  (was: Major)

> where clause on dataset gives AnalysisException
> ---
>
> Key: SPARK-25051
> URL: https://issues.apache.org/jira/browse/SPARK-25051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: MIK
>Priority: Blocker
>  Labels: correctness
>
> *schemas :*
> df1
> => id ts
> df2
> => id name country
> *code:*
> val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull)
> *error*:
> org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing 
> from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in 
> operator !Filter isnull(id#0). Attribute(s) with the same name appear in the 
> operation: id. Please check if the right attribute(s) are used.;;
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
>     at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:172)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:178)
>     at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
>     at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300)
>     at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458)
>     at org.apache.spark.sql.Dataset.where(Dataset.scala:1486)
> This works fine in spark 2.2.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23938) High-order function: map_zip_with(map<K, V1>, map<K, V2>, function<K, V1, V2, V3>) → map<K, V3>

2018-08-14 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-23938.
---
   Resolution: Fixed
 Assignee: Marek Novotny
Fix Version/s: 2.4.0

Issue resolved by pull request 22017
https://github.com/apache/spark/pull/22017

> High-order function: map_zip_with(map<K, V1>, map<K, V2>, function<K, V1, V2, V3>) → map<K, V3>
> ---
>
> Key: SPARK-23938
> URL: https://issues.apache.org/jira/browse/SPARK-23938
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marek Novotny
>Priority: Major
> Fix For: 2.4.0
>
>
> Ref:  https://prestodb.io/docs/current/functions/map.html
> Merges the two given maps into a single map by applying function to the pair 
> of values with the same key. For keys only presented in one map, NULL will be 
> passed as the value for the missing key.
> {noformat}
> SELECT map_zip_with(MAP(ARRAY[1, 2, 3], ARRAY['a', 'b', 'c']), -- {1 -> ad, 2 
> -> be, 3 -> cf}
> MAP(ARRAY[1, 2, 3], ARRAY['d', 'e', 'f']),
> (k, v1, v2) -> concat(v1, v2));
> SELECT map_zip_with(MAP(ARRAY['k1', 'k2'], ARRAY[1, 2]), -- {k1 -> ROW(1, 
> null), k2 -> ROW(2, 4), k3 -> ROW(null, 9)}
> MAP(ARRAY['k2', 'k3'], ARRAY[4, 9]),
> (k, v1, v2) -> (v1, v2));
> SELECT map_zip_with(MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 8, 27]), -- {a -> a1, 
> b -> b4, c -> c9}
> MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 2, 3]),
> (k, v1, v2) -> k || CAST(v1/v2 AS VARCHAR));
> {noformat}
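
For reference, roughly how the first example above can be written against a 
Spark build that includes this patch (e.g. a 2.4.0 build); the `spark` session 
variable is assumed:

{code}
// Expected result: {1 -> "ad", 2 -> "be", 3 -> "cf"}
spark.sql("""
  SELECT map_zip_with(
           map(1, 'a', 2, 'b', 3, 'c'),
           map(1, 'd', 2, 'e', 3, 'f'),
           (k, v1, v2) -> concat(v1, v2)) AS merged
""").show(truncate = false)
{code}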



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24411) Adding native Java tests for `isInCollection`

2018-08-14 Thread Aleksei Izmalkin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579650#comment-16579650
 ] 

Aleksei Izmalkin commented on SPARK-24411:
--

I will work on this issue.

> Adding native Java tests for `isInCollection`
> -
>
> Key: SPARK-24411
> URL: https://issues.apache.org/jira/browse/SPARK-24411
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Minor
>  Labels: starter
>
> In the past, some of our Java APIs have been difficult to call from Java. We 
> should add tests in Java directly to make sure it works.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25113) Add logging to CodeGenerator when any generated method's bytecode size goes above HugeMethodLimit

2018-08-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579645#comment-16579645
 ] 

Apache Spark commented on SPARK-25113:
--

User 'rednaxelafx' has created a pull request for this issue:
https://github.com/apache/spark/pull/22103

> Add logging to CodeGenerator when any generated method's bytecode size goes 
> above HugeMethodLimit
> -
>
> Key: SPARK-25113
> URL: https://issues.apache.org/jira/browse/SPARK-25113
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Major
>
> Add logging to help collect statistics on how often real world usage sees the 
> {{CodeGenerator}} generating methods whose bytecode size goes above the 8000 
> bytes (HugeMethodLimit) threshold.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25113) Add logging to CodeGenerator when any generated method's bytecode size goes above HugeMethodLimit

2018-08-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25113:


Assignee: (was: Apache Spark)

> Add logging to CodeGenerator when any generated method's bytecode size goes 
> above HugeMethodLimit
> -
>
> Key: SPARK-25113
> URL: https://issues.apache.org/jira/browse/SPARK-25113
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Major
>
> Add logging to help collect statistics on how often real world usage sees the 
> {{CodeGenerator}} generating methods whose bytecode size goes above the 8000 
> bytes (HugeMethodLimit) threshold.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25113) Add logging to CodeGenerator when any generated method's bytecode size goes above HugeMethodLimit

2018-08-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25113:


Assignee: Apache Spark

> Add logging to CodeGenerator when any generated method's bytecode size goes 
> above HugeMethodLimit
> -
>
> Key: SPARK-25113
> URL: https://issues.apache.org/jira/browse/SPARK-25113
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Assignee: Apache Spark
>Priority: Major
>
> Add logging to help collect statistics on how often real world usage sees the 
> {{CodeGenerator}} generating methods whose bytecode size goes above the 8000 
> bytes (HugeMethodLimit) threshold.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25113) Add logging to CodeGenerator when any generated method's bytecode size goes above HugeMethodLimit

2018-08-14 Thread Kris Mok (JIRA)
Kris Mok created SPARK-25113:


 Summary: Add logging to CodeGenerator when any generated method's 
bytecode size goes above HugeMethodLimit
 Key: SPARK-25113
 URL: https://issues.apache.org/jira/browse/SPARK-25113
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Kris Mok


Add logging to help collect statistics on how often real world usage sees the 
{{CodeGenerator}} generating methods whose bytecode size goes above the 8000 
bytes (HugeMethodLimit) threshold.
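
A sketch of the kind of check being proposed; the method and constant names are 
illustrative, and the real hook would sit inside Spark's CodeGenerator:

{code}
// HotSpot skips JIT compilation of methods larger than 8000 bytes of bytecode
// (-XX:-DontCompileHugeMethods / HugeMethodLimit), so generated methods above
// that size silently run interpreted.
val HugeMethodLimit = 8000

def warnIfHuge(className: String, methodName: String, byteCodeSize: Int): Unit = {
  if (byteCodeSize > HugeMethodLimit) {
    println(s"Generated method $className.$methodName is $byteCodeSize bytes, " +
      s"above the huge method limit of $HugeMethodLimit and will not be JIT-compiled.")
  }
}
{code}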



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25102) Write Spark version information to Parquet file footers

2018-08-14 Thread Nikita Poberezkin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579622#comment-16579622
 ] 

Nikita Poberezkin commented on SPARK-25102:
---

I will work on this issue

> Write Spark version information to Parquet file footers
> ---
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> -PARQUET-352- added support for the "writer.model.name" property in the 
> Parquet metadata to identify the object model (application) that wrote the 
> file.
> The easiest way to write this property is by overriding getName() of 
> org.apache.parquet.hadoop.api.WriteSupport. In Spark, this would mean adding 
> getName() to the 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport class.
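
A sketch of the suggested override; getName() is the real PARQUET-352 hook, 
while the class name, the returned string, and the stubbed members are 
illustrative:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.api.WriteSupport
import org.apache.parquet.io.api.RecordConsumer

// Parquet stores the value returned by getName() as the "writer.model.name"
// footer property, identifying the application that wrote the file.
class VersionTaggingWriteSupport extends WriteSupport[AnyRef] {
  override def getName: String = "spark"  // could also embed the Spark version

  // The remaining members are stubbed here; Spark's ParquetWriteSupport
  // already implements them for real rows.
  override def init(conf: Configuration): WriteSupport.WriteContext = ???
  override def prepareForWrite(consumer: RecordConsumer): Unit = ???
  override def write(record: AnyRef): Unit = ???
}
{code}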



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25051) where clause on dataset gives AnalysisException

2018-08-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25051:


Assignee: Apache Spark

> where clause on dataset gives AnalysisException
> ---
>
> Key: SPARK-25051
> URL: https://issues.apache.org/jira/browse/SPARK-25051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: MIK
>Assignee: Apache Spark
>Priority: Major
>  Labels: correctness
>
> *schemas :*
> df1
> => id ts
> df2
> => id name country
> *code:*
> val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull)
> *error*:
> org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing 
> from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in 
> operator !Filter isnull(id#0). Attribute(s) with the same name appear in the 
> operation: id. Please check if the right attribute(s) are used.;;
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
>     at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:172)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:178)
>     at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
>     at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300)
>     at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458)
>     at org.apache.spark.sql.Dataset.where(Dataset.scala:1486)
> This works fine in spark 2.2.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25051) where clause on dataset gives AnalysisException

2018-08-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579526#comment-16579526
 ] 

Apache Spark commented on SPARK-25051:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/22102

> where clause on dataset gives AnalysisException
> ---
>
> Key: SPARK-25051
> URL: https://issues.apache.org/jira/browse/SPARK-25051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: MIK
>Priority: Major
>  Labels: correctness
>
> *schemas :*
> df1
> => id ts
> df2
> => id name country
> *code:*
> val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull)
> *error*:
> org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing 
> from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in 
> operator !Filter isnull(id#0). Attribute(s) with the same name appear in the 
> operation: id. Please check if the right attribute(s) are used.;;
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
>     at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:172)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:178)
>     at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
>     at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300)
>     at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458)
>     at org.apache.spark.sql.Dataset.where(Dataset.scala:1486)
> This works fine in spark 2.2.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25051) where clause on dataset gives AnalysisException

2018-08-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25051:


Assignee: (was: Apache Spark)

> where clause on dataset gives AnalysisException
> ---
>
> Key: SPARK-25051
> URL: https://issues.apache.org/jira/browse/SPARK-25051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: MIK
>Priority: Major
>  Labels: correctness
>
> *schemas :*
> df1
> => id ts
> df2
> => id name country
> *code:*
> val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull)
> *error*:
> org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing 
> from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in 
> operator !Filter isnull(id#0). Attribute(s) with the same name appear in the 
> operation: id. Please check if the right attribute(s) are used.;;
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
>     at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:172)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:178)
>     at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
>     at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300)
>     at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458)
>     at org.apache.spark.sql.Dataset.where(Dataset.scala:1486)
> This works fine in spark 2.2.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25051) where clause on dataset gives AnalysisException

2018-08-14 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579514#comment-16579514
 ] 

Marco Gaido commented on SPARK-25051:
-

This was caused by the introduction of AnalysisBarrier. I will submit a PR for 
branch 2.3. On 2.4+ (current master) we no longer have this issue because 
AnalysisBarrier was removed. Anyway, this raises a question: shall we remove 
AnalysisBarrier from the 2.3 line too? In the current situation, backporting any 
analyzer fix to 2.3 is going to be painful.
cc [~rxin] [~cloud_fan]

> where clause on dataset gives AnalysisException
> ---
>
> Key: SPARK-25051
> URL: https://issues.apache.org/jira/browse/SPARK-25051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: MIK
>Priority: Major
>  Labels: correctness
>
> *schemas :*
> df1
> => id ts
> df2
> => id name country
> *code:*
> val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull)
> *error*:
> org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing 
> from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in 
> operator !Filter isnull(id#0). Attribute(s) with the same name appear in the 
> operation: id. Please check if the right attribute(s) are used.;;
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
>     at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:172)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:178)
>     at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
>     at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300)
>     at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458)
>     at org.apache.spark.sql.Dataset.where(Dataset.scala:1486)
> This works fine in spark 2.2.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25051) where clause on dataset gives AnalysisException

2018-08-14 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579444#comment-16579444
 ] 

Marco Gaido commented on SPARK-25051:
-

cc [~jerryshao] shall we set it as a blocker for 2.3.2?

> where clause on dataset gives AnalysisException
> ---
>
> Key: SPARK-25051
> URL: https://issues.apache.org/jira/browse/SPARK-25051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: MIK
>Priority: Major
>  Labels: correctness
>
> *schemas :*
> df1
> => id ts
> df2
> => id name country
> *code:*
> val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull)
> *error*:
> org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing 
> from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in 
> operator !Filter isnull(id#0). Attribute(s) with the same name appear in the 
> operation: id. Please check if the right attribute(s) are used.;;
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
>     at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:172)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:178)
>     at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
>     at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300)
>     at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458)
>     at org.apache.spark.sql.Dataset.where(Dataset.scala:1486)
> This works fine in spark 2.2.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25051) where clause on dataset gives AnalysisException

2018-08-14 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-25051:

Labels: correctness  (was: )

> where clause on dataset gives AnalysisException
> ---
>
> Key: SPARK-25051
> URL: https://issues.apache.org/jira/browse/SPARK-25051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: MIK
>Priority: Major
>  Labels: correctness
>
> *schemas :*
> df1
> => id ts
> df2
> => id name country
> *code:*
> val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull)
> *error*:
> org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing 
> from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in 
> operator !Filter isnull(id#0). Attribute(s) with the same name appear in the 
> operation: id. Please check if the right attribute(s) are used.;;
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
>     at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:172)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:178)
>     at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
>     at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300)
>     at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458)
>     at org.apache.spark.sql.Dataset.where(Dataset.scala:1486)
> This works fine in spark 2.2.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25068) High-order function: exists(array, function) → boolean

2018-08-14 Thread Marek Novotny (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579386#comment-16579386
 ] 

Marek Novotny commented on SPARK-25068:
---

That's a good point. Thanks for your answer!

> High-order function: exists(array, function) → boolean
> -
>
> Key: SPARK-25068
> URL: https://issues.apache.org/jira/browse/SPARK-25068
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.4.0
>
>
> Tests whether an array contains at least one element for which the given function returns true.
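
For anyone landing on this thread, a brief hedged sketch of what the new function looks like once it is available in 2.4.0 (the table name, column name, and data below are made up for illustration):

{code:scala}
import org.apache.spark.sql.SparkSession

object ExistsHofSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("exists higher-order function")
      .master("local[*]")
      .getOrCreate()

    // Illustrative data: a single array column named "xs".
    spark.sql("SELECT array(1, 2, 3) AS xs UNION ALL SELECT array(5, 7, 9)")
      .createOrReplaceTempView("t")

    // exists(array, function) -> boolean: true if at least one element of the
    // array satisfies the lambda, here "is the element even?".
    spark.sql("SELECT xs, exists(xs, x -> x % 2 = 0) AS has_even FROM t")
      .show(false)

    spark.stop()
  }
}
{code}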



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


