[jira] [Comment Edited] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces

2020-01-26 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023935#comment-17023935
 ] 

Jorge Machado edited comment on SPARK-23148 at 1/27/20 7:22 AM:


Hi [~hyukjin.kwon] and [~henryr], I have the same problem when I create a 
custom data source:

{code:scala}
class ImageFileValidator extends FileFormat with DataSourceRegister with Serializable
{code}

So the problem must be somewhere else. I created 
https://issues.apache.org/jira/browse/SPARK-30647 to track it.

Here is my trace:
{code:java}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 
213, localhost, executor driver): java.io.FileNotFoundException: File 
file:somePath/0019_leftImg8%20bit.png does not exist
It is possible the underlying files have been updated. You can explicitly 
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in 
SQL or by recreating the Dataset/DataFrame involved.
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
   at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
   at 
org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
   at 
org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
   at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
   at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)

{code}


was (Author: jomach):
Hi [~hyukjin.kwon] and [~henryr], I have the same problem when I create a 
custom data source:

{code:scala}
class ImageFileValidator extends FileFormat with DataSourceRegister with Serializable
{code}

So the problem must be somewhere else.

Here is my trace:
{code:java}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 
213, localhost, executor driver): java.io.FileNotFoundException: File 
file:somePath/0019_leftImg8%20bit.png does not exist
It is possible the underlying files have been updated. You can explicitly 
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in 
SQL or by recreating the Dataset/DataFrame involved.
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
   at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
   at 
org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
   at 
org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
   at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
   at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)

{code}

> spark.read.csv with multiline=true gives FileNotFoundException if path 
> contains spaces
> --
>
> Key: SPARK-23148
> URL: https://issues.apache.org/jira/browse/SPARK-23148
>   

[jira] [Updated] (SPARK-30647) When creating a custom datasource a FileNotFoundException happens

2020-01-26 Thread Jorge Machado (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado updated SPARK-30647:
--
Issue Type: Bug  (was: Improvement)

> When creating a custom datasource a FileNotFoundException happens
> 
>
> Key: SPARK-30647
> URL: https://issues.apache.org/jira/browse/SPARK-30647
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Jorge Machado
>Priority: Major
>
> Hello, I'm creating a datasource based on FileFormat and DataSourceRegister. 
> When I pass a path or a file that has a white space in it, it seems to fail 
> with the error:
> {code:java}
>  org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 
> (TID 213, localhost, executor driver): java.io.FileNotFoundException: File 
> file:somePath/0019_leftImg8%20bit.png does not exist It is possible the 
> underlying files have been updated. You can explicitly invalidate the cache 
> in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating 
> the Dataset/DataFrame involved. at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>  at 
> org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
>  at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
>  at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> {code}
> I'm happy to fix this if someone tells me where I need to look.
> I think it is in org.apache.spark.rdd.InputFileBlockHolder:
> {code:java}
> inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, 
> length))
> {code}
>  
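
A pointer for anyone debugging this, offered as an assumption rather than a
confirmed diagnosis: the filePath handed to a custom FileFormat in each
PartitionedFile appears to be a URI-encoded string (hence the %20), and as far
as I can tell Spark's own readers turn it back into a Hadoop Path via
new Path(new URI(...)) rather than new Path(String). A minimal sketch of a
reader-side helper under that assumption; the method name and signature are
invented for illustration:
{code:scala}
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FSDataInputStream, Path}
import org.apache.spark.sql.execution.datasources.PartitionedFile

// Hypothetical helper for a custom FileFormat's buildReader: open the split
// while decoding the URI-escaped path ("a%20b" -> "a b") first.
def openPartitionedFile(file: PartitionedFile, conf: Configuration): FSDataInputStream = {
  val path = new Path(new URI(file.filePath)) // decodes %20 back to a space
  path.getFileSystem(conf).open(path)
}
{code}
If the root cause really is in InputFileBlockHolder as suspected above, the
same encoded-versus-decoded distinction is the thing to look for there as well.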






[jira] [Created] (SPARK-30647) When creating a custom datasource a FileNotFoundException happens

2020-01-26 Thread Jorge Machado (Jira)
Jorge Machado created SPARK-30647:
-

 Summary: When creating a custom datasource a FileNotFoundException happens
 Key: SPARK-30647
 URL: https://issues.apache.org/jira/browse/SPARK-30647
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.2
Reporter: Jorge Machado


Hello, I'm creating a datasource based on FileFormat and DataSourceRegister.

When I pass a path or a file that has a white space in it, it seems to fail 
with the error:
{code:java}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 
213, localhost, executor driver): java.io.FileNotFoundException: File 
file:somePath/0019_leftImg8%20bit.png does not exist It is possible the 
underlying files have been updated. You can explicitly invalidate the cache in 
Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the 
Dataset/DataFrame involved. at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
 at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
 at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
 at 
org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
 at 
org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221) 
at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
 at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
{code}
I'm happy to fix this if someone tells me where I need to look.

I think it is in org.apache.spark.rdd.InputFileBlockHolder:
{code:java}
inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, 
length))
{code}
 






[jira] [Created] (SPARK-30646) transform_keys function throws exception as "Cannot use null as map key", but there isn't any null key in the map

2020-01-26 Thread Arun Jijo (Jira)
Arun Jijo created SPARK-30646:
-

 Summary: transform_keys function throws exception as "Cannot use 
null as map key", but there isn't any null key in the map 
 Key: SPARK-30646
 URL: https://issues.apache.org/jira/browse/SPARK-30646
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Arun Jijo


I have started experimenting with Spark 3.0's new SQL functions and along the 
way found an issue with the *transform_keys* function. It raises a "Cannot use 
null as map key" exception even though the map does not actually hold any null 
keys.

Find my Spark code below to reproduce the error.
{code:java}
val df = Seq(Map("EID_1" -> 1, "EID_2" -> 25000)).toDF("employees")
df.withColumn("employees", transform_keys($"employees", (k, v) => lit(k.+("XYX")))).show
{code}
Exception in thread "main" java.lang.RuntimeException: *Cannot use null as map 
key*.
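
A possible explanation, offered as a guess rather than a confirmed diagnosis:
k.+("XYX") resolves to the arithmetic + on Column, which casts the string keys
to a numeric type and yields null, and transform_keys then rejects the null
key. A sketch of the same suffixing done with concat instead, reusing the
column name and suffix from the repro above (a spark-shell session is assumed
for the implicits):
{code:scala}
import org.apache.spark.sql.functions.{concat, lit, transform_keys}
import spark.implicits._

val df = Seq(Map("EID_1" -> 1, "EID_2" -> 25000)).toDF("employees")

// String concatenation keeps the keys as strings, so no null map keys appear.
df.withColumn("employees",
  transform_keys($"employees", (k, v) => concat(k, lit("XYX")))).show(false)
{code}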






[jira] [Comment Edited] (SPARK-30643) Add support for embedding Hive 3

2020-01-26 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17024062#comment-17024062
 ] 

Dongjoon Hyun edited comment on SPARK-30643 at 1/27/20 2:36 AM:


It sounds like a misunderstanding of the role of the embedded Hive. It's only
used to talk to the Hive metastore.
{quote}But if I chose to run Hive 3 and Spark with embedded Hive 2.3, then 
SparkSQL and Hive queries behavior could differ in some cases.
{quote}
Everything (SQL Parser/Analyzer/Optimizer and execution engine) is Spark's own
code. So, in general, the embedded Hive 1.2/2.3 doesn't make a difference. The
exceptional cases might be Hive bugs. For example, Spark 3.0.0 will ship with
Hive 1.2 and Hive 2.3 (the default), and all UTs passed in both environments
with the same results.

For the following, I don't think Apache Spark needs to have Hive 1.2/2.3/3.1 in
the Apache Spark 3.x era. Adding 2.3 took so much effort away from the Apache
Spark community that it couldn't happen in Apache Spark 2.x. Maybe we can
revisit this issue for Apache Spark 4.0 if there are many users running Hive
3.x stably in production (not beta).
{quote}I think that majority of reasons that went into support of embedding 
Hive 2.3 will apply to support of embedding Hive 3.
{quote}
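
For readers who want the concrete knob behind "the embedded Hive is only used
to talk to the Hive metastore": the metastore client version is configurable
independently of the built-in execution classes. A minimal sketch with example
values (the version and jar source shown here are illustrative, not a
recommendation):
{code:scala}
import org.apache.spark.sql.SparkSession

// The built-in Hive classes only act as a metastore client; which metastore
// version they talk to is a runtime configuration.
val spark = SparkSession.builder()
  .appName("hive-metastore-client-demo")
  .config("spark.sql.hive.metastore.version", "2.3.3") // version of the external metastore
  .config("spark.sql.hive.metastore.jars", "maven")    // fetch a matching client from Maven
  .enableHiveSupport()
  .getOrCreate()
{code}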


was (Author: dongjoon):
It sounds like a misunderstanding of the role of the embedded Hive. It's only
used to talk to the Hive metastore.
{quote}But if I chose to run Hive 3 and Spark with embedded Hive 2.3, then 
SparkSQL and Hive queries behavior could differ in some cases.
{quote}
Everything (SQL Parser/Analyzer/Optimizer and execution engine) is Spark's own
code. So, in general, the embedded Hive 1.2/2.3 doesn't make a difference. The
exceptional cases might be Hive bugs. For example, Spark 3.0.0 will ship with
Hive 1.2 and Hive 2.3 (the default), and all UTs passed in both environments
with the same results.

For the following, I don't think Apache Spark needs to have Hive 1.2, Hive 2.3,
and Hive 3.1 in the Apache Spark 3.x era. Adding 2.3 took too much effort from
the Apache Spark community, so it couldn't happen in Apache Spark 2.x. Maybe we
can consider that for Apache Spark 4.0 if there are many users running Hive 3.x
stably in production (not beta).
{quote}I think that majority of reasons that went into support of embedding 
Hive 2.3 will apply to support of embedding Hive 3.
{quote}

> Add support for embedding Hive 3
> 
>
> Key: SPARK-30643
> URL: https://issues.apache.org/jira/browse/SPARK-30643
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Igor Dvorzhak
>Priority: Major
>
> Currently Spark can be compiled only against Hive 1.2.1 and Hive 2.3; 
> compilation against Hive 3 fails.






[jira] [Comment Edited] (SPARK-30643) Add support for embedding Hive 3

2020-01-26 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17024062#comment-17024062
 ] 

Dongjoon Hyun edited comment on SPARK-30643 at 1/27/20 2:28 AM:


It sounds like a misunderstanding of the role of the embedded Hive. It's only
used to talk to the Hive metastore.
{quote}But if I chose to run Hive 3 and Spark with embedded Hive 2.3, then 
SparkSQL and Hive queries behavior could differ in some cases.
{quote}
Everything (SQL Parser/Analyzer/Optimizer and execution engine) is Spark's own
code. So, in general, the embedded Hive 1.2/2.3 doesn't make a difference. The
exceptional cases might be Hive bugs. For example, Spark 3.0.0 will ship with
Hive 1.2 and Hive 2.3 (the default), and all UTs passed in both environments
with the same results.

For the following, I don't think Apache Spark needs to have Hive 1.2, Hive 2.3,
and Hive 3.1 in the Apache Spark 3.x era. Adding 2.3 took too much effort from
the Apache Spark community, so it couldn't happen in Apache Spark 2.x. Maybe we
can consider that for Apache Spark 4.0 if there are many users running Hive 3.x
stably in production (not beta).
{quote}I think that majority of reasons that went into support of embedding 
Hive 2.3 will apply to support of embedding Hive 3.
{quote}


was (Author: dongjoon):
It sounds like a misunderstanding of the role of the embedded Hive. It's only
used to talk to the Hive metastore.
> But if I chose to run Hive 3 and Spark with embedded Hive 2.3, then SparkSQL 
> and Hive queries behavior could differ in some cases.

Everything (SQL Parser/Analyzer/Optimizer and execution engine) is Spark's own
code. So, in general, the embedded Hive 1.2/2.3 doesn't make a difference. The
exceptional cases might be Hive bugs. For example, Spark 3.0.0 will ship with
Hive 1.2 and Hive 2.3 (the default), and all UTs passed in both environments
with the same results.

I don't think Apache Spark needs to have Hive 1.2, Hive 2.3, and Hive 3.1 in the
Apache Spark 3.x era. Adding 2.3 took too much effort from the Apache Spark
community, so it couldn't happen in Apache Spark 2.x. Maybe we can consider that
for Apache Spark 4.0 if there are many users running Hive 3.x stably in
production (not beta).
> I think that majority of reasons that went into support of embedding Hive 2.3 
> will apply to support of embedding Hive 3.


> Add support for embedding Hive 3
> 
>
> Key: SPARK-30643
> URL: https://issues.apache.org/jira/browse/SPARK-30643
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Igor Dvorzhak
>Priority: Major
>
> Currently Spark can be compiled only against Hive 1.2.1 and Hive 2.3; 
> compilation against Hive 3 fails.






[jira] [Commented] (SPARK-30643) Add support for embedding Hive 3

2020-01-26 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17024062#comment-17024062
 ] 

Dongjoon Hyun commented on SPARK-30643:
---

It sounds like a misunderstanding of the role of the embedded Hive. It's only
used to talk to the Hive metastore.
> But if I chose to run Hive 3 and Spark with embedded Hive 2.3, then SparkSQL 
> and Hive queries behavior could differ in some cases.

Everything (SQL Parser/Analyzer/Optimizer and execution engine) is Spark's own
code. So, in general, the embedded Hive 1.2/2.3 doesn't make a difference. The
exceptional cases might be Hive bugs. For example, Spark 3.0.0 will ship with
Hive 1.2 and Hive 2.3 (the default), and all UTs passed in both environments
with the same results.

I don't think Apache Spark needs to have Hive 1.2, Hive 2.3, and Hive 3.1 in the
Apache Spark 3.x era. Adding 2.3 took too much effort from the Apache Spark
community, so it couldn't happen in Apache Spark 2.x. Maybe we can consider that
for Apache Spark 4.0 if there are many users running Hive 3.x stably in
production (not beta).
> I think that majority of reasons that went into support of embedding Hive 2.3 
> will apply to support of embedding Hive 3.


> Add support for embedding Hive 3
> 
>
> Key: SPARK-30643
> URL: https://issues.apache.org/jira/browse/SPARK-30643
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Igor Dvorzhak
>Priority: Major
>
> Currently Spark can be compiled only against Hive 1.2.1 and Hive 2.3; 
> compilation against Hive 3 fails.






[jira] [Resolved] (SPARK-30640) Prevent unnecessary copies of data in Arrow to Pandas conversion with Timestamps

2020-01-26 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-30640.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27358
[https://github.com/apache/spark/pull/27358]

> Prevent unnecessary copies of data in Arrow to Pandas conversion with Timestamps
> --
>
> Key: SPARK-30640
> URL: https://issues.apache.org/jira/browse/SPARK-30640
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 3.0.0
>
>
> During conversion of Arrow to Pandas, timestamp columns are modified to 
> localize for the current timezone. If there are no timestamp columns, this 
> can sometimes result in unnecessary copies of the data. See 
> [https://www.mail-archive.com/dev@arrow.apache.org/msg17008.html] for 
> discussion.






[jira] [Assigned] (SPARK-30640) Prevent unnecessary copies of data in Arrow to Pandas conversion with Timestamps

2020-01-26 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-30640:


Assignee: Bryan Cutler

> Prevent unnecessary copies of data in Arrow to Pandas conversion with Timestamps
> --
>
> Key: SPARK-30640
> URL: https://issues.apache.org/jira/browse/SPARK-30640
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> During conversion of Arrow to Pandas, timestamp columns are modified to 
> localize for the current timezone. If there are no timestamp columns, this 
> can sometimes result in unnecessary copies of the data. See 
> [https://www.mail-archive.com/dev@arrow.apache.org/msg17008.html] for 
> discussion.






[jira] [Comment Edited] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces

2020-01-26 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023935#comment-17023935
 ] 

Jorge Machado edited comment on SPARK-23148 at 1/26/20 9:29 PM:


Hi [~hyukjin.kwon], I have the same problem when I create a custom data source:

{code:scala}
class ImageFileValidator extends FileFormat with DataSourceRegister with Serializable
{code}

So the problem must be somewhere else.

Here is my trace:
{code:java}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 
213, localhost, executor driver): java.io.FileNotFoundException: File 
file:somePath/0019_leftImg8%20bit.png does not exist
It is possible the underlying files have been updated. You can explicitly 
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in 
SQL or by recreating the Dataset/DataFrame involved.
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
   at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
   at 
org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
   at 
org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
   at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
   at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)

{code}


was (Author: jomach):
So I have the same problem when I create a custom data source:

{code:scala}
class ImageFileValidator extends FileFormat with DataSourceRegister with Serializable
{code}

The problem must therefore be somewhere else.

Here is my trace:
{code:java}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 
213, localhost, executor driver): java.io.FileNotFoundException: File 
file:somePath/0019_leftImg8%20bit.png does not exist
It is possible the underlying files have been updated. You can explicitly 
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in 
SQL or by recreating the Dataset/DataFrame involved.
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
   at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
   at 
org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
   at 
org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
   at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
   at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)

{code}

> spark.read.csv with multiline=true gives FileNotFoundException if path 
> contains spaces
> --
>
> Key: SPARK-23148
> URL: https://issues.apache.org/jira/browse/SPARK-23148
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> 

[jira] [Comment Edited] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces

2020-01-26 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023935#comment-17023935
 ] 

Jorge Machado edited comment on SPARK-23148 at 1/26/20 9:29 PM:


Hi [~hyukjin.kwon] and [~henryr], I have the same problem when I create a 
custom data source:

{code:scala}
class ImageFileValidator extends FileFormat with DataSourceRegister with Serializable
{code}

So the problem must be somewhere else.

Here is my trace:
{code:java}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 
213, localhost, executor driver): java.io.FileNotFoundException: File 
file:somePath/0019_leftImg8%20bit.png does not exist
It is possible the underlying files have been updated. You can explicitly 
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in 
SQL or by recreating the Dataset/DataFrame involved.
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
   at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
   at 
org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
   at 
org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
   at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
   at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)

{code}


was (Author: jomach):
Hi [~hyukjin.kwon], I have the same problem when I create a custom data source:

{code:scala}
class ImageFileValidator extends FileFormat with DataSourceRegister with Serializable
{code}

So the problem must be somewhere else.

Here is my trace:
{code:java}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 
213, localhost, executor driver): java.io.FileNotFoundException: File 
file:somePath/0019_leftImg8%20bit.png does not exist
It is possible the underlying files have been updated. You can explicitly 
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in 
SQL or by recreating the Dataset/DataFrame involved.
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
   at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
   at 
org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
   at 
org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
   at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
   at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)

{code}

> spark.read.csv with multiline=true gives FileNotFoundException if path 
> contains spaces
> --
>
> Key: SPARK-23148
> URL: https://issues.apache.org/jira/browse/SPARK-23148
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>

[jira] [Commented] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces

2020-01-26 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023935#comment-17023935
 ] 

Jorge Machado commented on SPARK-23148:
---

So I have the same problem when I create a custom data source:

{code:scala}
class ImageFileValidator extends FileFormat with DataSourceRegister with Serializable
{code}

The problem must therefore be somewhere else.

Here is my trace:
{code:java}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 
213, localhost, executor driver): java.io.FileNotFoundException: File 
file:somePath/0019_leftImg8%20bit.png does not exist
It is possible the underlying files have been updated. You can explicitly 
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in 
SQL or by recreating the Dataset/DataFrame involved.
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
   at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
   at 
org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
   at 
org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
   at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
   at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)

{code}

> spark.read.csv with multiline=true gives FileNotFoundException if path 
> contains spaces
> --
>
> Key: SPARK-23148
> URL: https://issues.apache.org/jira/browse/SPARK-23148
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bogdan Raducanu
>Assignee: Henry Robinson
>Priority: Major
> Fix For: 2.3.0
>
>
> Repro code:
> {code:java}
> spark.range(10).write.csv("/tmp/a b c/a.csv")
> spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count
> 10
> spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count
> java.io.FileNotFoundException: File 
> file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv
>  does not exist
> {code}
> Trying to manually escape fails in a different place:
> {code}
> spark.read.option("multiLine", true).csv("/tmp/a%20b%20c/a.csv").count
> org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/tmp/a%20b%20c/a.csv;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:683)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
> {code}
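
A side note that may help readers interpret the %20 in the errors above: the
already-encoded path string ends up in a Hadoop Path constructor that treats it
as a literal file name. A small illustration (my own snippet, not code from
Spark):
{code:scala}
import java.net.URI
import org.apache.hadoop.fs.Path

val encoded = "file:/tmp/a%20b%20c/a.csv"

// Built straight from the string, %20 stays a literal part of the name,
// so the lookup misses the directory that is really called "a b c".
println(new Path(encoded))          // file:/tmp/a%20b%20c/a.csv

// Going through java.net.URI first decodes the escapes back to spaces.
println(new Path(new URI(encoded))) // file:/tmp/a b c/a.csv
{code}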






[jira] [Resolved] (SPARK-30314) Add identifier and catalog information to DataSourceV2Relation

2020-01-26 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-30314.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/26957]

> Add identifier and catalog information to DataSourceV2Relation
> --
>
> Key: SPARK-30314
> URL: https://issues.apache.org/jira/browse/SPARK-30314
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuchen Huo
>Assignee: Yuchen Huo
>Priority: Major
> Fix For: 3.0.0
>
>
> Add identifier and catalog information in DataSourceV2Relation so it would be 
> possible to do richer checks in *checkAnalysis* step.






[jira] [Assigned] (SPARK-30314) Add identifier and catalog information to DataSourceV2Relation

2020-01-26 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-30314:
---

Assignee: Yuchen Huo

> Add identifier and catalog information to DataSourceV2Relation
> --
>
> Key: SPARK-30314
> URL: https://issues.apache.org/jira/browse/SPARK-30314
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuchen Huo
>Assignee: Yuchen Huo
>Priority: Major
>
> Add identifier and catalog information in DataSourceV2Relation so it would be 
> possible to do richer checks in *checkAnalysis* step.






[jira] [Commented] (SPARK-16483) Unifying struct fields and columns

2020-01-26 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-16483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023920#comment-17023920
 ] 

fqaiser94 commented on SPARK-16483:
---

This is very similar to 
[SPARK-22231|https://issues.apache.org/jira/browse/SPARK-22231#] where there 
has been more discussion. 

> Unifying struct fields and columns
> --
>
> Key: SPARK-16483
> URL: https://issues.apache.org/jira/browse/SPARK-16483
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Simeon Simeonov
>Priority: Major
>  Labels: sql
>
> This issue comes as a result of an exchange with Michael Armbrust outside of 
> the usual JIRA/dev list channels.
> DataFrame provides a full set of manipulation operations for top-level 
> columns: they can be added, removed, modified, and renamed. The same is not 
> true for fields inside structs, yet, from a logical standpoint, Spark users 
> may very well want to perform the same operations on struct fields, 
> especially since automatic schema discovery from JSON input tends to create 
> deeply nested structs.
> Common use-cases include:
>  - Remove and/or rename struct field(s) to adjust the schema
>  - Fix a data quality issue with a struct field (update/rewrite)
> To do this with the existing API by hand requires manually calling 
> {{named_struct}} and listing all fields, including ones we don't want to 
> manipulate. This leads to complex, fragile code that cannot survive schema 
> evolution.
> It would be far better if the various APIs that can now manipulate top-level 
> columns were extended to handle struct fields at arbitrary locations or, 
> alternatively, if we introduced new APIs for modifying any field in a 
> dataframe, whether it is a top-level one or one nested inside a struct.
> Purely for discussion purposes (overloaded methods are not shown):
> {code:java}
> class Column(val expr: Expression) extends Logging {
>   // ...
>   // matches Dataset.schema semantics
>   def schema: StructType
>   // matches Dataset.select() semantics
>   // '* support allows multiple new fields to be added easily, saving 
> cumbersome repeated withColumn() calls
>   def select(cols: Column*): Column
>   // matches Dataset.withColumn() semantics of add or replace
>   def withColumn(colName: String, col: Column): Column
>   // matches Dataset.drop() semantics
>   def drop(colName: String): Column
> }
> class Dataset[T] ... {
>   // ...
>   // Equivalent to sparkSession.createDataset(toDF.rdd, newSchema)
>   def cast(newSchema: StructType): DataFrame
> }
> {code}
> The benefit of the above API is that it unifies manipulating top-level & 
> nested columns. The addition of {{schema}} and {{select()}} to {{Column}} 
> allows for nested field reordering, casting, etc., which is important in data 
> exchange scenarios where field position matters. That's also the reason to 
> add {{cast}} to {{Dataset}}: it improves consistency and readability (with 
> method chaining). Another way to think of {{Dataset.cast}} is as the Spark 
> schema equivalent of {{Dataset.as}}. {{as}} is to {{cast}} as a Scala 
> encodable type is to a {{StructType}} instance.
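
To make the named_struct pain point above concrete, here is a small sketch of
today's manual approach using the DataFrame API's struct() with aliases (my own
example; the frame and field names are invented). Renaming one nested field
forces every untouched field to be re-listed, which is exactly what breaks when
the schema evolves:
{code:scala}
import org.apache.spark.sql.functions.{col, struct}
import spark.implicits._

// Toy frame with a single struct column: meta = {id, ts, misc}.
val df = Seq((1, 10L, "a", 1.5), (2, 20L, "b", 2.5))
  .toDF("id", "ts", "misc", "value")
  .select(struct("id", "ts", "misc").as("meta"), col("value"))

// Rename meta.misc to meta.note: the struct has to be rebuilt field by field.
val renamed = df.withColumn("meta", struct(
  col("meta.id").as("id"),     // untouched, but must still be listed
  col("meta.ts").as("ts"),     // untouched, but must still be listed
  col("meta.misc").as("note")  // the only field we actually wanted to change
))

renamed.printSchema()
{code}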






[jira] [Comment Edited] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-26 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023919#comment-17023919
 ] 

fqaiser94 edited comment on SPARK-22231 at 1/26/20 8:35 PM:


[~rxin] no problems, take your time.

Apparently the Spark 3.0 code freeze is coming up soon, so for those who are 
interested in seeing these features sooner rather than later, I wanted to share 
a [library|https://github.com/fqaiser94/mse] I've written that adds 
{{withField}}, {{withFieldRenamed}}, and {{dropFields}} methods to the Column 
class implicitly. The signatures of the methods follow pretty much what we've 
discussed so far in this ticket. Hopefully it's helpful to others.
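
For anyone curious how methods get "added to the Column class implicitly", the
usual mechanism is a Scala implicit class. The sketch below shows only that
mechanism and is not the mse library's actual API; the method name, signature,
and the caller-supplied field list are all invented:
{code:scala}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.struct

object StructColumnSyntax {
  // Importing StructColumnSyntax._ makes the extra method available on any Column.
  implicit class StructColumnOps(val c: Column) {
    // Hypothetical: add or replace a field by rebuilding the struct from a
    // caller-supplied list of the existing field names.
    def withFieldAdded(existingFields: Seq[String], name: String, value: Column): Column = {
      val kept = existingFields.filterNot(_ == name).map(f => c.getField(f).as(f))
      struct((kept :+ value.as(name)): _*)
    }
  }
}

// Usage (hypothetical):
//   import StructColumnSyntax._
//   df.select($"meta".withFieldAdded(Seq("id", "ts"), "flag", lit(true)).as("meta"))
{code}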


was (Author: fqaiser94):
@rxin no problems, take your time.

Apparently the Spark 3.0 code freeze is coming up soon, so for those who are 
interested in seeing these features sooner rather than later, I wanted to share 
a [library|https://github.com/fqaiser94/mse] I've written that adds 
{{withField}}, {{withFieldRenamed}}, and {{dropFields}} methods to the Column 
class implicitly. The signatures of the methods follow pretty much what we've 
discussed so far in this ticket. Hopefully it's helpful to others.

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Jeremy Smith
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find the great 
> content to fulfill the unique tastes of our members. Before building 
> recommendation algorithms, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the context describing the setting where a model is to be used (e.g. 
> profiles, country, time, device, etc.). Here is a blog post describing the 
> details, [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranks of videos for a given profile_id in a 
> given country, our data schema looks like this:
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- title_id: integer (nullable = true)
>  |||-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying some 
> functions on them. Sometimes, we're dropping or adding new columns in the 
> nested list of structs. Currently, there is no easy solution in open source 
> Apache Spark to perform those operations using SQL primitives; many people 
> just convert the data into an RDD to work on the nested level of data, and 
> then reconstruct a new dataframe as a workaround. This is extremely 
> inefficient because optimizations like predicate pushdown in SQL cannot be 
> performed, we cannot leverage the columnar format, and the serialization and 
> deserialization cost becomes really high even when we just want to add a new 
> column at the nested level.
> We built a solution internally at Netflix which we're very happy with. We 
> plan to make it open source in Spark upstream. We would like to socialize the 
> API design to see if we missed any use cases.
> The first API we added is *mapItems* on a dataframe, which takes a function 
> from *Column* to *Column* and applies it to the nested dataframe. Here is an 
> example:
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: double (containsNull = true)
> result.show()
> // +---+++
> // |foo| bar|   items|
> // +---+++
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+++
> {code}
> Now, with the ability to apply a function to the nested dataframe, we can 
> add a new function, *withColumn* on *Column*, to add or replace the existing 
> column that has the same name in the nested list of structs.

[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-26 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023919#comment-17023919
 ] 

fqaiser94 commented on SPARK-22231:
---

@rxin no problems, take your time.

Apparently the Spark 3.0 code freeze is coming up soon, so for those who are 
interested in seeing these features sooner rather than later, I wanted to share 
a [library|https://github.com/fqaiser94/mse] I've written that adds 
{{withField}}, {{withFieldRenamed}}, and {{dropFields}} methods to the Column 
class implicitly. The signatures of the methods follow pretty much what we've 
discussed so far in this ticket. Hopefully it's helpful to others.

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Jeremy Smith
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find the great 
> content to fulfill the unique tastes of our members. Before building 
> recommendation algorithms, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the context describing the setting where a model is to be used (e.g. 
> profiles, country, time, device, etc.). Here is a blog post describing the 
> details, [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranks of videos for a given profile_id in a 
> given country, our data schema looks like this:
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- title_id: integer (nullable = true)
>  |||-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying some 
> functions on them. Sometimes, we're dropping or adding new columns in the 
> nested list of structs. Currently, there is no easy solution in open source 
> Apache Spark to perform those operations using SQL primitives; many people 
> just convert the data into an RDD to work on the nested level of data, and 
> then reconstruct a new dataframe as a workaround. This is extremely 
> inefficient because optimizations like predicate pushdown in SQL cannot be 
> performed, we cannot leverage the columnar format, and the serialization and 
> deserialization cost becomes really high even when we just want to add a new 
> column at the nested level.
> We built a solution internally at Netflix which we're very happy with. We 
> plan to make it open source in Spark upstream. We would like to socialize the 
> API design to see if we missed any use cases.
> The first API we added is *mapItems* on a dataframe, which takes a function 
> from *Column* to *Column* and applies it to the nested dataframe. Here is an 
> example:
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: double (containsNull = true)
> result.show()
> // +---+++
> // |foo| bar|   items|
> // +---+++
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+++
> {code}
> Now, with the ability to apply a function to the nested dataframe, we can 
> add a new function, *withColumn* on *Column*, to add or replace the existing 
> column that has the same name in the nested list of structs. Here are two 
> examples demonstrating the API together with *mapItems*; the first one 
> replaces the existing column:
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, items: Seq[Item])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))),
>   Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0)))
> ))
> val result = df.mapItems("items") {
>   item => item.withColumn(item("b") + 1 as "b")
> }
> result.printSchema
> // root
> // |-- foo: integer (nullable = false)