[jira] [Commented] (SPARK-24959) Do not invoke the CSV/JSON parser for empty schema

2019-01-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756983#comment-16756983
 ] 

Apache Spark commented on SPARK-24959:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/23708

> Do not invoke the CSV/JSON parser for empty schema
> --
>
> Key: SPARK-24959
> URL: https://issues.apache.org/jira/browse/SPARK-24959
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, the JSON and CSV parsers are invoked even if the required schema is 
> empty. Invoking the parser for each line has non-zero overhead, and that work 
> can be skipped. Such an optimization should speed up count(), for example.
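For illustration (an editor's hedged sketch, not part of the original report): a bare
count() references no columns, so the required schema pushed down to the CSV/JSON data
source is empty, which is the case this optimization targets. The file path below is
hypothetical.

{code:scala}
// Hypothetical input path; any non-multiline CSV or JSON file would do.
val df = spark.read.option("header", "true").csv("/tmp/people.csv")

// count() does not reference any column, so the data source is asked for an
// empty required schema; with the optimization the per-line parse can be skipped.
val n = df.count()

// By contrast, this aggregation needs the "age" column, so parsing still happens.
df.selectExpr("max(age)").show()
{code}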



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24959) Do not invoke the CSV/JSON parser for empty schema

2019-01-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24959:


Assignee: Apache Spark

> Do not invoke the CSV/JSON parser for empty schema
> --
>
> Key: SPARK-24959
> URL: https://issues.apache.org/jira/browse/SPARK-24959
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, the JSON and CSV parsers are invoked even if the required schema is 
> empty. Invoking the parser for each line has non-zero overhead, and that work 
> can be skipped. Such an optimization should speed up count(), for example.






[jira] [Assigned] (SPARK-24959) Do not invoke the CSV/JSON parser for empty schema

2019-01-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24959:


Assignee: (was: Apache Spark)

> Do not invoke the CSV/JSON parser for empty schema
> --
>
> Key: SPARK-24959
> URL: https://issues.apache.org/jira/browse/SPARK-24959
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, the JSON and CSV parsers are invoked even if the required schema is 
> empty. Invoking the parser for each line has non-zero overhead, and that work 
> can be skipped. Such an optimization should speed up count(), for example.






[jira] [Resolved] (SPARK-26745) Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines

2019-01-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26745.
--
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 3.0.0

Fixed in https://github.com/apache/spark/pull/23667

> Non-parsing Dataset.count() optimization causes inconsistent results for JSON 
> inputs with empty lines
> -
>
> Key: SPARK-26745
> URL: https://issues.apache.org/jira/browse/SPARK-26745
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Branden Smith
>Assignee: Hyukjin Kwon
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.0.0
>
>
> The optimization introduced by 
> [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959] (improving 
> performance of {{count()}} for DataFrames read from 
> non-multiline JSON in {{PERMISSIVE}} mode) appears to 
> cause {{count()}} to erroneously include empty lines in 
> its result total if run prior to JSON parsing taking place.
> For the following input:
> {code:json}
> { "a" : 1 , "b" : 2 , "c" : 3 }
> { "a" : 4 , "b" : 5 , "c" : 6 }
>  
> { "a" : 7 , "b" : 8 , "c" : 9 }
> {code}
> *+Spark 2.3:+*
> {code:scala}
> scala> val df = 
> spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
> df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]
> scala> df.count
> res0: Long = 3
> scala> df.cache.count
> res3: Long = 3
> {code}
> *+Spark 2.4:+*
> {code:scala}
> scala> val df = 
> spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
> df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]
> scala> df.count
> res0: Long = 7
> scala> df.cache.count
> res1: Long = 3
> {code}
> Since the count is apparently updated and cached when the Jackson parser 
> runs, the optimization also causes the count to appear to be unstable upon 
> cache/persist operations, as shown above.
> CSV inputs, also optimized via 
> [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959], do not 
> appear to be impacted by this effect.






[jira] [Reopened] (SPARK-24959) Do not invoke the CSV/JSON parser for empty schema

2019-01-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-24959:
--
  Assignee: (was: Maxim Gekk)

Reverted by SPARK-26745.

> Do not invoke the CSV/JSON parser for empty schema
> --
>
> Key: SPARK-24959
> URL: https://issues.apache.org/jira/browse/SPARK-24959
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, the JSON and CSV parsers are invoked even if the required schema is 
> empty. Invoking the parser for each line has non-zero overhead, and that work 
> can be skipped. Such an optimization should speed up count(), for example.






[jira] [Resolved] (SPARK-24360) Support Hive 3.1 metastore

2019-01-30 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-24360.
---
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/23694

> Support Hive 3.1 metastore
> --
>
> Key: SPARK-24360
> URL: https://issues.apache.org/jira/browse/SPARK-24360
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> Hive 3.1.0 has been released. This issue aims to support Hive Metastore 3.1.
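As an illustration of how a Spark build typically talks to a newer metastore (an editor's
hedged sketch, not taken from the pull request): the metastore client version and jars are
selected via the `spark.sql.hive.metastore.version` and `spark.sql.hive.metastore.jars`
configurations. The application name and jar path below are placeholders.

{code:scala}
import org.apache.spark.sql.SparkSession

// Hedged sketch: point Spark's Hive support at a Hive 3.1 metastore.
// The jars path is a placeholder for wherever the Hive 3.1 client jars live.
val spark = SparkSession.builder()
  .appName("hive-3.1-metastore-example")
  .config("spark.sql.hive.metastore.version", "3.1")
  .config("spark.sql.hive.metastore.jars", "/path/to/hive-3.1/lib/*")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
{code}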






[jira] [Commented] (SPARK-12216) Spark failed to delete temp directory

2019-01-30 Thread Kingsley Jones (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756826#comment-16756826
 ] 

Kingsley Jones commented on SPARK-12216:


Well, that fits. No source-code formatter for PowerShell!

> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.52
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark 
> temp dir: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> java.io.IOException: Failed to delete: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60)
> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60)
> at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at scala.util.Try$.apply(Try.scala:161)
> at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
> at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)






[jira] [Commented] (SPARK-12216) Spark failed to delete temp directory

2019-01-30 Thread Kingsley Jones (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756825#comment-16756825
 ] 

Kingsley Jones commented on SPARK-12216:


{code:powershell}
# Launch a local Apache Spark REPL and clean up its temp directories on close.

# Launch spark-shell and save its process id.
$sparkid = (Start-Process spark-shell -PassThru).Id
$currdir = Get-Location

# Wait until the spark-shell process exits before running the cleanup below.
Wait-Process -Id $sparkid

# Once execution reaches here, spark-shell has exited and we can clean up.
Set-Location $env:TEMP

Remove-Item spark* -Recurse -Force
Remove-Item jansi* -Recurse -Force
Remove-Item hsperfdata* -Recurse -Force

Set-Location $currdir
{code}

> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.52
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark 
> temp dir: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> java.io.IOException: Failed to delete: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60)
> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60)
> at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at scala.util.Try$.apply(Try.scala:161)
> at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
> at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)






[jira] [Updated] (SPARK-26791) Some scala codes doesn't show friendly and some description about foreachBatch is misleading

2019-01-30 Thread chaiyongqiang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chaiyongqiang updated SPARK-26791:
--
Attachment: multi-watermark.jpg
foreachBatch.jpg

> Some scala codes doesn't show friendly and some description about 
> foreachBatch is misleading
> 
>
> Key: SPARK-26791
> URL: https://issues.apache.org/jira/browse/SPARK-26791
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.0
> Environment: NA
>Reporter: chaiyongqiang
>Priority: Minor
> Attachments: foreachBatch.jpg, multi-watermark.jpg
>
>
> [Introduction about 
> foreachbatch|http://spark.apache.org/docs/2.4.0/structured-streaming-programming-guide.html#foreachbatch]
> [Introduction about 
> policy-for-handling-multiple-watermarks|http://spark.apache.org/docs/2.4.0/structured-streaming-programming-guide.html#policy-for-handling-multiple-watermarks]
> The introductions to foreachBatch and 
> policy-for-handling-multiple-watermarks do not render their Scala code well.
> Besides, the foreachBatch section refers to an uncache API which doesn't 
> exist, which may be misleading.






[jira] [Commented] (SPARK-17998) Reading Parquet files coalesces parts into too few in-memory partitions

2019-01-30 Thread Nicholas Resnick (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756801#comment-16756801
 ] 

Nicholas Resnick commented on SPARK-17998:
--

I reproduced the OP's steps above on my local machine and got 5 instead of 7.

I've also noticed this behavior on an EMR cluster, where the same Dataset has 
been read into a varying number of partitions, seemingly nondeterministically. 
The trend I've noticed is that the number of partitions seems to equal the 
number of cores allocated to the Spark app at the time of the read. Could the 
number of cores impact this? Has this changed since the last comment (1/2018)?
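The number of cores can indeed matter. As a hedged sketch (an editor's paraphrase from
memory of the file-splitting logic in Spark's file sources, not the exact source), the
target split size is derived from `spark.sql.files.maxPartitionBytes`,
`spark.sql.files.openCostInBytes`, and the default parallelism, which tracks the number
of cores available:

{code:scala}
// Hedged sketch of how Spark 2.x sizes read partitions for file-based sources.
// The config names are real; the function itself is an illustration.
def maxSplitBytes(
    totalDataBytes: Long,     // sum of file sizes, plus one openCost per file
    defaultParallelism: Int,  // typically the number of cores available
    maxPartitionBytes: Long = 128 * 1024 * 1024, // spark.sql.files.maxPartitionBytes
    openCostInBytes: Long = 4 * 1024 * 1024      // spark.sql.files.openCostInBytes
  ): Long = {
  val bytesPerCore = totalDataBytes / defaultParallelism
  // Small files are then bin-packed into splits of this size.
  math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
}

// Rough arithmetic for the report above: 13 tiny part files at ~4 MB open cost each
// with defaultParallelism = 6 gives a split size of roughly 8-9 MB, which packs the
// parts into about 7 read partitions; more cores shrink bytesPerCore and push the
// partition count back toward 13.
{code}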

> Reading Parquet files coalesces parts into too few in-memory partitions
> ---
>
> Key: SPARK-17998
> URL: https://issues.apache.org/jira/browse/SPARK-17998
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: Spark Standalone Cluster (not "local mode")
> Windows 10 and Windows 7
> Python 3.x
>Reporter: Shea Parkes
>Priority: Major
>
> Reading a Parquet file into a DataFrame is resulting in far too few 
> in-memory partitions.  In prior versions of Spark, the resulting DataFrame 
> would have a number of partitions often equal to the number of parts in the 
> Parquet folder.
> Here's a minimal reproducible sample:
> {code:python}
> df_first = session.range(start=1, end=1, numPartitions=13)
> assert df_first.rdd.getNumPartitions() == 13
> assert session._sc.defaultParallelism == 6
> path_scrap = r"c:\scratch\scrap.parquet"
> df_first.write.parquet(path_scrap)
> df_second = session.read.parquet(path_scrap)
> print(df_second.rdd.getNumPartitions())
> {code}
> The above shows only 7 partitions in the DataFrame that was created by 
> reading the Parquet back into memory for me.  Why is it no longer just the 
> number of part files in the Parquet folder?  (Which is 13 in the example 
> above.)
> I'm filing this as a bug because it has gotten so bad that we can't work with 
> the underlying RDD without first repartitioning the DataFrame, which is 
> costly and wasteful.  I really doubt this was the intended effect of moving 
> to Spark 2.0.
> I've tried to research where the number of in-memory partitions is 
> determined, but my Scala skills have proven inadequate.  I'd be happy to dig 
> further if someone could point me in the right direction...






[jira] [Commented] (SPARK-26786) Handle to treat escaped newline characters('\r','\n') in spark csv

2019-01-30 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756795#comment-16756795
 ] 

Hyukjin Kwon commented on SPARK-26786:
--

Are you saying the newlines should be escaped even when they are quoted?

> Handle to treat escaped newline characters('\r','\n') in spark csv
> --
>
> Key: SPARK-26786
> URL: https://issues.apache.org/jira/browse/SPARK-26786
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: vishnuram selvaraj
>Priority: Major
>
> There are some systems, like AWS Redshift, which write CSV files by escaping 
> newline characters ('\r', '\n') in addition to escaping the quote characters, 
> if they come as part of the data.
> Redshift documentation link 
> ([https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html]); below 
> is their description of the escaping requirements:
> ESCAPE
> For CHAR and VARCHAR columns in delimited unload files, an escape character 
> (\) is placed before every occurrence of the following characters:
>  * Linefeed: {{\n}}
>  * Carriage return: {{\r}}
>  * The delimiter character specified for the unloaded data.
>  * The escape character: \
>  * A quote character: {{"}} or {{'}} (if both ESCAPE and ADDQUOTES are 
> specified in the UNLOAD command).
>  
> *Problem statement:* 
> The Spark CSV reader doesn't have an option to treat/remove the escape 
> characters in front of the newline characters in the data.
> It would really help if we could add a feature to handle the escaped newline 
> characters through another parameter like (escapeNewline = 'true/false').
> *Example:*
> Below are the details of my test data set up in a file.
>  * The first record in that file has an escaped Windows newline character (\r\n)
>  * The third record in that file has an escaped Unix newline character (\n)
>  * The fifth record in that file has an escaped quote character (")
> the file looks like below in vi editor:
>  
> {code:java}
> "1","this is \^M\
> line1"^M
> "2","this is line2"^M
> "3","this is \
> line3"^M
> "4","this is \" line4"^M
> "5","this is line5"^M{code}
>  
> When I read the file in python's csv module with escape, it is able to remove 
> the added escape characters as you can see below,
>  
> {code:java}
> >>> with open('/tmp/test3.csv','r') as readCsv:
> ... readFile = 
> csv.reader(readCsv,dialect='excel',escapechar='\\',quotechar='"',delimiter=',',doublequote=False)
> ... for row in readFile:
> ... print(row)
> ...
> ['1', 'this is \r\n line1']
> ['2', 'this is line2']
> ['3', 'this is \n line3']
> ['4', 'this is " line4']
> ['5', 'this is line5']
> {code}
> But if I read the same file with the Spark CSV reader, the escape characters 
> in front of the newline characters are not removed. But the escape before the 
> (") is removed.
> {code:java}
> >>> redDf=spark.read.csv(path='file:///tmp/test3.csv',header='false',sep=',',quote='"',escape='\\',multiLine='true',ignoreLeadingWhiteSpace='true',ignoreTrailingWhiteSpace='true',mode='FAILFAST',inferSchema='false')
> >>> redDf.show()
> +---+--+
> |_c0| _c1|
> +---+--+
> | 1|this is \
> line1|
> | 2| this is line2|
> | 3| this is \
> line3|
> | 4| this is " line4|
> | 5| this is line5|
> +---+--+
> {code}
>  *Expected result:*
> {code:java}
> +---+--+
> |_c0| _c1|
> +---+--+
> | 1|this is 
> line1|
> | 2| this is line2|
> | 3| this is 
> line3|
> | 4| this is " line4|
> | 5| this is line5|
> +---+--+
> {code}
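As a possible workaround (an editor's hedged sketch, not a built-in Spark option and not
part of the original report): parse the file as in the example above and then strip, in a
post-processing pass, the stray backslash that precedes each embedded CR/LF. The path and
column handling follow the example; `regexp_replace` is a standard Spark SQL function.

{code:scala}
import org.apache.spark.sql.functions.{col, regexp_replace}

// Parse as in the report above (multiLine + escape='\\').
val redDf = spark.read
  .option("header", "false").option("sep", ",")
  .option("quote", "\"").option("escape", "\\")
  .option("multiLine", "true")
  .csv("file:///tmp/test3.csv")

// Post-process every column: drop a backslash that sits directly before a CR or LF.
// The pattern "\\\\([\\r\\n])" matches a literal backslash followed by \r or \n and
// the replacement keeps only the captured newline character.
val cleaned = redDf.columns.foldLeft(redDf) { (df, c) =>
  df.withColumn(c, regexp_replace(col(c), "\\\\([\\r\\n])", "$1"))
}
cleaned.show(truncate = false)
{code}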






[jira] [Assigned] (SPARK-26793) Remove spark.shuffle.manager

2019-01-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26793:


Assignee: (was: Apache Spark)

> Remove spark.shuffle.manager
> 
>
> Key: SPARK-26793
> URL: https://issues.apache.org/jira/browse/SPARK-26793
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> Currently, `ShuffleManager` is always `SortShuffleManager`, so I think this 
> configuration can be removed.






[jira] [Assigned] (SPARK-26793) Remove spark.shuffle.manager

2019-01-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26793:


Assignee: Apache Spark

> Remove spark.shuffle.manager
> 
>
> Key: SPARK-26793
> URL: https://issues.apache.org/jira/browse/SPARK-26793
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, `ShuffleManager` is always `SortShuffleManager`, so I think this 
> configuration can be removed.






[jira] [Created] (SPARK-26793) Remove spark.shuffle.manager

2019-01-30 Thread liuxian (JIRA)
liuxian created SPARK-26793:
---

 Summary: Remove spark.shuffle.manager
 Key: SPARK-26793
 URL: https://issues.apache.org/jira/browse/SPARK-26793
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: liuxian


Currently, `ShuffleManager` is always `SortShuffleManager`, so I think this 
configuration can be removed.
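For context (an editor's hedged paraphrase from memory of SparkEnv's resolution logic,
not a verbatim quote of the source): both recognized values of `spark.shuffle.manager`
already resolve to the same implementation, which is why the setting no longer selects
anything.

{code:scala}
// Rough sketch of the resolution: both short names map to SortShuffleManager,
// so the configuration cannot change which shuffle manager is used.
val sortShuffleManagerClass = "org.apache.spark.shuffle.sort.SortShuffleManager"
val shortShuffleMgrNames = Map(
  "sort" -> sortShuffleManagerClass,
  "tungsten-sort" -> sortShuffleManagerClass)

def resolveShuffleManagerClass(setting: String): String =
  shortShuffleMgrNames.getOrElse(setting.toLowerCase(java.util.Locale.ROOT), setting)

assert(resolveShuffleManagerClass("sort") == resolveShuffleManagerClass("tungsten-sort"))
{code}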






[jira] [Commented] (SPARK-26792) Apply custom log URL to Spark UI

2019-01-30 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756754#comment-16756754
 ] 

Jungtaek Lim commented on SPARK-26792:
--

OK. Maybe initiating a discussion on the dev mailing list (or user mailing list?) 
would be good before taking action. Thanks for the suggestion.

> Apply custom log URL to Spark UI
> 
>
> Key: SPARK-26792
> URL: https://issues.apache.org/jira/browse/SPARK-26792
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> SPARK-23155 enables the SHS to set up custom log URLs for incomplete / completed 
> apps.
> While getting reviews on SPARK-23155, I received two comments suggesting that 
> applying custom log URLs to the UI would help as well. Quoting these comments here:
> https://github.com/apache/spark/pull/23260#issuecomment-456827963
> {quote}
> Sorry I haven't had time to look through all the code so this might be a 
> separate jira, but one thing I thought of here is it would be really nice not 
> to have specifically stderr/stdout. users can specify any log4j.properties 
> and some tools like oozie by default end up using hadoop log4j rather then 
> spark log4j, so files aren't necessarily the same. Also users can put in 
> other logs files so it would be nice to have links to those from the UI. It 
> seems simpler if we just had a link to the directory and it read the files 
> within there. Other things in Hadoop do it this way, but I'm not sure if that 
> works well for other resource managers, any thoughts on that? As long as this 
> doesn't prevent the above I can file a separate jira for it.
> {quote}
> https://github.com/apache/spark/pull/23260#issuecomment-456904716
> {quote}
> Hi Tom, +1: singling out stdout and stderr is definitely an annoyance. We
> typically configure Spark jobs to write the GC log and dump heap on OOM
> using ,  and/or we use the rolling file appender to deal with
> large logs during debugging. So linking the YARN container log overview
> page would make much more sense for us. We work it around with a custom
> submit process that logs all important URLs on the submit side log.
> {quote}






[jira] [Commented] (SPARK-26792) Apply custom log URL to Spark UI

2019-01-30 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756750#comment-16756750
 ] 

Thomas Graves commented on SPARK-26792:
---

I don't see a problem with changing the default in 3.0; it's a major release and 
we are allowed to change APIs and such, so this is no different.  If enough 
people think it's a better user experience we should change the default.  It 
would be good to get feedback from more users.

> Apply custom log URL to Spark UI
> 
>
> Key: SPARK-26792
> URL: https://issues.apache.org/jira/browse/SPARK-26792
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> SPARK-23155 enables the SHS to set up custom log URLs for incomplete / completed 
> apps.
> While getting reviews on SPARK-23155, I received two comments suggesting that 
> applying custom log URLs to the UI would help as well. Quoting these comments here:
> https://github.com/apache/spark/pull/23260#issuecomment-456827963
> {quote}
> Sorry I haven't had time to look through all the code so this might be a 
> separate jira, but one thing I thought of here is it would be really nice not 
> to have specifically stderr/stdout. users can specify any log4j.properties 
> and some tools like oozie by default end up using hadoop log4j rather then 
> spark log4j, so files aren't necessarily the same. Also users can put in 
> other logs files so it would be nice to have links to those from the UI. It 
> seems simpler if we just had a link to the directory and it read the files 
> within there. Other things in Hadoop do it this way, but I'm not sure if that 
> works well for other resource managers, any thoughts on that? As long as this 
> doesn't prevent the above I can file a separate jira for it.
> {quote}
> https://github.com/apache/spark/pull/23260#issuecomment-456904716
> {quote}
> Hi Tom, +1: singling out stdout and stderr is definitely an annoyance. We
> typically configure Spark jobs to write the GC log and dump heap on OOM
> using ,  and/or we use the rolling file appender to deal with
> large logs during debugging. So linking the YARN container log overview
> page would make much more sense for us. We work it around with a custom
> submit process that logs all important URLs on the submit side log.
> {quote}






[jira] [Commented] (SPARK-26792) Apply custom log URL to Spark UI

2019-01-30 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756747#comment-16756747
 ] 

Jungtaek Lim commented on SPARK-26792:
--

cc. [~tgraves] [~jira.shegalov]

While I'm not sure we can just change the default (which would change the user 
experience), we can safely apply custom log URLs to the UI as well, setting them to 
the YARN container log overview page (I guess it's equivalent to the link to the 
directory). What do you think?

> Apply custom log URL to Spark UI
> 
>
> Key: SPARK-26792
> URL: https://issues.apache.org/jira/browse/SPARK-26792
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> SPARK-23155 enables the SHS to set up custom log URLs for incomplete / completed 
> apps.
> While getting reviews on SPARK-23155, I received two comments suggesting that 
> applying custom log URLs to the UI would help as well. Quoting these comments here:
> https://github.com/apache/spark/pull/23260#issuecomment-456827963
> {quote}
> Sorry I haven't had time to look through all the code so this might be a 
> separate jira, but one thing I thought of here is it would be really nice not 
> to have specifically stderr/stdout. users can specify any log4j.properties 
> and some tools like oozie by default end up using hadoop log4j rather then 
> spark log4j, so files aren't necessarily the same. Also users can put in 
> other logs files so it would be nice to have links to those from the UI. It 
> seems simpler if we just had a link to the directory and it read the files 
> within there. Other things in Hadoop do it this way, but I'm not sure if that 
> works well for other resource managers, any thoughts on that? As long as this 
> doesn't prevent the above I can file a separate jira for it.
> {quote}
> https://github.com/apache/spark/pull/23260#issuecomment-456904716
> {quote}
> Hi Tom, +1: singling out stdout and stderr is definitely an annoyance. We
> typically configure Spark jobs to write the GC log and dump heap on OOM
> using ,  and/or we use the rolling file appender to deal with
> large logs during debugging. So linking the YARN container log overview
> page would make much more sense for us. We work it around with a custom
> submit process that logs all important URLs on the submit side log.
> {quote}






[jira] [Created] (SPARK-26792) Apply custom log URL to Spark UI

2019-01-30 Thread Jungtaek Lim (JIRA)
Jungtaek Lim created SPARK-26792:


 Summary: Apply custom log URL to Spark UI
 Key: SPARK-26792
 URL: https://issues.apache.org/jira/browse/SPARK-26792
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


SPARK-23155 enables the SHS to set up custom log URLs for incomplete / completed 
apps.

While getting reviews on SPARK-23155, I received two comments suggesting that 
applying custom log URLs to the UI would help as well. Quoting these comments here:

https://github.com/apache/spark/pull/23260#issuecomment-456827963

{quote}
Sorry I haven't had time to look through all the code so this might be a 
separate jira, but one thing I thought of here is it would be really nice not 
to have specifically stderr/stdout. users can specify any log4j.properties and 
some tools like oozie by default end up using hadoop log4j rather then spark 
log4j, so files aren't necessarily the same. Also users can put in other logs 
files so it would be nice to have links to those from the UI. It seems simpler 
if we just had a link to the directory and it read the files within there. 
Other things in Hadoop do it this way, but I'm not sure if that works well for 
other resource managers, any thoughts on that? As long as this doesn't prevent 
the above I can file a separate jira for it.
{quote}

https://github.com/apache/spark/pull/23260#issuecomment-456904716

{quote}
Hi Tom, +1: singling out stdout and stderr is definitely an annoyance. We
typically configure Spark jobs to write the GC log and dump heap on OOM
using ,  and/or we use the rolling file appender to deal with
large logs during debugging. So linking the YARN container log overview
page would make much more sense for us. We work it around with a custom
submit process that logs all important URLs on the submit side log.
{quote}
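For reference (an editor's hedged illustration, not from the original discussion):
SPARK-23155 exposes the history-server side of this through
`spark.history.custom.executor.log.url`, whose value is a pattern with placeholders
filled from executor attributes. The placeholder names and the NodeManager URL shape
below are assumptions for the example; applying the same pattern to the live UI is
what this ticket proposes.

{code:scala}
import org.apache.spark.SparkConf

// Hedged sketch: a custom executor log URL pattern for the Spark History Server.
// Placeholder names ({{NM_HOST}}, {{NM_HTTP_PORT}}, {{CONTAINER_ID}}, {{USER}},
// {{FILE_NAME}}) are assumed for illustration; check the SPARK-23155 documentation
// for the exact set supported by your build.
val conf = new SparkConf()
  .set("spark.history.custom.executor.log.url",
    "http://{{NM_HOST}}:{{NM_HTTP_PORT}}/node/containerlogs/{{CONTAINER_ID}}/{{USER}}/{{FILE_NAME}}")
{code}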






[jira] [Created] (SPARK-26791) Some scala codes doesn't show friendly and some description about foreachBatch is misleading

2019-01-30 Thread chaiyongqiang (JIRA)
chaiyongqiang created SPARK-26791:
-

 Summary: Some scala codes doesn't show friendly and some 
description about foreachBatch is misleading
 Key: SPARK-26791
 URL: https://issues.apache.org/jira/browse/SPARK-26791
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.4.0
 Environment: NA
Reporter: chaiyongqiang


[Introduction about 
foreachbatch|http://spark.apache.org/docs/2.4.0/structured-streaming-programming-guide.html#foreachbatch]

[Introduction about 
policy-for-handling-multiple-watermarks|http://spark.apache.org/docs/2.4.0/structured-streaming-programming-guide.html#policy-for-handling-multiple-watermarks]

The introductions to foreachBatch and policy-for-handling-multiple-watermarks 
do not render their Scala code well.

Besides, the foreachBatch section refers to an uncache API which doesn't 
exist, which may be misleading.
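For context (an editor's hedged sketch of what the guide's foreachBatch example presumably
intends, not the guide's exact text): the Dataset API has persist()/unpersist() but no
uncache() method, so reusing a micro-batch across several sinks looks roughly like this.
The paths and output format are placeholders.

{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("foreachBatch-sketch").getOrCreate()
val streamingDF = spark.readStream.format("rate").load()

// Reuse each micro-batch for two sinks: cache it once, release it afterwards.
// Note: Dataset has persist()/unpersist(), not an "uncache" method.
val query = streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.persist()
  batchDF.write.format("parquet").mode("append").save("/tmp/sink-a") // placeholder path
  batchDF.write.format("parquet").mode("append").save("/tmp/sink-b") // placeholder path
  batchDF.unpersist()
  ()
}.start()
{code}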






[jira] [Commented] (SPARK-26154) Stream-stream joins - left outer join gives inconsistent output

2019-01-30 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756716#comment-16756716
 ] 

Sean Owen commented on SPARK-26154:
---

Sure, it's a judgment call. If you see people contemplating it as blocking a 
release, I think that change would be fair to make. You can update other JIRA 
elements if you're pretty sure, given your experience with the project, that it 
should have a different label or whatever. 

What we don't want is simply people with no experience in the project marking 
things Blocker and setting a bunch of unhelpful flags like 'Important' or 
tagging it 'spark'. If you know enough to be thoughtful about it, I bet your 
edits would be OK. If in doubt, ask.

> Stream-stream joins - left outer join gives inconsistent output
> ---
>
> Key: SPARK-26154
> URL: https://issues.apache.org/jira/browse/SPARK-26154
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.2
> Environment: Spark version - Spark 2.3.2
> OS- Suse 11
>Reporter: Haripriya
>Priority: Critical
>  Labels: correctness
>
> Stream-stream joins using a left outer join give inconsistent output. 
> Data that has already been processed once is processed again and yields a null 
> value. In Batch 2, the input record "3" is processed, but in Batch 6 a null 
> value is produced again for the same data.
> Steps
> In spark-shell
> {code:java}
> scala> import org.apache.spark.sql.functions.{col, expr}
> import org.apache.spark.sql.functions.{col, expr}
> scala> import org.apache.spark.sql.streaming.Trigger
> import org.apache.spark.sql.streaming.Trigger
> scala> val lines_stream1 = spark.readStream.
>  |   format("kafka").
>  |   option("kafka.bootstrap.servers", "ip:9092").
>  |   option("subscribe", "topic1").
>  |   option("includeTimestamp", true).
>  |   load().
>  |   selectExpr("CAST (value AS String)","CAST(timestamp AS 
> TIMESTAMP)").as[(String,Timestamp)].
>  |   select(col("value") as("data"),col("timestamp") 
> as("recordTime")).
>  |   select("data","recordTime").
>  |   withWatermark("recordTime", "5 seconds ")
> lines_stream1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
> [data: string, recordTime: timestamp]
> scala> val lines_stream2 = spark.readStream.
>  |   format("kafka").
>  |   option("kafka.bootstrap.servers", "ip:9092").
>  |   option("subscribe", "topic2").
>  |   option("includeTimestamp", value = true).
>  |   load().
>  |   selectExpr("CAST (value AS String)","CAST(timestamp AS 
> TIMESTAMP)").as[(String,Timestamp)].
>  |   select(col("value") as("data1"),col("timestamp") 
> as("recordTime1")).
>  |   select("data1","recordTime1").
>  |   withWatermark("recordTime1", "10 seconds ")
> lines_stream2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
> [data1: string, recordTime1: timestamp]
> scala> val query = lines_stream1.join(lines_stream2, expr (
>  |   """
>  | | data == data1 and
>  | | recordTime1 >= recordTime and
>  | | recordTime1 <= recordTime + interval 5 seconds
>  |   """.stripMargin),"left").
>  |   writeStream.
>  |   option("truncate","false").
>  |   outputMode("append").
>  |   format("console").option("checkpointLocation", 
> "/tmp/leftouter/").
>  |   trigger(Trigger.ProcessingTime ("5 seconds")).
>  |   start()
> query: org.apache.spark.sql.streaming.StreamingQuery = 
> org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@1a48f55b
> {code}
> Step2 : Start producing data
> kafka-console-producer.sh --broker-list ip:9092 --topic topic1
>  >1
>  >2
>  >3
>  >4
>  >5
>  >aa
>  >bb
>  >cc
> kafka-console-producer.sh --broker-list ip:9092 --topic topic2
>  >2
>  >2
>  >3
>  >4
>  >5
>  >aa
>  >cc
>  >ee
>  >ee
>  
> Output obtained:
> {code:java}
> Batch: 0
> ---
> ++--+-+---+
> |data|recordTime|data1|recordTime1|
> ++--+-+---+
> ++--+-+---+
> ---
> Batch: 1
> ---
> ++--+-+---+
> |data|recordTime|data1|recordTime1|
> ++--+-+---+
> ++--+-+---+
> ---
> Batch: 2
> ---
> ++---+-+---+
> |data|recordTime |data1|recordTime1|
> ++---+-+---+
> |3   |2018-11-22 20:09:35.053|3

[jira] [Commented] (SPARK-26154) Stream-stream joins - left outer join gives inconsistent output

2019-01-30 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756709#comment-16756709
 ] 

Jungtaek Lim commented on SPARK-26154:
--

[~srowen]
Yeah, I just wanted to avoid changing the priority myself, given that I frequently 
see comments that 'critical' and 'blocker' are reserved for committers. 
Thanks for justifying and changing it!

> Stream-stream joins - left outer join gives inconsistent output
> ---
>
> Key: SPARK-26154
> URL: https://issues.apache.org/jira/browse/SPARK-26154
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.2
> Environment: Spark version - Spark 2.3.2
> OS- Suse 11
>Reporter: Haripriya
>Priority: Critical
>  Labels: correctness
>
> Stream-stream joins using a left outer join give inconsistent output. 
> Data that has already been processed once is processed again and yields a null 
> value. In Batch 2, the input record "3" is processed, but in Batch 6 a null 
> value is produced again for the same data.
> Steps
> In spark-shell
> {code:java}
> scala> import org.apache.spark.sql.functions.{col, expr}
> import org.apache.spark.sql.functions.{col, expr}
> scala> import org.apache.spark.sql.streaming.Trigger
> import org.apache.spark.sql.streaming.Trigger
> scala> val lines_stream1 = spark.readStream.
>  |   format("kafka").
>  |   option("kafka.bootstrap.servers", "ip:9092").
>  |   option("subscribe", "topic1").
>  |   option("includeTimestamp", true).
>  |   load().
>  |   selectExpr("CAST (value AS String)","CAST(timestamp AS 
> TIMESTAMP)").as[(String,Timestamp)].
>  |   select(col("value") as("data"),col("timestamp") 
> as("recordTime")).
>  |   select("data","recordTime").
>  |   withWatermark("recordTime", "5 seconds ")
> lines_stream1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
> [data: string, recordTime: timestamp]
> scala> val lines_stream2 = spark.readStream.
>  |   format("kafka").
>  |   option("kafka.bootstrap.servers", "ip:9092").
>  |   option("subscribe", "topic2").
>  |   option("includeTimestamp", value = true).
>  |   load().
>  |   selectExpr("CAST (value AS String)","CAST(timestamp AS 
> TIMESTAMP)").as[(String,Timestamp)].
>  |   select(col("value") as("data1"),col("timestamp") 
> as("recordTime1")).
>  |   select("data1","recordTime1").
>  |   withWatermark("recordTime1", "10 seconds ")
> lines_stream2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
> [data1: string, recordTime1: timestamp]
> scala> val query = lines_stream1.join(lines_stream2, expr (
>  |   """
>  | | data == data1 and
>  | | recordTime1 >= recordTime and
>  | | recordTime1 <= recordTime + interval 5 seconds
>  |   """.stripMargin),"left").
>  |   writeStream.
>  |   option("truncate","false").
>  |   outputMode("append").
>  |   format("console").option("checkpointLocation", 
> "/tmp/leftouter/").
>  |   trigger(Trigger.ProcessingTime ("5 seconds")).
>  |   start()
> query: org.apache.spark.sql.streaming.StreamingQuery = 
> org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@1a48f55b
> {code}
> Step2 : Start producing data
> kafka-console-producer.sh --broker-list ip:9092 --topic topic1
>  >1
>  >2
>  >3
>  >4
>  >5
>  >aa
>  >bb
>  >cc
> kafka-console-producer.sh --broker-list ip:9092 --topic topic2
>  >2
>  >2
>  >3
>  >4
>  >5
>  >aa
>  >cc
>  >ee
>  >ee
>  
> Output obtained:
> {code:java}
> Batch: 0
> ---
> ++--+-+---+
> |data|recordTime|data1|recordTime1|
> ++--+-+---+
> ++--+-+---+
> ---
> Batch: 1
> ---
> ++--+-+---+
> |data|recordTime|data1|recordTime1|
> ++--+-+---+
> ++--+-+---+
> ---
> Batch: 2
> ---
> ++---+-+---+
> |data|recordTime |data1|recordTime1|
> ++---+-+---+
> |3   |2018-11-22 20:09:35.053|3|2018-11-22 20:09:36.506|
> |2   |2018-11-22 20:09:31.613|2|2018-11-22 20:09:33.116|
> ++---+-+---+
> ---
> Batch: 3
> ---
> ++---+-+---+
> |data|recordTime 

[jira] [Updated] (SPARK-26154) Stream-stream joins - left outer join gives inconsistent output

2019-01-30 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26154:
--
  Labels: correctness  (was: )
Priority: Critical  (was: Major)

[~kabhwan] I think everyone is able to modify the labels and priority (not 
actually by design; we just can't restrict it). I made this 'Critical', though 
priorities other than 'Blocker' don't mean a lot. And labeled it 'correctness'.

> Stream-stream joins - left outer join gives inconsistent output
> ---
>
> Key: SPARK-26154
> URL: https://issues.apache.org/jira/browse/SPARK-26154
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.2
> Environment: Spark version - Spark 2.3.2
> OS- Suse 11
>Reporter: Haripriya
>Priority: Critical
>  Labels: correctness
>
> Stream-stream joins using a left outer join give inconsistent output. 
> Data that has already been processed once is processed again and yields a null 
> value. In Batch 2, the input record "3" is processed, but in Batch 6 a null 
> value is produced again for the same data.
> Steps
> In spark-shell
> {code:java}
> scala> import org.apache.spark.sql.functions.{col, expr}
> import org.apache.spark.sql.functions.{col, expr}
> scala> import org.apache.spark.sql.streaming.Trigger
> import org.apache.spark.sql.streaming.Trigger
> scala> val lines_stream1 = spark.readStream.
>  |   format("kafka").
>  |   option("kafka.bootstrap.servers", "ip:9092").
>  |   option("subscribe", "topic1").
>  |   option("includeTimestamp", true).
>  |   load().
>  |   selectExpr("CAST (value AS String)","CAST(timestamp AS 
> TIMESTAMP)").as[(String,Timestamp)].
>  |   select(col("value") as("data"),col("timestamp") 
> as("recordTime")).
>  |   select("data","recordTime").
>  |   withWatermark("recordTime", "5 seconds ")
> lines_stream1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
> [data: string, recordTime: timestamp]
> scala> val lines_stream2 = spark.readStream.
>  |   format("kafka").
>  |   option("kafka.bootstrap.servers", "ip:9092").
>  |   option("subscribe", "topic2").
>  |   option("includeTimestamp", value = true).
>  |   load().
>  |   selectExpr("CAST (value AS String)","CAST(timestamp AS 
> TIMESTAMP)").as[(String,Timestamp)].
>  |   select(col("value") as("data1"),col("timestamp") 
> as("recordTime1")).
>  |   select("data1","recordTime1").
>  |   withWatermark("recordTime1", "10 seconds ")
> lines_stream2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
> [data1: string, recordTime1: timestamp]
> scala> val query = lines_stream1.join(lines_stream2, expr (
>  |   """
>  | | data == data1 and
>  | | recordTime1 >= recordTime and
>  | | recordTime1 <= recordTime + interval 5 seconds
>  |   """.stripMargin),"left").
>  |   writeStream.
>  |   option("truncate","false").
>  |   outputMode("append").
>  |   format("console").option("checkpointLocation", 
> "/tmp/leftouter/").
>  |   trigger(Trigger.ProcessingTime ("5 seconds")).
>  |   start()
> query: org.apache.spark.sql.streaming.StreamingQuery = 
> org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@1a48f55b
> {code}
> Step2 : Start producing data
> kafka-console-producer.sh --broker-list ip:9092 --topic topic1
>  >1
>  >2
>  >3
>  >4
>  >5
>  >aa
>  >bb
>  >cc
> kafka-console-producer.sh --broker-list ip:9092 --topic topic2
>  >2
>  >2
>  >3
>  >4
>  >5
>  >aa
>  >cc
>  >ee
>  >ee
>  
> Output obtained:
> {code:java}
> Batch: 0
> ---
> ++--+-+---+
> |data|recordTime|data1|recordTime1|
> ++--+-+---+
> ++--+-+---+
> ---
> Batch: 1
> ---
> ++--+-+---+
> |data|recordTime|data1|recordTime1|
> ++--+-+---+
> ++--+-+---+
> ---
> Batch: 2
> ---
> ++---+-+---+
> |data|recordTime |data1|recordTime1|
> ++---+-+---+
> |3   |2018-11-22 20:09:35.053|3|2018-11-22 20:09:36.506|
> |2   |2018-11-22 20:09:31.613|2|2018-11-22 20:09:33.116|
> ++---+-+---+
> ---
> Batch: 3
> ---
> 

[jira] [Updated] (SPARK-26786) Handle to treat escaped newline characters('\r','\n') in spark csv

2019-01-30 Thread vishnuram selvaraj (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vishnuram selvaraj updated SPARK-26786:
---
Affects Version/s: (was: 2.4.0)
   2.3.0

> Handle to treat escaped newline characters('\r','\n') in spark csv
> --
>
> Key: SPARK-26786
> URL: https://issues.apache.org/jira/browse/SPARK-26786
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: vishnuram selvaraj
>Priority: Major
>
> There are some systems, like AWS Redshift, which write CSV files by escaping 
> newline characters ('\r', '\n') in addition to escaping the quote characters, 
> if they come as part of the data.
> Redshift documentation link 
> ([https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html]); below 
> is their description of the escaping requirements:
> ESCAPE
> For CHAR and VARCHAR columns in delimited unload files, an escape character 
> (\) is placed before every occurrence of the following characters:
>  * Linefeed: {{\n}}
>  * Carriage return: {{\r}}
>  * The delimiter character specified for the unloaded data.
>  * The escape character: \
>  * A quote character: {{"}} or {{'}} (if both ESCAPE and ADDQUOTES are 
> specified in the UNLOAD command).
>  
> *Problem statement:* 
> The Spark CSV reader doesn't have an option to treat/remove the escape 
> characters in front of the newline characters in the data.
> It would really help if we could add a feature to handle the escaped newline 
> characters through another parameter like (escapeNewline = 'true/false').
> *Example:*
> Below are the details of my test data set up in a file.
>  * The first record in that file has an escaped Windows newline character (\r\n)
>  * The third record in that file has an escaped Unix newline character (\n)
>  * The fifth record in that file has an escaped quote character (")
> the file looks like below in vi editor:
>  
> {code:java}
> "1","this is \^M\
> line1"^M
> "2","this is line2"^M
> "3","this is \
> line3"^M
> "4","this is \" line4"^M
> "5","this is line5"^M{code}
>  
> When I read the file in python's csv module with escape, it is able to remove 
> the added escape characters as you can see below,
>  
> {code:java}
> >>> with open('/tmp/test3.csv','r') as readCsv:
> ... readFile = 
> csv.reader(readCsv,dialect='excel',escapechar='\\',quotechar='"',delimiter=',',doublequote=False)
> ... for row in readFile:
> ... print(row)
> ...
> ['1', 'this is \r\n line1']
> ['2', 'this is line2']
> ['3', 'this is \n line3']
> ['4', 'this is " line4']
> ['5', 'this is line5']
> {code}
> But if I read the same file with the Spark CSV reader, the escape characters 
> in front of the newline characters are not removed. But the escape before the 
> (") is removed.
> {code:java}
> >>> redDf=spark.read.csv(path='file:///tmp/test3.csv',header='false',sep=',',quote='"',escape='\\',multiLine='true',ignoreLeadingWhiteSpace='true',ignoreTrailingWhiteSpace='true',mode='FAILFAST',inferSchema='false')
> >>> redDf.show()
> +---+--+
> |_c0| _c1|
> +---+--+
> | 1|this is \
> line1|
> | 2| this is line2|
> | 3| this is \
> line3|
> | 4| this is " line4|
> | 5| this is line5|
> +---+--+
> {code}
>  *Expected result:*
> {code:java}
> +---+--+
> |_c0| _c1|
> +---+--+
> | 1|this is 
> line1|
> | 2| this is line2|
> | 3| this is 
> line3|
> | 4| this is " line4|
> | 5| this is line5|
> +---+--+
> {code}






[jira] [Assigned] (SPARK-26790) Yarn executor to self-retrieve log urls and attributes

2019-01-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26790:


Assignee: Apache Spark

> Yarn executor to self-retrieve log urls and attributes
> --
>
> Key: SPARK-26790
> URL: https://issues.apache.org/jira/browse/SPARK-26790
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-23155 enables retrieving executor attributes from the YARN container via 
> the `Container` YARN API, which works, but we still have the following concerns:
> - some pieces of information are not provided (NM port, etc.)
> - we provide them via the container launch environment on the executor, which is 
> not the same approach used for the driver log in yarn-cluster mode
> This issue tracks the effort to address the above by letting the YARN executor 
> retrieve its attributes itself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26786) Handle to treat escaped newline characters('\r','\n') in spark csv

2019-01-30 Thread vishnuram selvaraj (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vishnuram selvaraj updated SPARK-26786:
---
Description: 
There are some systems, like AWS Redshift, which write CSV files by escaping 
newline characters ('\r', '\n') in addition to escaping the quote characters, if 
they come as part of the data.

Redshift documentation link 
([https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html]); below 
is their description of the escaping requirements:

ESCAPE

For CHAR and VARCHAR columns in delimited unload files, an escape character 
(\) is placed before every occurrence of the following characters:
 * Linefeed: {{\n}}

 * Carriage return: {{\r}}

 * The delimiter character specified for the unloaded data.

 * The escape character: \

 * A quote character: {{"}} or {{'}} (if both ESCAPE and ADDQUOTES are 
specified in the UNLOAD command).

 

*Problem statement:* 

The Spark CSV reader doesn't have an option to treat/remove the escape 
characters in front of the newline characters in the data.

It would really help if we could add a feature to handle the escaped newline 
characters through another parameter like (escapeNewline = 'true/false').

*Example:*

Below are the details of my test data set up in a file.
 * The first record in that file has an escaped Windows newline character (\r\n)
 * The third record in that file has an escaped Unix newline character (\n)
 * The fifth record in that file has an escaped quote character (")

the file looks like below in vi editor:

 
{code:java}
"1","this is \^M\
line1"^M
"2","this is line2"^M
"3","this is \
line3"^M
"4","this is \" line4"^M
"5","this is line5"^M{code}
 

When I read the file in python's csv module with escape, it is able to remove 
the added escape characters as you can see below,

 
{code:java}
>>> with open('/tmp/test3.csv','r') as readCsv:
... readFile = 
csv.reader(readCsv,dialect='excel',escapechar='\\',quotechar='"',delimiter=',',doublequote=False)
... for row in readFile:
... print(row)
...
['1', 'this is \r\n line1']
['2', 'this is line2']
['3', 'this is \n line3']
['4', 'this is " line4']
['5', 'this is line5']
{code}
But if I read the same file with the Spark CSV reader, the escape characters in front 
of the newline characters are not removed. But the escape before the (") is 
removed.
{code:java}
>>> redDf=spark.read.csv(path='file:///tmp/test3.csv',header='false',sep=',',quote='"',escape='\\',multiLine='true',ignoreLeadingWhiteSpace='true',ignoreTrailingWhiteSpace='true',mode='FAILFAST',inferSchema='false')
>>> redDf.show()
+---+--+
|_c0| _c1|
+---+--+
| 1|this is \
line1|
| 2| this is line2|
| 3| this is \
line3|
| 4| this is " line4|
| 5| this is line5|
+---+--+
{code}
 *Expected result:*
{code:java}
+---+--+
|_c0| _c1|
+---+--+
| 1|this is 
line1|
| 2| this is line2|
| 3| this is 
line3|
| 4| this is " line4|
| 5| this is line5|
+---+--+
{code}

  was:
There are some systems like AWS redshift which writes csv files by escaping 
newline characters('\r','\n') in addition to escaping the quote characters, if 
they come as part of the data.

Redshift documentation 
link([https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html)] and below 
is their mention of escaping requirements in the mentioned link

ESCAPE

For CHAR and VARCHAR columns in delimited unload files, an escape character 
({{\}}) is placed before every occurrence of the following characters:
 * Linefeed: {{\n}}

 * Carriage return: {{\r}}

 * The delimiter character specified for the unloaded data.

 * The escape character: {{\}}

 * A quote character: {{"}} or {{'}} (if both ESCAPE and ADDQUOTES are 
specified in the UNLOAD command).

 

*Problem statement:* 

But the spark CSV reader doesn't have a handle to treat/remove the escape 
characters infront of the newline characters in the data.

It would really help if we can add a feature to handle the escaped newline 
characters through another parameter like (escapeNewline = 'true/false').

*Example:*

Below are the details of my test data set up in a file.
 * The first record in that file has an escaped Windows newline character (\r\n)
 * The third record in that file has an escaped Unix newline character (\n)
 * The fifth record in that file has the escaped quote character (\")

the file looks like below in vi editor:

 
{code:java}
"1","this is \^M\
line1"^M
"2","this is line2"^M
"3","this is \
line3"^M
"4","this is \" line4"^M
"5","this is line5"^M{code}
 

When I read the file in python's csv module with escape, it is able to remove 
the added escape characters as you can see below,

 
{code:java}
>>> with open('/tmp/test3.csv','r') as readCsv:
... readFile = 
csv.reader(readCsv,dialect='excel',escapechar='\\',quotechar='"',delimiter=',',doublequote=False)
... for row in readFile:
... print(row)

[jira] [Updated] (SPARK-26786) Handle to treat escaped newline characters('\r','\n') in spark csv

2019-01-30 Thread vishnuram selvaraj (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vishnuram selvaraj updated SPARK-26786:
---
Component/s: SQL
 PySpark

> Handle to treat escaped newline characters('\r','\n') in spark csv
> --
>
> Key: SPARK-26786
> URL: https://issues.apache.org/jira/browse/SPARK-26786
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: vishnuram selvaraj
>Priority: Major
>
> Some systems, such as AWS Redshift, write CSV files by escaping newline 
> characters ('\r', '\n') in addition to escaping the quote characters, when 
> they appear as part of the data.
> The Redshift documentation 
> ([https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html]) describes 
> the escaping requirements as follows:
> ESCAPE
> For CHAR and VARCHAR columns in delimited unload files, an escape character 
> ({{\}}) is placed before every occurrence of the following characters:
>  * Linefeed: {{\n}}
>  * Carriage return: {{\r}}
>  * The delimiter character specified for the unloaded data.
>  * The escape character: {{\}}
>  * A quote character: {{"}} or {{'}} (if both ESCAPE and ADDQUOTES are 
> specified in the UNLOAD command).
>  
> *Problem statement:* 
> The Spark CSV reader doesn't have an option to treat/remove the escape 
> characters in front of the newline characters in the data.
> It would really help if we could add a feature to handle escaped newline 
> characters through another parameter such as (escapeNewline = 'true'/'false').
> *Example:*
> Below are the details of my test data, set up in a file:
>  * The first record in the file has an escaped Windows newline character (\r\n)
>  * The third record has an escaped Unix newline character (\n)
>  * The fifth record has an escaped quote character (\")
> The file looks like this in the vi editor:
>  
> {code:java}
> "1","this is \^M\
> line1"^M
> "2","this is line2"^M
> "3","this is \
> line3"^M
> "4","this is \" line4"^M
> "5","this is line5"^M{code}
>  
> When I read the file using Python's csv module with an escape character 
> configured, it removes the added escape characters, as you can see below:
>  
> {code:java}
> >>> with open('/tmp/test3.csv','r') as readCsv:
> ... readFile = 
> csv.reader(readCsv,dialect='excel',escapechar='\\',quotechar='"',delimiter=',',doublequote=False)
> ... for row in readFile:
> ... print(row)
> ...
> ['1', 'this is \r\n line1']
> ['2', 'this is line2']
> ['3', 'this is \n line3']
> ['4', 'this is " line4']
> ['5', 'this is line5']
> {code}
> But if I read the same file with the Spark CSV reader, the escape characters 
> in front of the newline characters are not removed, while the escape before 
> the quote (\") is removed.
> {code:java}
> >>> redDf=spark.read.csv(path='file:///tmp/test3.csv',header='false',sep=',',quote='"',escape='\\',multiLine='true',ignoreLeadingWhiteSpace='true',ignoreTrailingWhiteSpace='true',mode='FAILFAST',inferSchema='false')
> >>> redDf.show()
> +---+--+
> |_c0| _c1|
> +---+--+
> \ 1|this is \
> line1|
> | 2| this is line2|
> | 3| this is \
> line3|
> | 4| this is " line4|
> | 5| this is line5|
> +---+--+
> {code}
>  *Expected result:*
> {code:java}
> +---+--+
> |_c0| _c1|
> +---+--+
> | 1|this is 
> line1|
> | 2| this is line2|
> | 3| this is 
> line3|
> | 4| this is " line4|
> | 5| this is line5|
> +---+--+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26790) Yarn executor to self-retrieve log urls and attributes

2019-01-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26790:


Assignee: (was: Apache Spark)

> Yarn executor to self-retrieve log urls and attributes
> --
>
> Key: SPARK-26790
> URL: https://issues.apache.org/jira/browse/SPARK-26790
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> SPARK-23155 enables retrieving executor attributes from the YARN container via 
> the `Container` YARN API, which works, but we still have the following concerns:
> - some information is not provided (NM port, etc.)
> - we provide the attributes as container launch environment variables on the 
> executor, which is not the same approach used for driver logs in yarn-cluster mode
> This issue tracks the effort to address the above by letting the YARN executor 
> retrieve its own attributes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23155) YARN-aggregated executor/driver logs appear unavailable when NM is down

2019-01-30 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756667#comment-16756667
 ] 

Jungtaek Lim commented on SPARK-23155:
--

Just added the link to the effective PR here. Thanks, Vanzin, for reminding me.

> YARN-aggregated executor/driver logs appear unavailable when NM is down
> ---
>
> Key: SPARK-23155
> URL: https://issues.apache.org/jira/browse/SPARK-23155
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.2.1
>Reporter: Gera Shegalov
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> Unlike the MapReduce JobHistory Server, the Spark history server isn't rewriting 
> container log URLs to point to the aggregated yarn.log.server.url location 
> and relies on the NodeManager webUI to trigger a redirect. This fails when 
> the NM is down. Note that the NM may be down permanently after decommissioning in 
> traditional environments, or in a cloud environment such as AWS EMR 
> where either worker nodes are taken away by autoscaling or the whole cluster is 
> used to run a single job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26786) Handle to treat escaped newline characters('\r','\n') in spark csv

2019-01-30 Thread vishnuram selvaraj (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vishnuram selvaraj updated SPARK-26786:
---
Issue Type: Bug  (was: New Feature)

> Handle to treat escaped newline characters('\r','\n') in spark csv
> --
>
> Key: SPARK-26786
> URL: https://issues.apache.org/jira/browse/SPARK-26786
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: vishnuram selvaraj
>Priority: Major
>
> Some systems, such as AWS Redshift, write CSV files by escaping newline 
> characters ('\r', '\n') in addition to escaping the quote characters, when 
> they appear as part of the data.
> The Redshift documentation 
> ([https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html]) describes 
> the escaping requirements as follows:
> ESCAPE
> For CHAR and VARCHAR columns in delimited unload files, an escape character 
> ({{\}}) is placed before every occurrence of the following characters:
>  * Linefeed: {{\n}}
>  * Carriage return: {{\r}}
>  * The delimiter character specified for the unloaded data.
>  * The escape character: {{\}}
>  * A quote character: {{"}} or {{'}} (if both ESCAPE and ADDQUOTES are 
> specified in the UNLOAD command).
>  
> *Problem statement:* 
> The Spark CSV reader doesn't have an option to treat/remove the escape 
> characters in front of the newline characters in the data.
> It would really help if we could add a feature to handle escaped newline 
> characters through another parameter such as (escapeNewline = 'true'/'false').
> *Example:*
> Below are the details of my test data, set up in a file:
>  * The first record in the file has an escaped Windows newline character (\r\n)
>  * The third record has an escaped Unix newline character (\n)
>  * The fifth record has an escaped quote character (\")
> The file looks like this in the vi editor:
>  
> {code:java}
> "1","this is \^M\
> line1"^M
> "2","this is line2"^M
> "3","this is \
> line3"^M
> "4","this is \" line4"^M
> "5","this is line5"^M{code}
>  
> When I read the file using Python's csv module with an escape character 
> configured, it removes the added escape characters, as you can see below:
>  
> {code:java}
> >>> with open('/tmp/test3.csv','r') as readCsv:
> ... readFile = 
> csv.reader(readCsv,dialect='excel',escapechar='\\',quotechar='"',delimiter=',',doublequote=False)
> ... for row in readFile:
> ... print(row)
> ...
> ['1', 'this is \r\n line1']
> ['2', 'this is line2']
> ['3', 'this is \n line3']
> ['4', 'this is " line4']
> ['5', 'this is line5']
> {code}
> But if I read the same file with the Spark CSV reader, the escape characters 
> in front of the newline characters are not removed, while the escape before 
> the quote (\") is removed.
> {code:java}
> >>> redDf=spark.read.csv(path='file:///tmp/test3.csv',header='false',sep=',',quote='"',escape='\\',multiLine='true',ignoreLeadingWhiteSpace='true',ignoreTrailingWhiteSpace='true',mode='FAILFAST',inferSchema='false')
> >>> redDf.show()
> +---+--+
> |_c0| _c1|
> +---+--+
> \ 1|this is \
> line1|
> | 2| this is line2|
> | 3| this is \
> line3|
> | 4| this is " line4|
> | 5| this is line5|
> +---+--+
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-30 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756654#comment-16756654
 ] 

Ryan Blue commented on SPARK-26677:
---

Thanks, sorry about the mistake.

> Incorrect results of not(eqNullSafe) when data read from Parquet file 
> --
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 
> 18.04).
>Reporter: Michal Kapalka
>Priority: Blocker
>  Labels: correctness
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> +-+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct 
> result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}
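Until a fix is available in the affected branches, one possible workaround (a PySpark sketch, purely illustrative, assuming the same 't' parquet path as in the report) is to spell out the null-safe negation explicitly so that null rows are kept:
{code:python}
# Workaround sketch: avoid not(eqNullSafe(...)) and keep the null rows explicitly.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("t")  # same path as in the report above

# Equivalent to NOT (value <=> 'A'): rows where value is null or differs from 'A'.
df.where(F.col("value").isNull() | (F.col("value") != "A")).show()
{code}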



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26790) Yarn executor to self-retrieve log urls and attributes

2019-01-30 Thread Jungtaek Lim (JIRA)
Jungtaek Lim created SPARK-26790:


 Summary: Yarn executor to self-retrieve log urls and attributes
 Key: SPARK-26790
 URL: https://issues.apache.org/jira/browse/SPARK-26790
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


SPARK-23155 enables retrieving executor attributes from the YARN container via 
the `Container` YARN API, which works, but we still have the following concerns:

- some information is not provided (NM port, etc.)
- we provide the attributes as container launch environment variables on the 
executor, which is not the same approach used for driver logs in yarn-cluster mode

This issue tracks the effort to address the above by letting the YARN executor 
retrieve its own attributes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23155) YARN-aggregated executor/driver logs appear unavailable when NM is down

2019-01-30 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756602#comment-16756602
 ] 

Marcelo Vanzin commented on SPARK-23155:


For those looking, the patch is actually linked from SPARK-26311.

> YARN-aggregated executor/driver logs appear unavailable when NM is down
> ---
>
> Key: SPARK-23155
> URL: https://issues.apache.org/jira/browse/SPARK-23155
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.2.1
>Reporter: Gera Shegalov
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> Unlike the MapReduce JobHistory Server, the Spark history server isn't rewriting 
> container log URLs to point to the aggregated yarn.log.server.url location 
> and relies on the NodeManager webUI to trigger a redirect. This fails when 
> the NM is down. Note that the NM may be down permanently after decommissioning in 
> traditional environments, or in a cloud environment such as AWS EMR 
> where either worker nodes are taken away by autoscaling or the whole cluster is 
> used to run a single job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-30 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756609#comment-16756609
 ] 

Dongjoon Hyun commented on SPARK-26677:
---

Hi, [~rdblue]. I moved `2.4.1` from the `Fixed Versions` field to `Target 
Versions` since it hasn't been merged to `branch-2.4` yet.

> Incorrect results of not(eqNullSafe) when data read from Parquet file 
> --
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 
> 18.04).
>Reporter: Michal Kapalka
>Priority: Blocker
>  Labels: correctness
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> +-+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct 
> result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-30 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26677:
--
Target Version/s: 2.4.1
   Fix Version/s: (was: 2.4.1)

> Incorrect results of not(eqNullSafe) when data read from Parquet file 
> --
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 
> 18.04).
>Reporter: Michal Kapalka
>Priority: Blocker
>  Labels: correctness
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> +-+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct 
> result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23155) YARN-aggregated executor/driver logs appear unavailable when NM is down

2019-01-30 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756593#comment-16756593
 ] 

Jungtaek Lim commented on SPARK-23155:
--

[~vanzin]
Could we mark this as resolved with fix version 3.0.0? I can set the status 
and fix version (maybe) but cannot change the assignee.

> YARN-aggregated executor/driver logs appear unavailable when NM is down
> ---
>
> Key: SPARK-23155
> URL: https://issues.apache.org/jira/browse/SPARK-23155
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.2.1
>Reporter: Gera Shegalov
>Priority: Major
>
> Unlike the MapReduce JobHistory Server, the Spark history server isn't rewriting 
> container log URLs to point to the aggregated yarn.log.server.url location 
> and relies on the NodeManager webUI to trigger a redirect. This fails when 
> the NM is down. Note that the NM may be down permanently after decommissioning in 
> traditional environments, or in a cloud environment such as AWS EMR 
> where either worker nodes are taken away by autoscaling or the whole cluster is 
> used to run a single job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23155) YARN-aggregated executor/driver logs appear unavailable when NM is down

2019-01-30 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23155.

   Resolution: Fixed
 Assignee: Jungtaek Lim
Fix Version/s: 3.0.0

> YARN-aggregated executor/driver logs appear unavailable when NM is down
> ---
>
> Key: SPARK-23155
> URL: https://issues.apache.org/jira/browse/SPARK-23155
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.2.1
>Reporter: Gera Shegalov
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> Unlike the MapReduce JobHistory Server, the Spark history server isn't rewriting 
> container log URLs to point to the aggregated yarn.log.server.url location 
> and relies on the NodeManager webUI to trigger a redirect. This fails when 
> the NM is down. Note that the NM may be down permanently after decommissioning in 
> traditional environments, or in a cloud environment such as AWS EMR 
> where either worker nodes are taken away by autoscaling or the whole cluster is 
> used to run a single job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22798) Add multiple column support to PySpark StringIndexer

2019-01-30 Thread Huaxin Gao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756562#comment-16756562
 ] 

Huaxin Gao commented on SPARK-22798:


I will submit a PR soon. Thanks. 
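For reference, a rough sketch of what multi-column usage could look like from PySpark once this lands; the inputCols/outputCols parameter names below mirror the Scala API and are an assumption about the eventual Python API, not something that exists today:
{code:python}
# Hypothetical multi-column usage; inputCols/outputCols are assumed names that
# mirror the Scala StringIndexer API, not the current PySpark API.
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", "x"), ("b", "y"), ("a", "y")], ["cat1", "cat2"])

indexer = StringIndexer(inputCols=["cat1", "cat2"],
                        outputCols=["cat1_idx", "cat2_idx"])
indexer.fit(df).transform(df).show()
{code}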

> Add multiple column support to PySpark StringIndexer
> 
>
> Key: SPARK-22798
> URL: https://issues.apache.org/jira/browse/SPARK-22798
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26771) Make .unpersist(), .destroy() consistently non-blocking by default

2019-01-30 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26771:
--
Docs Text: The RDD and DataFrame .unpersist() method, and Broadcast 
.destroy() method, take an optional 'blocking' argument. The default was 
'false' in all cases except for (Scala) RDDs and their GraphX subclasses. The 
default is now 'false' (non-blocking) in all of these methods. Pyspark's RDD 
and Broadcast classes now have an optional 'blocking' argument as well, with 
the same behavior.  (was: The RDD and DataFrame .unpersist() method, and 
Broadcast .destroy() method, take an optional 'blocking' argument. The default 
was 'false' in all cases except for (Scala) RDDs and their GraphX subclasses. 
The default is now 'false' (non-blocking) in all of these methods.)
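A minimal PySpark sketch of the post-change behavior described in the docs text above; the explicit 'blocking' keyword arguments on the RDD and Broadcast side are the newly added Python option and assume a build that includes this change:
{code:python}
# Sketch only: assumes a Spark build that includes this change.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100)).cache()
rdd.count()
rdd.unpersist()                  # now non-blocking by default
# rdd.unpersist(blocking=True)   # opt back in to blocking cleanup

b = sc.broadcast([1, 2, 3])
b.destroy()                      # also non-blocking by default
# b.destroy(blocking=True)       # opt back in to blocking cleanup

df = spark.range(10).cache()
df.count()
df.unpersist()                   # DataFrame default was already non-blocking
{code}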

> Make .unpersist(), .destroy() consistently non-blocking by default
> --
>
> Key: SPARK-26771
> URL: https://issues.apache.org/jira/browse/SPARK-26771
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX, Spark Core
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
>  Labels: release-notes
>
> See https://issues.apache.org/jira/browse/SPARK-26728 and 
> https://github.com/apache/spark/pull/23650 . 
> RDD and DataFrame expose an .unpersist() method with optional "blocking" 
> argument. So does Broadcast.destroy(). This argument is false by default 
> except for the Scala RDD (not Pyspark) implementation and its GraphX 
> subclasses. Most usages of these methods request non-blocking behavior 
> already, and indeed, it's not typical to want to wait for the resources to be 
> freed, except in tests asserting behavior about these methods (where blocking 
> is typically requested).
> This proposes to make the default false across these methods, and adjust 
> callers to only request non-default blocking behavior where important, such 
> as in a few key tests. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26789) [k8s] pyspark needs to upload local resources to driver and executor pods

2019-01-30 Thread Oleg Frenkel (JIRA)
Oleg Frenkel created SPARK-26789:


 Summary: [k8s] pyspark needs to upload local resources to driver 
and executor pods
 Key: SPARK-26789
 URL: https://issues.apache.org/jira/browse/SPARK-26789
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes, PySpark
Affects Versions: 2.4.0
Reporter: Oleg Frenkel


Kubernetes support provided with [https://github.com/apache-spark-on-k8s/spark] 
allows local dependencies to be used with cluster deployment mode. 
Specifically, the Resource Staging Server is used to upload local dependencies 
to Kubernetes so that driver and executor pods can download them. It looks 
like the Spark 2.4.0 release does not support local dependencies. 

For example, the following command is expected to automatically upload pi.py 
from the local machine to the Kubernetes cluster and make it available for 
both driver and executor pods:

{{bin/spark-submit --conf spark.app.name=example.python.pi --master 
k8s://http://127.0.0.1:8001 --deploy-mode cluster --conf 
spark.kubernetes.container.image=spark-py:spark-2.4.0 
./examples/src/main/python/pi.py}}

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26784) Allow running driver pod as provided user

2019-01-30 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26784.

Resolution: Won't Fix

As far as I know, you can do that with pod templates, so there is no point in 
adding explicit options for this.

https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.12/#securitycontext-v1-core


> Allow running driver pod as provided user
> -
>
> Key: SPARK-26784
> URL: https://issues.apache.org/jira/browse/SPARK-26784
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Alexander Mukhopad
>Priority: Major
>
> Add the possibility to override the _Dockerfile_'s _USER_ directive by adding 
> an option (or options) to spark-submit specifying username:group and/or uid:gid
> https://github.com/apache/spark/blob/master/docs/running-on-kubernetes.md#user-identity



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26777) SQL worked in 2.3.2 and fails in 2.4.0

2019-01-30 Thread Yuri Budilov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756515#comment-16756515
 ] 

Yuri Budilov commented on SPARK-26777:
--

good luck. 

> SQL worked in 2.3.2 and fails in 2.4.0
> --
>
> Key: SPARK-26777
> URL: https://issues.apache.org/jira/browse/SPARK-26777
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Yuri Budilov
>Priority: Major
>
> The following SQL worked in Spark 2.3.2 and now fails on 2.4.0 (AWS EMR Spark).
>  The PySpark call is below:
> spark.sql("select partition_year_utc,partition_month_utc,partition_day_utc \
> from datalake_reporting.copy_of_leads_notification \
> where partition_year_utc = (select max(partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification) \
> and partition_month_utc = \
>  (select max(partition_month_utc) from 
> datalake_reporting.copy_of_leads_notification as m \
>  where \
>  m.partition_year_utc = (select max(partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification)) \
>  and partition_day_utc = (select max(d.partition_day_utc) from 
> datalake_reporting.copy_of_leads_notification as d \
>  where d.partition_month_utc = \
>  (select max(m1.partition_month_utc) from 
> datalake_reporting.copy_of_leads_notification as m1 \
>  where m1.partition_year_utc = \
>  (select max(y.partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification as y) \
>  ) \
>  ) \
>  order by 1 desc, 2 desc, 3 desc limit 1 ").show(1,False)
> Error (no data is needed to reproduce this; the failure comes from the query itself):
> py4j.protocol.Py4JJavaError: An error occurred while calling o1326.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#4495 []
>  
> Note: all 3 columns in the query are partition columns (see the bottom of the 
> schema).
>  
> The Hive schema on AWS EMR is:
>  
> CREATE EXTERNAL TABLE `copy_of_leads_notification`(
> `message.environment.siteorigin` string, `dcpheader.dcploaddateutc` string, 
> `message.id` int, `source.properties._country` string, `message.created` 
> string, `dcpheader.generatedmessageid` string, `message.tags` bigint, 
> `source.properties._enqueuedtimeutc` string, `source.properties._leadtype` 
> string, `message.itemid` string, `message.prospect.postcode` string, 
> `message.prospect.email` string, `message.referenceid` string, 
> `message.item.year` string, `message.identifier` string, 
> `dcpheader.dcploadmonthutc` string, `message.processed` string, 
> `source.properties._tenant` string, `message.item.price` string, 
> `message.subscription.confirmresponse` boolean, `message.itemtype` string, 
> `message.prospect.lastname` string, `message.subscription.insurancequote` 
> boolean, `source.exchangename` string, 
> `message.prospect.identificationnumbers` bigint, 
> `message.environment.ipaddress` string, `dcpheader.dcploaddayutc` string, 
> `source.properties._itemtype` string, `source.properties._requesttype` 
> string, `message.item.make` string, `message.prospect.firstname` string, 
> `message.subscription.survey` boolean, `message.prospect.homephone` string, 
> `message.extendedproperties` bigint, `message.subscription.financequote` 
> boolean, `message.uniqueidentifier` string, `source.properties._id` string, 
> `dcpheader.sourcemessageguid` string, `message.requesttype` string, 
> `source.routingkey` string, `message.service` string, `message.item.model` 
> string, `message.environment.pagesource` string, `source.source` string, 
> `message.sellerid` string, `partition_date_utc` string, 
> `message.selleridentifier` string, `message.subscription.newsletter` boolean, 
> `dcpheader.dcploadyearutc` string, `message.leadtype` string, 
> `message.history` bigint, `message.callconnect.calloutcome` string, 
> `message.callconnect.datecreatedutc` string, 
> `message.callconnect.callrecordingurl` string, 
> `message.callconnect.transferoutcome` string, 
> `message.callconnect.hiderecording` boolean, 
> `message.callconnect.callstartutc` string, `message.callconnect.code` string, 
> `message.callconnect.callduration` string, `message.fraudnetinfo` string, 
> `message.callconnect.answernumber` string, `message.environment.sourcedevice` 
> string, `message.comments` string, `message.fraudinfo.servervariables` 
> bigint, `message.callconnect.servicenumber` string, 
> `message.callconnect.callid` string, `message.callconnect.voicemailurl` 
> string, `message.item.stocknumber` string, 
> `message.callconnect.answerduration` string, `message.callconnect.callendutc` 
> string, `message.item.series` string, `message.item.detailsurl` string, 
> `message.item.pricetype` string, `message.item.description` string, 
> `message.item.colour` string, `message.item.badge` string, 
> `message.item.odometer` string, 
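One possible workaround for queries like the one above, sketched purely as an illustration in PySpark (it assumes access to the same datalake_reporting.copy_of_leads_notification table), is to resolve the nested maxima step by step with separate aggregations and then filter with plain literals instead of scalar subqueries:
{code:python}
# Workaround sketch: compute the latest partition values first, then filter
# with literals so the plan contains no nested scalar subqueries.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.table("datalake_reporting.copy_of_leads_notification")

max_year = df.agg(F.max("partition_year_utc")).first()[0]
max_month = (df.where(F.col("partition_year_utc") == max_year)
               .agg(F.max("partition_month_utc")).first()[0])
max_day = (df.where((F.col("partition_year_utc") == max_year) &
                    (F.col("partition_month_utc") == max_month))
             .agg(F.max("partition_day_utc")).first()[0])

(df.where((F.col("partition_year_utc") == max_year) &
          (F.col("partition_month_utc") == max_month) &
          (F.col("partition_day_utc") == max_day))
   .select("partition_year_utc", "partition_month_utc", "partition_day_utc")
   .show(1, False))
{code}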

[jira] [Created] (SPARK-26788) Remove SchedulerExtensionService

2019-01-30 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-26788:
--

 Summary: Remove SchedulerExtensionService
 Key: SPARK-26788
 URL: https://issues.apache.org/jira/browse/SPARK-26788
 Project: Spark
  Issue Type: Task
  Components: YARN
Affects Versions: 3.0.0
Reporter: Marcelo Vanzin


This was added in SPARK-11314, but it has a few issues:

- it's in the YARN module, which is not a public Spark API.
- because of that it's also YARN specific
- it's not used by Spark in any way other than providing this as an extension 
point

For the latter, it probably makes sense to use listeners instead, and enhance 
the listener interface if it's lacking.

Also pinging [~steve_l] since he added that code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26737) Executor/Task STDERR & STDOUT log urls are not correct in Yarn deployment mode

2019-01-30 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-26737:
---
Fix Version/s: 3.0.0

> Executor/Task STDERR & STDOUT log urls are not correct in Yarn deployment mode
> --
>
> Key: SPARK-26737
> URL: https://issues.apache.org/jira/browse/SPARK-26737
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI, YARN
>Affects Versions: 3.0.0
>Reporter: Devaraj K
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> The base of the STDERR & STDOUT log URLs is being generated like the following, 
> which also includes the key,
> {code}
> http://ip:8042/node/containerlogs/container_1544212645385_0252_01_01/(SPARK_USER,
>  devaraj)
> {code}
> {code}
> http://ip:8042/node/containerlogs/container_1544212645385_0252_01_01/(USER,
>  devaraj)
> {code}
> Instead of 
> {code}http://ip:8042/node/containerlogs/container_1544212645385_0251_01_02/devaraj
>  {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26737) Executor/Task STDERR & STDOUT log urls are not correct in Yarn deployment mode

2019-01-30 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26737.

Resolution: Fixed
  Assignee: Jungtaek Lim

The patch for SPARK-26311 ended up fixing this issue too.

> Executor/Task STDERR & STDOUT log urls are not correct in Yarn deployment mode
> --
>
> Key: SPARK-26737
> URL: https://issues.apache.org/jira/browse/SPARK-26737
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI, YARN
>Affects Versions: 3.0.0
>Reporter: Devaraj K
>Assignee: Jungtaek Lim
>Priority: Major
>
> The base of the STDERR & STDOUT log URLs is being generated like the following, 
> which also includes the key,
> {code}
> http://ip:8042/node/containerlogs/container_1544212645385_0252_01_01/(SPARK_USER,
>  devaraj)
> {code}
> {code}
> http://ip:8042/node/containerlogs/container_1544212645385_0252_01_01/(USER,
>  devaraj)
> {code}
> Instead of 
> {code}http://ip:8042/node/containerlogs/container_1544212645385_0251_01_02/devaraj
>  {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2019-01-30 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-26311:
--

Assignee: Jungtaek Lim

> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>
> Spark has been setting static log URLs for YARN applications, which point to 
> the NodeManager webapp. Normally this works for both running and finished 
> apps, but there are also other approaches to maintaining application logs, 
> such as an external log service, which avoids the application log URL 
> becoming a dead link when the NodeManager is not accessible (node 
> decommissioned, elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, 
> which end users can set to point application logs to an external log 
> service.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26787) Fix standardization error message in WeightedLeastSquares

2019-01-30 Thread Brian Scannell (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Scannell updated SPARK-26787:
---
Environment: 
Tested in Spark 2.4.0 on DataBricks running in 5.1 ML Beta.

 

  was:
Tested in Spark 2.4.0 on DataBricks running in 5.1 ML Beta. The following 
Python code will replicate the error. 
{code:java}
import pandas as pd
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

df = pd.DataFrame({'foo': [1,2,3], 'bar':[4,5,6],'label':[1,1,1]})
spark_df = spark.createDataFrame(df)

vectorAssembler = VectorAssembler(inputCols = ['foo', 'bar'], outputCol = 
'features')
train_sdf = vectorAssembler.transform(spark_df).select(['features', 'label'])

lr = LinearRegression(featuresCol='features', labelCol='label', 
fitIntercept=False, standardization=False, regParam=1e-4)

lr_model = lr.fit(train_sdf)
{code}
 

For context, the reason someone might want to do this is if they are trying to 
fit a model to estimate components of a fixed total. The label indicates the 
total is always 100%, but the components vary. For example, trying to estimate 
the unknown weights of different quantities of substances in a series of full 
bins. 

 

 

Description: 
There is an error message in WeightedLeastSquares.scala that is incorrect and 
thus not very helpful for diagnosing an issue. The problem arises when doing 
regularized LinearRegression on a constant label. Even when the parameter 
standardization=False, the error will falsely state that standardization was 
set to True:

{{The standard deviation of the label is zero. Model cannot be regularized with 
standardization=true}}

This is because under the hood, LinearRegression automatically sets a parameter 
standardizeLabel=True. This was chosen for consistency with GLMNet, although 
WeightedLeastSquares is written to allow standardizeLabel to be set either way 
and work (although the public LinearRegression API does not allow it).

 

I will submit a pull request with my suggested wording.

 

Relevant:

[https://github.com/apache/spark/pull/10702]

[https://github.com/apache/spark/pull/10274/commits/d591989f7383b713110750f80b2720bcf24814b5]
 

 

The following Python code will replicate the error. 
{code:java}
import pandas as pd
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

df = pd.DataFrame({'foo': [1,2,3], 'bar':[4,5,6],'label':[1,1,1]})
spark_df = spark.createDataFrame(df)

vectorAssembler = VectorAssembler(inputCols = ['foo', 'bar'], outputCol = 
'features')
train_sdf = vectorAssembler.transform(spark_df).select(['features', 'label'])

lr = LinearRegression(featuresCol='features', labelCol='label', 
fitIntercept=False, standardization=False, regParam=1e-4)

lr_model = lr.fit(train_sdf)
{code}
 

For context, the reason someone might want to do this is if they are trying to 
fit a model to estimate components of a fixed total. The label indicates the 
total is always 100%, but the components vary. For example, trying to estimate 
the unknown weights of different quantities of substances in a series of full 
bins. 

  was:
There is an error message in WeightedLeastSquares.scala that is incorrect and 
thus not very helpful for diagnosing an issue. The problem arises when doing 
regularized LinearRegression on a constant label. Even when the parameter 
standardization=False, the error will falsely state that standardization was 
set to True:

{{The standard deviation of the label is zero. Model cannot be regularized with 
standardization=true}}

This is because under the hood, LinearRegression automatically sets a parameter 
standardizeLabel=True. This was chosen for consistency with GLMNet, although 
WeightedLeastSquares is written to allow standardizeLabel to be set either way 
and work (although the public LinearRegression API does not allow it).

 

I will submit a pull request with my suggested wording.

 

Relevant:

[https://github.com/apache/spark/pull/10702]

[https://github.com/apache/spark/pull/10274/commits/d591989f7383b713110750f80b2720bcf24814b5]
 


> Fix standardization error message in WeightedLeastSquares
> -
>
> Key: SPARK-26787
> URL: https://issues.apache.org/jira/browse/SPARK-26787
> Project: Spark
>  Issue Type: Documentation
>  Components: MLlib
>Affects Versions: 2.3.0, 2.3.1, 2.4.0
> Environment: Tested in Spark 2.4.0 on DataBricks running in 5.1 ML 
> Beta.
>  
>Reporter: Brian Scannell
>Priority: Trivial
>
> There is an error message in WeightedLeastSquares.scala that is incorrect and 
> thus not very helpful for diagnosing an issue. The problem arises when doing 
> regularized LinearRegression on a constant label. Even when the parameter 
> standardization=False, the error will 

[jira] [Assigned] (SPARK-26787) Fix standardization error message in WeightedLeastSquares

2019-01-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26787:


Assignee: Apache Spark

> Fix standardization error message in WeightedLeastSquares
> -
>
> Key: SPARK-26787
> URL: https://issues.apache.org/jira/browse/SPARK-26787
> Project: Spark
>  Issue Type: Documentation
>  Components: MLlib
>Affects Versions: 2.3.0, 2.3.1, 2.4.0
> Environment: Tested in Spark 2.4.0 on DataBricks running in 5.1 ML 
> Beta. The following Python code will replicate the error. 
> {code:java}
> import pandas as pd
> from pyspark.ml.feature import VectorAssembler
> from pyspark.ml.regression import LinearRegression
> df = pd.DataFrame({'foo': [1,2,3], 'bar':[4,5,6],'label':[1,1,1]})
> spark_df = spark.createDataFrame(df)
> vectorAssembler = VectorAssembler(inputCols = ['foo', 'bar'], outputCol = 
> 'features')
> train_sdf = vectorAssembler.transform(spark_df).select(['features', 'label'])
> lr = LinearRegression(featuresCol='features', labelCol='label', 
> fitIntercept=False, standardization=False, regParam=1e-4)
> lr_model = lr.fit(train_sdf)
> {code}
>  
> For context, the reason someone might want to do this is if they are trying 
> to fit a model to estimate components of a fixed total. The label indicates 
> the total is always 100%, but the components vary. For example, trying to 
> estimate the unknown weights of different quantities of substances in a 
> series of full bins. 
>  
>  
>Reporter: Brian Scannell
>Assignee: Apache Spark
>Priority: Trivial
>
> There is an error message in WeightedLeastSquares.scala that is incorrect and 
> thus not very helpful for diagnosing an issue. The problem arises when doing 
> regularized LinearRegression on a constant label. Even when the parameter 
> standardization=False, the error will falsely state that standardization was 
> set to True:
> {{The standard deviation of the label is zero. Model cannot be regularized 
> with standardization=true}}
> This is because under the hood, LinearRegression automatically sets a 
> parameter standardizeLabel=True. This was chosen for consistency with GLMNet, 
> although WeightedLeastSquares is written to allow standardizeLabel to be set 
> either way and work (although the public LinearRegression API does not allow 
> it).
>  
> I will submit a pull request with my suggested wording.
>  
> Relevant:
> [https://github.com/apache/spark/pull/10702]
> [https://github.com/apache/spark/pull/10274/commits/d591989f7383b713110750f80b2720bcf24814b5]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26787) Fix standardization error message in WeightedLeastSquares

2019-01-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26787:


Assignee: (was: Apache Spark)

> Fix standardization error message in WeightedLeastSquares
> -
>
> Key: SPARK-26787
> URL: https://issues.apache.org/jira/browse/SPARK-26787
> Project: Spark
>  Issue Type: Documentation
>  Components: MLlib
>Affects Versions: 2.3.0, 2.3.1, 2.4.0
> Environment: Tested in Spark 2.4.0 on DataBricks running in 5.1 ML 
> Beta. The following Python code will replicate the error. 
> {code:java}
> import pandas as pd
> from pyspark.ml.feature import VectorAssembler
> from pyspark.ml.regression import LinearRegression
> df = pd.DataFrame({'foo': [1,2,3], 'bar':[4,5,6],'label':[1,1,1]})
> spark_df = spark.createDataFrame(df)
> vectorAssembler = VectorAssembler(inputCols = ['foo', 'bar'], outputCol = 
> 'features')
> train_sdf = vectorAssembler.transform(spark_df).select(['features', 'label'])
> lr = LinearRegression(featuresCol='features', labelCol='label', 
> fitIntercept=False, standardization=False, regParam=1e-4)
> lr_model = lr.fit(train_sdf)
> {code}
>  
> For context, the reason someone might want to do this is if they are trying 
> to fit a model to estimate components of a fixed total. The label indicates 
> the total is always 100%, but the components vary. For example, trying to 
> estimate the unknown weights of different quantities of substances in a 
> series of full bins. 
>  
>  
>Reporter: Brian Scannell
>Priority: Trivial
>
> There is an error message in WeightedLeastSquares.scala that is incorrect and 
> thus not very helpful for diagnosing an issue. The problem arises when doing 
> regularized LinearRegression on a constant label. Even when the parameter 
> standardization=False, the error will falsely state that standardization was 
> set to True:
> {{The standard deviation of the label is zero. Model cannot be regularized 
> with standardization=true}}
> This is because under the hood, LinearRegression automatically sets a 
> parameter standardizeLabel=True. This was chosen for consistency with GLMNet, 
> although WeightedLeastSquares is written to allow standardizeLabel to be set 
> either way and work (although the public LinearRegression API does not allow 
> it).
>  
> I will submit a pull request with my suggested wording.
>  
> Relevant:
> [https://github.com/apache/spark/pull/10702]
> [https://github.com/apache/spark/pull/10274/commits/d591989f7383b713110750f80b2720bcf24814b5]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26775) Update Jenkins nodes to support local volumes for K8s integration tests

2019-01-30 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756415#comment-16756415
 ] 

shane knapp edited comment on SPARK-26775 at 1/30/19 7:23 PM:
--

ill install the latest version on my test worker and run a couple of builds to 
check for compatibility...  it should be fine...

...i hope.  :)

this is definitely out of scope for this issue btw and it won't be a gating 
factor for resolving this.

also, i launched a test for the new k8s version:
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/7593/

-looks good so far!- the build failed, sadly:

{noformat}
- Test PVs with local storage *** FAILED ***
  The code passed to eventually never returned normally. Attempted 64 times 
over 2.01437391135 minutes. Last failure message: container not found 
("spark-kubernetes-driver"). (PVTestsSuite.scala:119)
{noformat}



was (Author: shaneknapp):
ill install the latest version on my test worker and run a couple of builds to 
check for compatibility...  it should be fine...

...i hope.  :)

this is definitely out of scope for this issue btw and it won't be a gating 
factor for resolving this.

also, i launched a test for the new k8s version:
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/7593/

looks good so far!

> Update Jenkins nodes to support local volumes for K8s integration tests
> ---
>
> Key: SPARK-26775
> URL: https://issues.apache.org/jira/browse/SPARK-26775
> Project: Spark
>  Issue Type: Improvement
>  Components: jenkins, Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Assignee: shane knapp
>Priority: Major
>
> The current version of Minikube on the test machines does not properly support 
> the local persistent volume feature required by this PR: 
> [https://github.com/apache/spark/pull/23514].
> We get this error:
> "spec.local: Forbidden: Local volumes are disabled by feature-gate, 
> metadata.annotations: Required value: Local volume requires node affinity"
> This is probably due to this: 
> [https://github.com/rancher/rancher/issues/13864] which implies that we need 
> to update to 1.10+ as described in 
> [https://kubernetes.io/docs/concepts/storage/volumes/#local]. Fabric8io 
> client is already updated in the PR mentioned at the beginning.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26787) Fix standardization error message in WeightedLeastSquares

2019-01-30 Thread Brian Scannell (JIRA)
Brian Scannell created SPARK-26787:
--

 Summary: Fix standardization error message in WeightedLeastSquares
 Key: SPARK-26787
 URL: https://issues.apache.org/jira/browse/SPARK-26787
 Project: Spark
  Issue Type: Documentation
  Components: MLlib
Affects Versions: 2.4.0, 2.3.1, 2.3.0
 Environment: Tested in Spark 2.4.0 on DataBricks running in 5.1 ML 
Beta. The following Python code will replicate the error. 
{code:java}
import pandas as pd
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

df = pd.DataFrame({'foo': [1,2,3], 'bar':[4,5,6],'label':[1,1,1]})
spark_df = spark.createDataFrame(df)

vectorAssembler = VectorAssembler(inputCols = ['foo', 'bar'], outputCol = 
'features')
train_sdf = vectorAssembler.transform(spark_df).select(['features', 'label'])

lr = LinearRegression(featuresCol='features', labelCol='label', 
fitIntercept=False, standardization=False, regParam=1e-4)

lr_model = lr.fit(train_sdf)
{code}
 

For context, the reason someone might want to do this is if they are trying to 
fit a model to estimate components of a fixed total. The label indicates the 
total is always 100%, but the components vary. For example, trying to estimate 
the unknown weights of different quantities of substances in a series of full 
bins. 

 

 
Reporter: Brian Scannell


There is an error message in WeightedLeastSquares.scala that is incorrect and 
thus not very helpful for diagnosing an issue. The problem arises when doing 
regularized LinearRegression on a constant label. Even when the parameter 
standardization=False, the error will falsely state that standardization was 
set to True:

{{The standard deviation of the label is zero. Model cannot be regularized with 
standardization=true}}

This is because under the hood, LinearRegression automatically sets a parameter 
standardizeLabel=True. This was chosen for consistency with GLMNet, although 
WeightedLeastSquares is written to allow standardizeLabel to be set either way 
and work (although the public LinearRegression API does not allow it).

 

I will submit a pull request with my suggested wording.

 

Relevant:

[https://github.com/apache/spark/pull/10702]

[https://github.com/apache/spark/pull/10274/commits/d591989f7383b713110750f80b2720bcf24814b5]
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26718) Fixed integer overflow in SS kafka rateLimit calculation

2019-01-30 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26718:
--
Fix Version/s: 2.3.3

> Fixed integer overflow in SS kafka rateLimit calculation
> 
>
> Key: SPARK-26718
> URL: https://issues.apache.org/jira/browse/SPARK-26718
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Ryne Yang
>Assignee: Ryne Yang
>Priority: Major
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> When running Spark Structured Streaming using the library `"org.apache.spark" %% 
> "spark-sql-kafka-0-10" % "2.4.0"`, we keep getting an error regarding current 
> offset fetching:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 0.0 (TID 3, qa2-hdp-4.acuityads.org, executor 2): 
> java.lang.AssertionError: assertion failed: latest offset -9223372036854775808 does not equal -1
> at scala.Predef$.assert(Predef.scala:170)
> at 
> org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartitionReader.resolveRange(KafkaMicroBatchReader.scala:371)
> at 
> org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartitionReader.<init>(KafkaMicroBatchReader.scala:329)
> at 
> org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartition.createPartitionReader(KafkaMicroBatchReader.scala:314)
> at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceRDD.compute(DataSourceRDD.scala:42)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> For some reason, it looks like fetchLatestOffset returned Long.MIN_VALUE for 
> one of the partitions. I checked the Structured Streaming checkpoint and it 
> was correct; it is currentAvailableOffset that was set to Long.MIN_VALUE.
> Kafka broker version: 1.1.0.
> Library we used:
> {{libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.0"}}
> How to reproduce: we started a structured streaming query subscribed to a 
> topic with 4 partitions, then produced some messages into the topic; the job 
> crashed and logged the stack trace above.
> The committed offsets also look fine, as we can see in the logs: 
> {code:java}
> === Streaming Query ===
> Identifier: [id = c46c67ee-3514-4788-8370-a696837b21b1, runId = 
> 31878627-d473-4ee8-955d-d4d3f3f45eb9]
> Current Committed Offsets: {KafkaV2[Subscribe[REVENUEEVENT]]: 
> {"REVENUEEVENT":{"0":1}}}
> Current Available Offsets: {KafkaV2[Subscribe[REVENUEEVENT]]: 
> {"REVENUEEVENT":{"0":-9223372036854775808}}}
> {code}
> So Structured Streaming recorded the correct value for partition 0, but the 
> current available offset returned from Kafka shows Long.MIN_VALUE. 
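
As an illustration of the kind of arithmetic that can overflow here, below is a 
hedged sketch of prorating a rate limit over an offset range in an 
overflow-safe way. It is not the actual Spark patch; the method and variable 
names are made up for the example.

{code:scala}
// Hypothetical sketch (not the real fix): multiplying a per-trigger record
// limit by a partition size in Long arithmetic can exceed Long.MaxValue and
// wrap to a negative value, which then shows up as a nonsensical end offset.
def prorateLimit(limit: Long, partitionSize: Long, totalSize: Long): Long = {
  // Compute the partition's share with a Double intermediate to avoid Long
  // overflow, then clamp the result into [0, partitionSize].
  val share = limit * (partitionSize.toDouble / totalSize)
  math.min(partitionSize, math.max(0L, share.toLong))
}
{code}

Clamping also keeps a Long.MIN_VALUE sentinel from leaking into the computed 
range, which is the symptom reported above.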



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26753) Log4j customization not working for spark-shell

2019-01-30 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26753.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23675
[https://github.com/apache/spark/pull/23675]

> Log4j customization not working for spark-shell
> ---
>
> Key: SPARK-26753
> URL: https://issues.apache.org/jira/browse/SPARK-26753
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Ankur Gupta
>Assignee: Ankur Gupta
>Priority: Major
> Fix For: 3.0.0
>
>
> It's pretty common to add log4j entries to customize the level of specific 
> loggers. e.g. adding the following to log4j.properties:
> {code:java}
> log4j.logger.org.apache.spark.deploy.security=DEBUG
> {code}
> This works fine in previous releases but not in the current build when using 
> spark-shell. This is probably caused by SPARK-25118.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26753) Log4j customization not working for spark-shell

2019-01-30 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-26753:
--

Assignee: Ankur Gupta

> Log4j customization not working for spark-shell
> ---
>
> Key: SPARK-26753
> URL: https://issues.apache.org/jira/browse/SPARK-26753
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Ankur Gupta
>Assignee: Ankur Gupta
>Priority: Major
>
> It's pretty common to add log4j entries to customize the level of specific 
> loggers. e.g. adding the following to log4j.properties:
> {code:java}
> log4j.logger.org.apache.spark.deploy.security=DEBUG
> {code}
> This works fine in previous releases but not in the current build when using 
> spark-shell. This is probably caused by SPARK-25118.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26775) Update Jenkins nodes to support local volumes for K8s integration tests

2019-01-30 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756415#comment-16756415
 ] 

shane knapp commented on SPARK-26775:
-

i'll install the latest version on my test worker and run a couple of builds to 
check for compatibility...  it should be fine...

...i hope.  :)

this is definitely out of scope for this issue btw and it won't be a gating 
factor for resolving this.

also, i launched a test for the new k8s version:
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/7593/

looks good so far!

> Update Jenkins nodes to support local volumes for K8s integration tests
> ---
>
> Key: SPARK-26775
> URL: https://issues.apache.org/jira/browse/SPARK-26775
> Project: Spark
>  Issue Type: Improvement
>  Components: jenkins, Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Assignee: shane knapp
>Priority: Major
>
> The current version of Minikube on the test machines does not properly support 
> the local persistent volume feature required by this PR: 
> [https://github.com/apache/spark/pull/23514].
> We get this error:
> "spec.local: Forbidden: Local volumes are disabled by feature-gate, 
> metadata.annotations: Required value: Local volume requires node affinity"
> This is probably due to this: 
> [https://github.com/rancher/rancher/issues/13864] which implies that we need 
> to update to 1.10+ as described in 
> [https://kubernetes.io/docs/concepts/storage/volumes/#local]. Fabric8io 
> client is already updated in the PR mentioned at the beginning.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26775) Update Jenkins nodes to support local volumes for K8s integration tests

2019-01-30 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756388#comment-16756388
 ] 

shane knapp commented on SPARK-26775:
-

all done!

{noformat}
-bash-4.1$ pssh -i -h ubuntu_workers.txt "minikube config set 
kubernetes-version v1.10.0"
[1] 10:24:44 [SUCCESS] amp-jenkins-staging-worker-02
[2] 10:24:44 [SUCCESS] amp-jenkins-staging-worker-01
[3] 10:24:44 [SUCCESS] research-jenkins-worker-07
[4] 10:24:44 [SUCCESS] research-jenkins-worker-08
{noformat}

...and for testing:

{noformat}
jenkins@ubuntu-testing:~$ minikube config set kubernetes-version v1.10.0
jenkins@ubuntu-testing:~$ minikube --vm-driver=kvm2 start --memory 6000 --cpus 8
Starting local Kubernetes v1.10.0 cluster...
Starting VM...
Getting VM IP address...
Moving files into cluster...
Downloading localkube binary
 173.54 MB / 173.54 MB [] 100.00% 0s
 0 B / 65 B [--]   0.00%
 65 B / 65 B [==] 100.00% 
0sSetting up certs...
Connecting to cluster...
Setting up kubeconfig...
Starting cluster components...
Kubectl is now configured to use the cluster.
Loading cached images from config file.
{noformat}


> Update Jenkins nodes to support local volumes for K8s integration tests
> ---
>
> Key: SPARK-26775
> URL: https://issues.apache.org/jira/browse/SPARK-26775
> Project: Spark
>  Issue Type: Improvement
>  Components: jenkins, Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Assignee: shane knapp
>Priority: Major
>
> The current version of Minikube on the test machines does not properly support 
> the local persistent volume feature required by this PR: 
> [https://github.com/apache/spark/pull/23514].
> We get this error:
> "spec.local: Forbidden: Local volumes are disabled by feature-gate, 
> metadata.annotations: Required value: Local volume requires node affinity"
> This is probably due to this: 
> [https://github.com/rancher/rancher/issues/13864] which implies that we need 
> to update to 1.10+ as described in 
> [https://kubernetes.io/docs/concepts/storage/volumes/#local]. Fabric8io 
> client is already updated in the PR mentioned at the beginning.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26775) Update Jenkins nodes to support local volumes for K8s integration tests

2019-01-30 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756399#comment-16756399
 ] 

Stavros Kontopoulos commented on SPARK-26775:
-

Yeah v0.25.0 is ancient history I guess :)

> Update Jenkins nodes to support local volumes for K8s integration tests
> ---
>
> Key: SPARK-26775
> URL: https://issues.apache.org/jira/browse/SPARK-26775
> Project: Spark
>  Issue Type: Improvement
>  Components: jenkins, Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Assignee: shane knapp
>Priority: Major
>
> The current version of Minikube on the test machines does not properly support 
> the local persistent volume feature required by this PR: 
> [https://github.com/apache/spark/pull/23514].
> We get this error:
> "spec.local: Forbidden: Local volumes are disabled by feature-gate, 
> metadata.annotations: Required value: Local volume requires node affinity"
> This is probably due to this: 
> [https://github.com/rancher/rancher/issues/13864] which implies that we need 
> to update to 1.10+ as described in 
> [https://kubernetes.io/docs/concepts/storage/volumes/#local]. Fabric8io 
> client is already updated in the PR mentioned at the beginning.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26775) Update Jenkins nodes to support local volumes for K8s integration tests

2019-01-30 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756390#comment-16756390
 ] 

shane knapp commented on SPARK-26775:
-

we're quite behind w/the minikube version however:  v0.25.0

most recent is v0.33.1

> Update Jenkins nodes to support local volumes for K8s integration tests
> ---
>
> Key: SPARK-26775
> URL: https://issues.apache.org/jira/browse/SPARK-26775
> Project: Spark
>  Issue Type: Improvement
>  Components: jenkins, Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Assignee: shane knapp
>Priority: Major
>
> The current version of Minikube on the test machines does not properly support 
> the local persistent volume feature required by this PR: 
> [https://github.com/apache/spark/pull/23514].
> We get this error:
> "spec.local: Forbidden: Local volumes are disabled by feature-gate, 
> metadata.annotations: Required value: Local volume requires node affinity"
> This is probably due to this: 
> [https://github.com/rancher/rancher/issues/13864] which implies that we need 
> to update to 1.10+ as described in 
> [https://kubernetes.io/docs/concepts/storage/volumes/#local]. Fabric8io 
> client is already updated in the PR mentioned at the beginning.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-30 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-26677:
--
Fix Version/s: 2.4.1

> Incorrect results of not(eqNullSafe) when data read from Parquet file 
> --
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 
> 18.04).
>Reporter: Michal Kapalka
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.1
>
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> +-+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct 
> result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}
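
If the wrong result does come from the pushed-down Parquet translation of 
not(eqNullSafe), a possible stop-gap until a fixed release is to disable 
Parquet filter pushdown for the affected read. This is only a hedged sketch 
(spark-shell, Scala API), not an official recommendation; verify the output 
before relying on it.

{code:scala}
// Hypothetical workaround sketch: turn off Parquet filter pushdown so the
// predicate is evaluated by Spark instead of being pushed into the reader,
// then restore the setting afterwards.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show()
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
{code}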



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26775) Update Jenkins nodes to support local volumes for K8s integration tests

2019-01-30 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756375#comment-16756375
 ] 

shane knapp commented on SPARK-26775:
-

okie dokie, i should get this sorted today.

> Update Jenkins nodes to support local volumes for K8s integration tests
> ---
>
> Key: SPARK-26775
> URL: https://issues.apache.org/jira/browse/SPARK-26775
> Project: Spark
>  Issue Type: Improvement
>  Components: jenkins, Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Assignee: shane knapp
>Priority: Major
>
> The current version of Minikube on the test machines does not properly support 
> the local persistent volume feature required by this PR: 
> [https://github.com/apache/spark/pull/23514].
> We get this error:
> "spec.local: Forbidden: Local volumes are disabled by feature-gate, 
> metadata.annotations: Required value: Local volume requires node affinity"
> This is probably due to this: 
> [https://github.com/rancher/rancher/issues/13864] which implies that we need 
> to update to 1.10+ as described in 
> [https://kubernetes.io/docs/concepts/storage/volumes/#local]. Fabric8io 
> client is already updated in the PR mentioned at the beginning.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26784) Allow running driver pod as provided user

2019-01-30 Thread Alexander Mukhopad (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Mukhopad updated SPARK-26784:
---
Priority: Major  (was: Minor)

> Allow running driver pod as provided user
> -
>
> Key: SPARK-26784
> URL: https://issues.apache.org/jira/browse/SPARK-26784
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Alexander Mukhopad
>Priority: Major
>
> Add the possibility to override the _Dockerfile_'s _USER_ directive by adding 
> an option (or options) to spark-submit, specifying username:group and/or uid:gid.
> https://github.com/apache/spark/blob/master/docs/running-on-kubernetes.md#user-identity



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26786) Handle to treat escaped newline characters('\r','\n') in spark csv

2019-01-30 Thread vishnuram selvaraj (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vishnuram selvaraj updated SPARK-26786:
---
Description: 
There are some systems, like AWS Redshift, which write CSV files by escaping 
newline characters ('\r','\n') in addition to escaping the quote characters, if 
they appear as part of the data.

Redshift documentation: 
[https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html]. Below is its 
description of the escaping requirements:

ESCAPE

For CHAR and VARCHAR columns in delimited unload files, an escape character 
({{\}}) is placed before every occurrence of the following characters:
 * Linefeed: {{\n}}

 * Carriage return: {{\r}}

 * The delimiter character specified for the unloaded data.

 * The escape character: {{\}}

 * A quote character: {{"}} or {{'}} (if both ESCAPE and ADDQUOTES are 
specified in the UNLOAD command).

 

*Problem statement:* 

But the Spark CSV reader doesn't have an option to handle or remove the escape 
characters in front of the newline characters in the data.

It would really help if we could add a feature to handle the escaped newline 
characters through another parameter like (escapeNewline = 'true/false').

*Example:*

Below are the details of my test data set up in a file.
 * The first record in that file has escaped windows newline character (\\r\\n)
 * The third record in that file has escaped unix newline character (\\n)
 * The fifth record in that file has the escaped quote character (")

the file looks like below in vi editor:

 
{code:java}
"1","this is \^M\
line1"^M
"2","this is line2"^M
"3","this is \
line3"^M
"4","this is \" line4"^M
"5","this is line5"^M{code}
 

When I read the file in python's csv module with escape, it is able to remove 
the added escape characters as you can see below,

 
{code:java}
>>> with open('/tmp/test3.csv','r') as readCsv:
... readFile = 
csv.reader(readCsv,dialect='excel',escapechar='\\',quotechar='"',delimiter=',',doublequote=False)
... for row in readFile:
... print(row)
...
['1', 'this is \r\n line1']
['2', 'this is line2']
['3', 'this is \n line3']
['4', 'this is " line4']
['5', 'this is line5']
{code}
But if I read the same file with the Spark CSV reader, the escape characters in 
front of the newline characters are not removed, while the escape before the (") 
is removed.
{code:java}
>>> redDf=spark.read.csv(path='file:///tmp/test3.csv',header='false',sep=',',quote='"',escape='\\',multiLine='true',ignoreLeadingWhiteSpace='true',ignoreTrailingWhiteSpace='true',mode='FAILFAST',inferSchema='false')
>>> redDf.show()
+---+--+
|_c0| _c1|
+---+--+
\ 1|this is \
line1|
| 2| this is line2|
| 3| this is \
line3|
| 4| this is " line4|
| 5| this is line5|
+---+--+
{code}
 

  was:
There are some systems like AWS redshift which writes csv files by escaping 
newline characters('\r','\n') in addition to escaping the quote characters, if 
they come as part of the data.

Redshift documentation 
link([https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html)] and below 
is their mention of escaping requirements in the mentioned link

ESCAPE

For CHAR and VARCHAR columns in delimited unload files, an escape character 
({{\}}) is placed before every occurrence of the following characters:
 * Linefeed: {{\n}}

 * Carriage return: {{\r}}

 * The delimiter character specified for the unloaded data.

 * The escape character: {{\}}

 * A quote character: {{"}} or {{'}} (if both ESCAPE and ADDQUOTES are 
specified in the UNLOAD command).

 

 

But the spark CSV reader doesn't have a handle to treat/remove the escape 
characters infront of the newline characters in the data.

It would really help if we can add a feature to handle the escaped newline 
characters through another parameter like (escapeNewline = 'true/false').

*Example:*

Below are the details of my test data set up in a file.
 * The first record in that file has escaped windows newline character (\\r
n)
 * The third record in that file has escaped unix newline character (
n)
 * The fifth record in that file has the escaped quote character (")

the file looks like below in vi editor:

 
{code:java}
"1","this is \^M\
line1"^M
"2","this is line2"^M
"3","this is \
line3"^M
"4","this is \" line4"^M
"5","this is line5"^M{code}
 

When I read the file in python's csv module with escape, it is able to remove 
the added escape characters as you can see below,

 
{code:java}
>>> with open('/tmp/test3.csv','r') as readCsv:
... readFile = 
csv.reader(readCsv,dialect='excel',escapechar='\\',quotechar='"',delimiter=',',doublequote=False)
... for row in readFile:
... print(row)
...
['1', 'this is \r\n line1']
['2', 'this is line2']
['3', 'this is \n line3']
['4', 'this is " line4']
['5', 'this is line5']
{code}
But if I read the same file in spark-csv reader, the escape characters infront 
of the newline characters are 

[jira] [Updated] (SPARK-26786) Handle to treat escaped newline characters('\r','\n') in spark csv

2019-01-30 Thread vishnuram selvaraj (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vishnuram selvaraj updated SPARK-26786:
---
Description: 
There are some systems, like AWS Redshift, which write CSV files by escaping 
newline characters ('\r','\n') in addition to escaping the quote characters, if 
they appear as part of the data.

Redshift documentation: 
[https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html]. Below is its 
description of the escaping requirements:

ESCAPE

For CHAR and VARCHAR columns in delimited unload files, an escape character 
({{\}}) is placed before every occurrence of the following characters:
 * Linefeed: {{\n}}

 * Carriage return: {{\r}}

 * The delimiter character specified for the unloaded data.

 * The escape character: {{\}}

 * A quote character: {{"}} or {{'}} (if both ESCAPE and ADDQUOTES are 
specified in the UNLOAD command).

 

 

But the Spark CSV reader doesn't have an option to handle or remove the escape 
characters in front of the newline characters in the data.

It would really help if we could add a feature to handle the escaped newline 
characters through another parameter like (escapeNewline = 'true/false').

*Example:*

Below are the details of my test data set up in a file.
 * The first record in that file has escaped windows newline character (\\r\\n)
 * The third record in that file has escaped unix newline character (\\n)
 * The fifth record in that file has the escaped quote character (")

the file looks like below in vi editor:

 
{code:java}
"1","this is \^M\
line1"^M
"2","this is line2"^M
"3","this is \
line3"^M
"4","this is \" line4"^M
"5","this is line5"^M{code}
 

When I read the file in python's csv module with escape, it is able to remove 
the added escape characters as you can see below,

 
{code:java}
>>> with open('/tmp/test3.csv','r') as readCsv:
... readFile = 
csv.reader(readCsv,dialect='excel',escapechar='\\',quotechar='"',delimiter=',',doublequote=False)
... for row in readFile:
... print(row)
...
['1', 'this is \r\n line1']
['2', 'this is line2']
['3', 'this is \n line3']
['4', 'this is " line4']
['5', 'this is line5']
{code}
But if I read the same file with the Spark CSV reader, the escape characters in 
front of the newline characters are not removed, while the escape before the (") 
is removed.
{code:java}
>>> redDf=spark.read.csv(path='file:///tmp/test3.csv',header='false',sep=',',quote='"',escape='\\',multiLine='true',ignoreLeadingWhiteSpace='true',ignoreTrailingWhiteSpace='true',mode='FAILFAST',inferSchema='false')
>>> redDf.show()
+---+--+
|_c0| _c1|
+---+--+
\ 1|this is \
line1|
| 2| this is line2|
| 3| this is \
line3|
| 4| this is " line4|
| 5| this is line5|
+---+--+
{code}
 

  was:
There are some systems like AWS redshift which writes csv files by escaping 
newline characters('\r','\n') in addition to escaping the quote characters, if 
they come as part of the data.

But the spark CSV reader doesn't have a handle to treat/remove the escape 
characters infront of the newline characters in the data.

It would really help if we can add a feature to handle the escaped newline 
characters through another parameter like (escapeNewline = 'true/false').

*Example:*

Below are the details of my test data set up in a file.
 * The first record in that file has escaped windows newline character (\\r\\n)
 * The third record in that file has escaped unix newline character (\\n)
 * The fifth record in that file has the escaped quote character (")

the file looks like below in vi editor:

 
{code:java}
"1","this is \^M\
line1"^M
"2","this is line2"^M
"3","this is \
line3"^M
"4","this is \" line4"^M
"5","this is line5"^M{code}
 

When I read the file in python's csv module with escape, it is able to remove 
the added escape characters as you can see below,

 
{code:java}
>>> with open('/tmp/test3.csv','r') as readCsv:
... readFile = 
csv.reader(readCsv,dialect='excel',escapechar='\\',quotechar='"',delimiter=',',doublequote=False)
... for row in readFile:
... print(row)
...
['1', 'this is \r\n line1']
['2', 'this is line2']
['3', 'this is \n line3']
['4', 'this is " line4']
['5', 'this is line5']
{code}


 But if I read the same file in spark-csv reader, the escape characters infront 
of the newline characters are not removed.But the escape before the (") is 
removed.
{code:java}
>>> redDf=spark.read.csv(path='file:///tmp/test3.csv',header='false',sep=',',quote='"',escape='\\',multiLine='true',ignoreLeadingWhiteSpace='true',ignoreTrailingWhiteSpace='true',mode='FAILFAST',inferSchema='false')
>>> redDf.show()
+---+--+
|_c0| _c1|
+---+--+
\ 1|this is \
line1|
| 2| this is line2|
| 3| this is \
line3|
| 4| this is " line4|
| 5| this is line5|
+---+--+
{code}
 


> Handle to treat escaped newline characters('\r','\n') in spark csv
> 

[jira] [Assigned] (SPARK-26775) Update Jenkins nodes to support local volumes for K8s integration tests

2019-01-30 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp reassigned SPARK-26775:
---

Assignee: shane knapp

> Update Jenkins nodes to support local volumes for K8s integration tests
> ---
>
> Key: SPARK-26775
> URL: https://issues.apache.org/jira/browse/SPARK-26775
> Project: Spark
>  Issue Type: Improvement
>  Components: jenkins, Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Assignee: shane knapp
>Priority: Major
>
> The current version of Minikube on the test machines does not properly support 
> the local persistent volume feature required by this PR: 
> [https://github.com/apache/spark/pull/23514].
> We get this error:
> "spec.local: Forbidden: Local volumes are disabled by feature-gate, 
> metadata.annotations: Required value: Local volume requires node affinity"
> This is probably due to this: 
> [https://github.com/rancher/rancher/issues/13864] which implies that we need 
> to update to 1.10+ as described in 
> [https://kubernetes.io/docs/concepts/storage/volumes/#local]. Fabric8io 
> client is already updated in the PR mentioned at the beginning.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26786) Handle to treat escaped newline characters('\r','\n') in spark csv

2019-01-30 Thread vishnuram selvaraj (JIRA)
vishnuram selvaraj created SPARK-26786:
--

 Summary: Handle to treat escaped newline characters('\r','\n') in 
spark csv
 Key: SPARK-26786
 URL: https://issues.apache.org/jira/browse/SPARK-26786
 Project: Spark
  Issue Type: New Feature
  Components: Input/Output
Affects Versions: 2.4.0
Reporter: vishnuram selvaraj


There are some systems, like AWS Redshift, which write CSV files by escaping 
newline characters ('\r','\n') in addition to escaping the quote characters, if 
they appear as part of the data.

But the Spark CSV reader doesn't have an option to handle or remove the escape 
characters in front of the newline characters in the data.

It would really help if we could add a feature to handle the escaped newline 
characters through another parameter like (escapeNewline = 'true/false').

*Example:*

Below are the details of my test data set up in a file.
 * The first record in that file has escaped windows newline character (\\r\\n)
 * The third record in that file has escaped unix newline character (\\n)
 * The fifth record in that file has the escaped quote character (")

the file looks like below in vi editor:

 
{code:java}
"1","this is \^M\
line1"^M
"2","this is line2"^M
"3","this is \
line3"^M
"4","this is \" line4"^M
"5","this is line5"^M{code}
 

When I read the file in python's csv module with escape, it is able to remove 
the added escape characters as you can see below,

 
{code:java}
>>> with open('/tmp/test3.csv','r') as readCsv:
... readFile = 
csv.reader(readCsv,dialect='excel',escapechar='\\',quotechar='"',delimiter=',',doublequote=False)
... for row in readFile:
... print(row)
...
['1', 'this is \r\n line1']
['2', 'this is line2']
['3', 'this is \n line3']
['4', 'this is " line4']
['5', 'this is line5']
{code}


But if I read the same file with the Spark CSV reader, the escape characters in 
front of the newline characters are not removed, while the escape before the (") 
is removed.
{code:java}
>>> redDf=spark.read.csv(path='file:///tmp/test3.csv',header='false',sep=',',quote='"',escape='\\',multiLine='true',ignoreLeadingWhiteSpace='true',ignoreTrailingWhiteSpace='true',mode='FAILFAST',inferSchema='false')
>>> redDf.show()
+---+--+
|_c0| _c1|
+---+--+
\ 1|this is \
line1|
| 2| this is line2|
| 3| this is \
line3|
| 4| this is " line4|
| 5| this is line5|
+---+--+
{code}
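
Until something like the proposed escapeNewline option exists, one possible 
post-processing workaround is sketched below (Scala API). This is only a hedged 
sketch: it assumes backslash is the escape character, that the file is read 
with multiLine=true, and the default _c0/_c1 column names from header=false.

{code:scala}
import org.apache.spark.sql.functions.{col, regexp_replace}

// Hypothetical workaround sketch: read the escaped CSV as-is, then strip the
// backslash that Redshift places in front of embedded newlines.
val raw = spark.read
  .option("header", "false")
  .option("quote", "\"")
  .option("escape", "\\")
  .option("multiLine", "true")
  .csv("file:///tmp/test3.csv")

// Remove a literal backslash that immediately precedes CRLF or LF, keeping
// the newline itself ("$1" is the captured newline).
val cleaned = raw.withColumn("_c1", regexp_replace(col("_c1"), "\\\\(\r\n|\n)", "$1"))
cleaned.show(truncate = false)
{code}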
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26785) data source v2 API refactor: streaming write

2019-01-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26785:


Assignee: Apache Spark  (was: Wenchen Fan)

> data source v2 API refactor: streaming write
> 
>
> Key: SPARK-26785
> URL: https://issues.apache.org/jira/browse/SPARK-26785
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26785) data source v2 API refactor: streaming write

2019-01-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26785:


Assignee: Wenchen Fan  (was: Apache Spark)

> data source v2 API refactor: streaming write
> 
>
> Key: SPARK-26785
> URL: https://issues.apache.org/jira/browse/SPARK-26785
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26741) Analyzer incorrectly resolves aggregate function outside of Aggregate operators

2019-01-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26741:


Assignee: (was: Apache Spark)

> Analyzer incorrectly resolves aggregate function outside of Aggregate 
> operators
> ---
>
> Key: SPARK-26741
> URL: https://issues.apache.org/jira/browse/SPARK-26741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kris Mok
>Priority: Major
>
> The analyzer can sometimes hit issues with resolving functions. e.g.
> {code:sql}
> select max(id)
> from range(10)
> group by id
> having count(1) >= 1
> order by max(id)
> {code}
> The analyzed plan of this query is:
> {code:none}
> == Analyzed Logical Plan ==
> max(id): bigint
> Project [max(id)#91L]
> +- Sort [max(id#88L) ASC NULLS FIRST], true
>+- Project [max(id)#91L, id#88L]
>   +- Filter (count(1)#93L >= cast(1 as bigint))
>  +- Aggregate [id#88L], [max(id#88L) AS max(id)#91L, count(1) AS 
> count(1)#93L, id#88L]
> +- Range (0, 10, step=1, splits=None)
> {code}
> Note how an aggregate function is outside of {{Aggregate}} operators in the 
> fully analyzed plan:
> {{Sort [max(id#88L) ASC NULLS FIRST], true}}, which makes the plan invalid.
> Trying to run this query will lead to weird issues in codegen, but the root 
> cause is in the analyzer:
> {code:none}
> java.lang.UnsupportedOperationException: Cannot generate code for expression: 
> max(input[1, bigint, false])
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291)
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.$anonfun$createOrderKeys$1(GenerateOrdering.scala:82)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:237)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.createOrderKeys(GenerateOrdering.scala:82)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.genComparisons(GenerateOrdering.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:152)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:44)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1194)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.(GenerateOrdering.scala:195)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.(GenerateOrdering.scala:192)
>   at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:153)
>   at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3302)
>   at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2470)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3291)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3287)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2470)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2684)
>   at org.apache.spark.sql.Dataset.getRows(Dataset.scala:262)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:299)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:753)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:712)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:721)
> {code}
> The test case {{SPARK-23957 Remove redundant sort from subquery plan(scalar 
> subquery)}} in {{SubquerySuite}} has been disabled 

[jira] [Created] (SPARK-26785) data source v2 API refactor: streaming write

2019-01-30 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-26785:
---

 Summary: data source v2 API refactor: streaming write
 Key: SPARK-26785
 URL: https://issues.apache.org/jira/browse/SPARK-26785
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26741) Analyzer incorrectly resolves aggregate function outside of Aggregate operators

2019-01-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26741:


Assignee: Apache Spark

> Analyzer incorrectly resolves aggregate function outside of Aggregate 
> operators
> ---
>
> Key: SPARK-26741
> URL: https://issues.apache.org/jira/browse/SPARK-26741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kris Mok
>Assignee: Apache Spark
>Priority: Major
>
> The analyzer can sometimes hit issues with resolving functions. e.g.
> {code:sql}
> select max(id)
> from range(10)
> group by id
> having count(1) >= 1
> order by max(id)
> {code}
> The analyzed plan of this query is:
> {code:none}
> == Analyzed Logical Plan ==
> max(id): bigint
> Project [max(id)#91L]
> +- Sort [max(id#88L) ASC NULLS FIRST], true
>+- Project [max(id)#91L, id#88L]
>   +- Filter (count(1)#93L >= cast(1 as bigint))
>  +- Aggregate [id#88L], [max(id#88L) AS max(id)#91L, count(1) AS 
> count(1)#93L, id#88L]
> +- Range (0, 10, step=1, splits=None)
> {code}
> Note how an aggregate function is outside of {{Aggregate}} operators in the 
> fully analyzed plan:
> {{Sort [max(id#88L) ASC NULLS FIRST], true}}, which makes the plan invalid.
> Trying to run this query will lead to weird issues in codegen, but the root 
> cause is in the analyzer:
> {code:none}
> java.lang.UnsupportedOperationException: Cannot generate code for expression: 
> max(input[1, bigint, false])
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291)
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.$anonfun$createOrderKeys$1(GenerateOrdering.scala:82)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:237)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.createOrderKeys(GenerateOrdering.scala:82)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.genComparisons(GenerateOrdering.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:152)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:44)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1194)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.(GenerateOrdering.scala:195)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.(GenerateOrdering.scala:192)
>   at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:153)
>   at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3302)
>   at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2470)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3291)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3287)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2470)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2684)
>   at org.apache.spark.sql.Dataset.getRows(Dataset.scala:262)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:299)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:753)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:712)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:721)
> {code}
> The test case {{SPARK-23957 Remove redundant sort from subquery plan(scalar 
> subquery)}} in 

[jira] [Commented] (SPARK-26176) Verify column name when creating table via `STORED AS`

2019-01-30 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756245#comment-16756245
 ] 

kevin yu commented on SPARK-26176:
--

Hi Mikhail:
Sorry for the delay, yes, I am still looking into it.

Kevin
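
For reference, the workaround hinted at by the native Parquet error message 
(aliasing the aggregate) should also avoid the failure on the STORED AS path, 
since the underlying problem is the same invalid column name. A minimal sketch 
using the table names from the issue description quoted below; the alias is 
hypothetical:

{code:scala}
// Hypothetical sketch: the alias gives the column a name without the
// characters " ,;{}()\n\t=" that the Parquet writer rejects.
spark.sql("CREATE TABLE TAB1TEST STORED AS PARQUET AS SELECT COUNT(ID) AS id_count FROM TAB1")
{code}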

> Verify column name when creating table via `STORED AS`
> --
>
> Key: SPARK-26176
> URL: https://issues.apache.org/jira/browse/SPARK-26176
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> We can issue a reasonable exception when creating Parquet native tables, 
> {code:java}
> CREATE TABLE TAB1TEST USING PARQUET AS SELECT COUNT(ID) FROM TAB1;
> {code}
> {code:java}
> org.apache.spark.sql.AnalysisException: Attribute name "count(ID)" contains 
> invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
> {code}
> However, the error messages are misleading when we create a table using the 
> Hive serde "STORED AS"
> {code:java}
> CREATE TABLE TAB1TEST STORED AS PARQUET AS SELECT COUNT(ID) FROM TAB1;
> {code}
> {code:java}
> 18/11/26 09:04:44 ERROR SparkSQLDriver: Failed in [CREATE TABLE TAB2TEST 
> stored as parquet AS SELECT COUNT(col1) FROM TAB1]
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile(SaveAsHiveFile.scala:97)
>   at 
> org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile$(SaveAsHiveFile.scala:48)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.saveAsHiveFile(InsertIntoHiveTable.scala:66)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:201)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:99)
>   at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:86)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:113)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:201)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3270)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3266)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:201)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:86)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:655)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:685)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:371)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:274)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.spark.SparkException: Job aborted due to 

[jira] [Commented] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks

2019-01-30 Thread Sanket Reddy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756226#comment-16756226
 ] 

Sanket Reddy commented on SPARK-25692:
--

Created a PR: [https://github.com/apache/spark/pull/23700]. Please take a look 
and let me know your thoughts.

> Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Fix For: 3.0.0
>
> Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot 
> 2018-11-01 at 10.17.16 AM.png
>
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks

2019-01-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25692:


Assignee: Apache Spark

> Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Blocker
> Fix For: 3.0.0
>
> Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot 
> 2018-11-01 at 10.17.16 AM.png
>
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks

2019-01-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25692:


Assignee: (was: Apache Spark)

> Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Fix For: 3.0.0
>
> Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot 
> 2018-11-01 at 10.17.16 AM.png
>
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26732) Flaky test: SparkContextInfoSuite.getRDDStorageInfo only reports on RDDs that actually persist data

2019-01-30 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-26732.
--
   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 3.0.0
   2.4.1
   2.3.3

Resolved by https://github.com/apache/spark/pull/23654

> Flaky test: SparkContextInfoSuite.getRDDStorageInfo only reports on RDDs that 
> actually persist data
> ---
>
> Key: SPARK-26732
> URL: https://issues.apache.org/jira/browse/SPARK-26732
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.2, 2.4.0, 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> From 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/5437/testReport/junit/org.apache.spark/SparkContextInfoSuite/getRDDStorageInfo_only_reports_on_RDDs_that_actually_persist_data/:
> {noformat}
> Error Message
> org.scalatest.exceptions.TestFailedException: 0 did not equal 1
> Stacktrace
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 0 did 
> not equal 1
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.SparkContextInfoSuite.$anonfun$new$3(SparkContextInfoSuite.scala:63)
> {noformat}
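
Not the actual patch, but for context: the usual way to stabilize this kind of 
assertion, when storage status is reported asynchronously, is to poll until it 
converges instead of asserting once. A hedged ScalaTest sketch (assumes a 
SparkContext named sc and the Eventually trait on the classpath):

{code:scala}
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// Retry the assertion for up to 10 seconds instead of checking exactly once.
eventually(timeout(10.seconds), interval(100.millis)) {
  assert(sc.getRDDStorageInfo.length == 1)
}
{code}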



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks

2019-01-30 Thread Sanket Reddy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756210#comment-16756210
 ] 

Sanket Reddy commented on SPARK-25692:
--

Did some further digging.

How to reproduce:
./build/mvn test 
-Dtest=org.apache.spark.network.RequestTimeoutIntegrationSuite,org.apache.spark.network.ChunkFetchIntegrationSuite
 -DwildcardSuites=None test

The furtherRequestsDelay test within RequestTimeoutIntegrationSuite was holding 
onto worker references. The test does close the server context, but since the 
threads are global and there is a 60-second sleep to fetch a specific chunk 
within this test, a worker grabs the chunk and waits for the client to consume 
it. However, the test is checking for a request timeout and times out after 10 
seconds, so the workers are left waiting for the buffer to be consumed, as far 
as I understand. I don't think this needs to be static, since the server 
initializes the TransportContext object only once. I did some manual tests and 
it looks good.
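
To illustrate the general pattern being proposed (instance-scoped rather than 
static worker threads), here is a hedged sketch. The class and field names are 
made up; this is not the actual Spark change.

{code:scala}
import java.util.concurrent.{ExecutorService, Executors, TimeUnit}

// Hypothetical sketch: tie the chunk-fetch worker pool to the server instance
// so that closing the server in one test cannot leave shared global threads
// blocked for other tests running in the same JVM.
class ChunkServer(numWorkers: Int) extends AutoCloseable {
  private val workers: ExecutorService = Executors.newFixedThreadPool(numWorkers)

  def submitFetch(task: Runnable): Unit = workers.submit(task)

  override def close(): Unit = {
    workers.shutdownNow()                        // interrupt blocked workers
    workers.awaitTermination(10, TimeUnit.SECONDS)
  }
}
{code}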

> Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Fix For: 3.0.0
>
> Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot 
> 2018-11-01 at 10.17.16 AM.png
>
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26765) Avro: Validate input and output schema

2019-01-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756178#comment-16756178
 ] 

Apache Spark commented on SPARK-26765:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/23699

> Avro: Validate input and output schema
> --
>
> Key: SPARK-26765
> URL: https://issues.apache.org/jira/browse/SPARK-26765
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> The API supportDataType in FileFormat helps to validate the output/input 
> schema before execution starts, so that we can avoid some invalid data source 
> IO and users can see clean error messages.
> This PR is to override the validation API in the Avro data source.
> Also, as per the Avro spec (https://avro.apache.org/docs/1.8.2/spec.html), 
> NullType is supported. This PR fixes the handling of NullType.
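
As a hedged illustration of what such validation looks like in general (not 
the Avro data source's actual code, and independent of the exact 
FileFormat.supportDataType signature), a recursive schema check might be 
sketched as:

{code:scala}
import org.apache.spark.sql.types._

// Hypothetical sketch: walk a StructType and fail fast on any field whose
// type the target format cannot store, instead of failing mid-write.
def validateSchema(schema: StructType)(supports: DataType => Boolean): Unit = {
  def check(dt: DataType, path: String): Unit = dt match {
    case s: StructType => s.fields.foreach(f => check(f.dataType, s"$path.${f.name}"))
    case a: ArrayType  => check(a.elementType, s"$path.element")
    case m: MapType    =>
      check(m.keyType, s"$path.key")
      check(m.valueType, s"$path.value")
    case leaf if supports(leaf) => () // supported leaf type, nothing to do
    case leaf =>
      throw new IllegalArgumentException(
        s"Column '$path' has unsupported type ${leaf.simpleString}")
  }
  schema.fields.foreach(f => check(f.dataType, f.name))
}
{code}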



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26784) Allow running driver pod as provided user

2019-01-30 Thread Alexander Mukhopad (JIRA)
Alexander Mukhopad created SPARK-26784:
--

 Summary: Allow running driver pod as provided user
 Key: SPARK-26784
 URL: https://issues.apache.org/jira/browse/SPARK-26784
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: Alexander Mukhopad


Add the possibility to override the _Dockerfile_'s _USER_ directive by adding an 
option (or options) to spark-submit, specifying username:group and/or uid:gid.

https://github.com/apache/spark/blob/master/docs/running-on-kubernetes.md#user-identity



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26766) Remove the list of filesystems from HadoopDelegationTokenProvider.obtainDelegationTokens

2019-01-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26766:


Assignee: (was: Apache Spark)

> Remove the list of filesystems from 
> HadoopDelegationTokenProvider.obtainDelegationTokens
> 
>
> Key: SPARK-26766
> URL: https://issues.apache.org/jira/browse/SPARK-26766
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Minor
>
> This was discussed in previous PR 
> [here|https://github.com/apache/spark/pull/23499/files#diff-406f99efa37915001b613de47815e25cR54].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26766) Remove the list of filesystems from HadoopDelegationTokenProvider.obtainDelegationTokens

2019-01-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26766:


Assignee: Apache Spark

> Remove the list of filesystems from 
> HadoopDelegationTokenProvider.obtainDelegationTokens
> 
>
> Key: SPARK-26766
> URL: https://issues.apache.org/jira/browse/SPARK-26766
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Assignee: Apache Spark
>Priority: Minor
>
> This was discussed in previous PR 
> [here|https://github.com/apache/spark/pull/23499/files#diff-406f99efa37915001b613de47815e25cR54].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26783) Kafka parameter documentation doesn't match with the reality (upper/lowercase)

2019-01-30 Thread Gabor Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-26783:
--
Priority: Minor  (was: Major)

> Kafka parameter documentation doesn't match with the reality (upper/lowercase)
> --
>
> Key: SPARK-26783
> URL: https://issues.apache.org/jira/browse/SPARK-26783
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Minor
>
> A good example for this is "failOnDataLoss" which is reported in SPARK-23685. 
> I've just checked and there are several other parameters which suffer from 
> the same issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23685) Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2019-01-30 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756104#comment-16756104
 ] 

Gabor Somogyi commented on SPARK-23685:
---

Filed SPARK-26783.

> Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive 
> Offsets (i.e. Log Compaction)
> -
>
> Key: SPARK-23685
> URL: https://issues.apache.org/jira/browse/SPARK-23685
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: sirisha
>Priority: Major
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaSourceRDD & CachedKafkaConsumer assumes that the next offset will always 
> be just an increment of 1. If not, it throws the exception below:
>  
> "Cannot fetch records in [5589, 5693) (GroupId: XXX, TopicPartition:). 
> Some data may have been lost because they are not available in Kafka any 
> more; either the data was aged out by Kafka or the topic may have been 
> deleted before all the data in the topic was processed. If you don't want 
> your streaming query to fail on such cases, set the source option 
> "failOnDataLoss" to "false". "
>  
> FYI: This bug is related to https://issues.apache.org/jira/browse/SPARK-17147
>  
>  
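
For reference, a minimal sketch of the mitigation suggested by the error message quoted above: disabling failOnDataLoss on the Kafka source of a structured streaming query (the broker address and topic name below are placeholders):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-compaction-sketch").getOrCreate()

// Read a possibly log-compacted topic without failing the query when offsets
// are missing. The option is documented as "failOnDataLoss"; see SPARK-26783
// for the upper/lowercase discussion.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("subscribe", "events")                       // placeholder topic
  .option("failOnDataLoss", "false")
  .load()
{code}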



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26783) Kafka parameter documentation doesn't match with the reality (upper/lowercase)

2019-01-30 Thread Gabor Somogyi (JIRA)
Gabor Somogyi created SPARK-26783:
-

 Summary: Kafka parameter documentation doesn't match with the 
reality (upper/lowercase)
 Key: SPARK-26783
 URL: https://issues.apache.org/jira/browse/SPARK-26783
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Gabor Somogyi


A good example for this is "failOnDataLoss" which is reported in SPARK-23685. 
I've just checked and there are several other parameters which suffer from the 
same issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23685) Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2019-01-30 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756077#comment-16756077
 ] 

Gabor Somogyi edited comment on SPARK-23685 at 1/30/19 1:19 PM:


Comment from [~sindiri] on the PR:
{quote}Originally this pr was created as "failOnDataLoss" doesn't have any 
impact when set in structured streaming. But found out that ,the variable that 
needs to be used is "failondataloss" (all in lower case).
This is not properly documented in Spark documentations. Hence, closing the pr 
. Thanks{quote}
I will file a PR to fix the upper/lowercase things.


was (Author: gsomogyi):
Comment from [~sindiri] on the PR:
{quote}Originally this pr was created as "failOnDataLoss" doesn't have any 
impact when set in structured streaming. But found out that ,the variable that 
needs to be used is "failondataloss" (all in lower case).
This is not properly documented in Spark documentations. Hence, closing the pr 
. Thanks{quote}
Closing the jira.

> Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive 
> Offsets (i.e. Log Compaction)
> -
>
> Key: SPARK-23685
> URL: https://issues.apache.org/jira/browse/SPARK-23685
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: sirisha
>Priority: Major
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaSourceRDD & CachedKafkaConsumer assumes that the next offset will always 
> be just an increment of 1. If not, it throws the exception below:
>  
> "Cannot fetch records in [5589, 5693) (GroupId: XXX, TopicPartition:). 
> Some data may have been lost because they are not available in Kafka any 
> more; either the data was aged out by Kafka or the topic may have been 
> deleted before all the data in the topic was processed. If you don't want 
> your streaming query to fail on such cases, set the source option 
> "failOnDataLoss" to "false". "
>  
> FYI: This bug is related to https://issues.apache.org/jira/browse/SPARK-17147
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23685) Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2019-01-30 Thread Gabor Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi resolved SPARK-23685.
---
Resolution: Information Provided

> Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive 
> Offsets (i.e. Log Compaction)
> -
>
> Key: SPARK-23685
> URL: https://issues.apache.org/jira/browse/SPARK-23685
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: sirisha
>Priority: Major
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaSourceRDD & CachedKafkaConsumer assumes that the next offset will always 
> be just an increment of 1. If not, it throws the exception below:
>  
> "Cannot fetch records in [5589, 5693) (GroupId: XXX, TopicPartition:). 
> Some data may have been lost because they are not available in Kafka any 
> more; either the data was aged out by Kafka or the topic may have been 
> deleted before all the data in the topic was processed. If you don't want 
> your streaming query to fail on such cases, set the source option 
> "failOnDataLoss" to "false". "
>  
> FYI: This bug is related to https://issues.apache.org/jira/browse/SPARK-17147
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23685) Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2019-01-30 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756077#comment-16756077
 ] 

Gabor Somogyi edited comment on SPARK-23685 at 1/30/19 1:22 PM:


Comment from [~sindiri] on the PR:
{quote}Originally this pr was created as "failOnDataLoss" doesn't have any 
impact when set in structured streaming. But found out that ,the variable that 
needs to be used is "failondataloss" (all in lower case).
This is not properly documented in Spark documentations. Hence, closing the pr 
. Thanks{quote}
Closing the jira because I will file a new one to handle several similar configs 
in one PR.


was (Author: gsomogyi):
Comment from [~sindiri] on the PR:
{quote}Originally this pr was created as "failOnDataLoss" doesn't have any 
impact when set in structured streaming. But found out that ,the variable that 
needs to be used is "failondataloss" (all in lower case).
This is not properly documented in Spark documentations. Hence, closing the pr 
. Thanks{quote}
I will file a PR to fix the upper/lowercase things.

> Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive 
> Offsets (i.e. Log Compaction)
> -
>
> Key: SPARK-23685
> URL: https://issues.apache.org/jira/browse/SPARK-23685
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: sirisha
>Priority: Major
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaSourceRDD & CachedKafkaConsumer assumes that the next offset will always 
> be just an increment of 1. If not, it throws the exception below:
>  
> "Cannot fetch records in [5589, 5693) (GroupId: XXX, TopicPartition:). 
> Some data may have been lost because they are not available in Kafka any 
> more; either the data was aged out by Kafka or the topic may have been 
> deleted before all the data in the topic was processed. If you don't want 
> your streaming query to fail on such cases, set the source option 
> "failOnDataLoss" to "false". "
>  
> FYI: This bug is related to https://issues.apache.org/jira/browse/SPARK-17147
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26758) Idle Executors are not getting killed after spark.dynamicAllocation.executorIdleTimeout value

2019-01-30 Thread sandeep katta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756070#comment-16756070
 ] 

sandeep katta commented on SPARK-26758:
---

I am able to reproduce this issue and will soon provide a patch for it.

Driver Logs


2019-01-30 14:16:39,134 | INFO | spark-dynamic-executor-allocation | *Request 
to remove executorIds: 2, 1* | 
org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
2019-01-30 14:16:39,135 | DEBUG | spark-dynamic-executor-allocation | *Not 
removing idle executor 2 because there are only 3 executor(s) left* (number of 
executor target 3) | 
org.apache.spark.internal.Logging$class.logDebug(Logging.scala:58)
2019-01-30 14:16:39,135 | DEBUG | spark-dynamic-executor-allocation | Not 
removing idle executor 2 because there are only 3 executor(s) left (number of 
executor target 3) | 
org.apache.spark.internal.Logging$class.logDebug(Logging.scala:58)
2019-01-30 14:16:39,135 | DEBUG | spark-dynamic-executor-allocation | Not 
removing idle executor 1 because there are only 3 executor(s) left (number of 
executor target 3) | 
org.apache.spark.internal.Logging$class.logDebug(Logging.scala:58)
2019-01-30 14:16:39,135 | DEBUG | spark-dynamic-executor-allocation | Not 
removing idle executor 1 because there are only 3 executor(s) left (number of 
executor target 3) | 
org.apache.spark.internal.Logging$class.logDebug(Logging.scala:58)
2019-01-30 14:16:39,241 | DEBUG | spark-dynamic-executor-allocation | Lowering 
target number of executors to 0 (previously 3) because not all requested 
executors are actually needed | 
org.apache.spark.internal.Logging$class.logDebug(Logging.scala:58)
2019-01-30 14:16:39,241 | DEBUG | spark-dynamic-executor-allocation | Lowering 
target number of executors to 0 (previously 3) because not all requested 
executors are actually needed | 
org.apache.spark.internal.Logging$class.logDebug(Logging.scala:58)
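
For reference, the same allocation settings expressed programmatically (a sketch mirroring the spark-shell command quoted below; it assumes a YARN cluster and, as usual for dynamic allocation, an external shuffle service):

{code:scala}
import org.apache.spark.sql.SparkSession

// Same settings as the spark-shell command in the issue description.
val spark = SparkSession.builder()
  .master("yarn")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.initialExecutors", "3")
  .config("spark.dynamicAllocation.minExecutors", "0")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .getOrCreate()
{code}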

> Idle Executors are not getting killed after 
> spark.dynamicAllocation.executorIdleTimeout value
> -
>
> Key: SPARK-26758
> URL: https://issues.apache.org/jira/browse/SPARK-26758
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.0
> Environment: Spark Version:2.4
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: SPARK-26758.png
>
>
> Steps:
> 1. Submit Spark shell with the settings below: initial executors = 3, minimum 
> executors = 0 and executorIdleTimeout = 60s
> {code}
> bin/spark-shell --master yarn --conf spark.dynamicAllocation.enabled=true \
>   --conf spark.dynamicAllocation.initialExecutors=3 \
>   --conf spark.dynamicAllocation.minExecutors=0 \
>   --conf spark.dynamicAllocation.executorIdleTimeout=60s
> {code}
> 2. Launch Spark UI and check under Executor Tab
> Observation:
> 3 executors are assigned initially. After 60s (executorIdleTimeout), the 
> number of active executors remains the same.
> Expected:
> Apart from the AM container, all other executors should be dead.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23685) Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2019-01-30 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756077#comment-16756077
 ] 

Gabor Somogyi commented on SPARK-23685:
---

Comment from [~sindiri] on the PR:
{quote}Originally this pr was created as "failOnDataLoss" doesn't have any 
impact when set in structured streaming. But found out that ,the variable that 
needs to be used is "failondataloss" (all in lower case).
This is not properly documented in Spark documentations. Hence, closing the pr 
. Thanks{quote}
Closing the jira.

> Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive 
> Offsets (i.e. Log Compaction)
> -
>
> Key: SPARK-23685
> URL: https://issues.apache.org/jira/browse/SPARK-23685
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: sirisha
>Priority: Major
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaSourceRDD & CachedKafkaConsumer assumes that the next offset will always 
> be just an increment of 1. If not, it throws the exception below:
>  
> "Cannot fetch records in [5589, 5693) (GroupId: XXX, TopicPartition:). 
> Some data may have been lost because they are not available in Kafka any 
> more; either the data was aged out by Kafka or the topic may have been 
> deleted before all the data in the topic was processed. If you don't want 
> your streaming query to fail on such cases, set the source option 
> "failOnDataLoss" to "false". "
>  
> FYI: This bug is related to https://issues.apache.org/jira/browse/SPARK-17147
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26758) Idle Executors are not getting killed after spark.dynamicAllocation.executorIdleTimeout value

2019-01-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26758:


Assignee: (was: Apache Spark)

> Idle Executors are not getting killed after 
> spark.dynamicAllocation.executorIdleTimeout value
> -
>
> Key: SPARK-26758
> URL: https://issues.apache.org/jira/browse/SPARK-26758
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.0
> Environment: Spark Version:2.4
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: SPARK-26758.png
>
>
> Steps:
> 1. Submit Spark shell with the settings below: initial executors = 3, minimum 
> executors = 0 and executorIdleTimeout = 60s
> {code}
> bin/spark-shell --master yarn --conf spark.dynamicAllocation.enabled=true \
>   --conf spark.dynamicAllocation.initialExecutors=3 \
>   --conf spark.dynamicAllocation.minExecutors=0 \
>   --conf spark.dynamicAllocation.executorIdleTimeout=60s
> {code}
> 2. Launch Spark UI and check under Executor Tab
> Observation:
> 3 executors are assigned initially. After 60s (executorIdleTimeout), the 
> number of active executors remains the same.
> Expected:
> Apart from the AM container, all other executors should be dead.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26758) Idle Executors are not getting killed after spark.dynamicAllocation.executorIdleTimeout value

2019-01-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26758:


Assignee: Apache Spark

> Idle Executors are not getting killed after 
> spark.dynamicAllocation.executorIdleTimeout value
> -
>
> Key: SPARK-26758
> URL: https://issues.apache.org/jira/browse/SPARK-26758
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.0
> Environment: Spark Version:2.4
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: Apache Spark
>Priority: Major
> Attachments: SPARK-26758.png
>
>
> Steps:
> 1. Submit Spark shell with the settings below: initial executors = 3, minimum 
> executors = 0 and executorIdleTimeout = 60s
> {code}
> bin/spark-shell --master yarn --conf spark.dynamicAllocation.enabled=true \
>   --conf spark.dynamicAllocation.initialExecutors=3 \
>   --conf spark.dynamicAllocation.minExecutors=0 \
>   --conf spark.dynamicAllocation.executorIdleTimeout=60s
> {code}
> 2. Launch Spark UI and check under Executor Tab
> Observation:
> 3 executors are assigned initially. After 60s (executorIdleTimeout), the 
> number of active executors remains the same.
> Expected:
> Apart from the AM container, all other executors should be dead.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26176) Verify column name when creating table via `STORED AS`

2019-01-30 Thread Mikhail (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756002#comment-16756002
 ] 

Mikhail commented on SPARK-26176:
-

Hello [~kevinyu98]
Are you still looking into it?

> Verify column name when creating table via `STORED AS`
> --
>
> Key: SPARK-26176
> URL: https://issues.apache.org/jira/browse/SPARK-26176
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> We can issue a reasonable exception when we creating Parquet native tables, 
> {code:java}
> CREATE TABLE TAB1TEST USING PARQUET AS SELECT COUNT(ID) FROM TAB1;
> {code}
> {code:java}
> org.apache.spark.sql.AnalysisException: Attribute name "count(ID)" contains 
> invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
> {code}
> However, the error messages are misleading when we create a table using the 
> Hive serde "STORED AS"
> {code:java}
> CREATE TABLE TAB1TEST STORED AS PARQUET AS SELECT COUNT(ID) FROM TAB1;
> {code}
> {code:java}
> 18/11/26 09:04:44 ERROR SparkSQLDriver: Failed in [CREATE TABLE TAB2TEST 
> stored as parquet AS SELECT COUNT(col1) FROM TAB1]
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile(SaveAsHiveFile.scala:97)
>   at 
> org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile$(SaveAsHiveFile.scala:48)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.saveAsHiveFile(InsertIntoHiveTable.scala:66)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:201)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:99)
>   at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:86)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:113)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:201)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3270)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3266)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:201)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:86)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:655)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:685)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:371)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:274)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 
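
As the AnalysisException in the first snippet suggests, the usual workaround is to alias the aggregate so the column gets a Parquet-friendly name; a minimal sketch (assuming the TAB1 table from the description exists, with a placeholder alias):

{code:scala}
// Rename count(ID) via an alias so the stored column name contains none of
// the characters Parquet rejects.
spark.sql(
  "CREATE TABLE TAB1TEST STORED AS PARQUET AS SELECT COUNT(ID) AS id_count FROM TAB1")
{code}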

[jira] [Resolved] (SPARK-26782) Wrong column resolved when joining twice with the same dataframe

2019-01-30 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-26782.
-
Resolution: Duplicate

> Wrong column resolved when joining twice with the same dataframe
> 
>
> Key: SPARK-26782
> URL: https://issues.apache.org/jira/browse/SPARK-26782
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Vladimir Prus
>Priority: Major
>
> # Execute the following code:
>  
> {code:java}
> {
>  val events = Seq(("a", 0)).toDF("id", "ts")
>  val dim = Seq(("a", 0, 24), ("a", 24, 48)).toDF("id", "start", "end")
>  
>  val dimOriginal = dim.as("dim")
>  val dimShifted = dim.as("dimShifted")
> val r = events
>  .join(dimOriginal, "id")
>  .where(dimOriginal("start") <= $"ts" && $"ts" < dimOriginal("end"))
> val r2 = r 
>  .join(dimShifted, "id")
>  .where(dimShifted("start") <= $"ts" + 24 && $"ts" + 24 < dimShifted("end"))
>  
>  r2.show() 
>  r2.explain(true)
> }
> {code}
>  
>  # Expected effect:
>  ** One row is shown
>  ** Logical plan shows two independent joins with "dim" and "dimShifted"
>  # Observed effect:
>  ** No rows are printed.
>  ** Logical plan shows two filters are applied:
>  *** 'Filter ((start#17 <= ('ts + 24)) && (('ts + 24) < end#18))'
>  *** Filter ((start#17 <= ts#6) && (ts#6 < end#18))
>  ** Both these filters refer to the same start#17 and start#18 columns, so 
> they are applied to the same dataframe, not two different ones.
>  ** It appears that dimShifted("start") is resolved to be identical to 
> dimOriginal("start")
>  # I get the desired effect if I replace the second where with 
> {code:java}
> .where($"dimShifted.start" <= $"ts" + 24 && $"ts" + 24 < $"dimShifted.end")
> {code}
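
Putting point 3 together with the snippet from point 1, the second join then reads as follows (a sketch; only the where clause changes):

{code:scala}
val r2 = r
  .join(dimShifted, "id")
  .where($"dimShifted.start" <= $"ts" + 24 && $"ts" + 24 < $"dimShifted.end")
{code}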



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26782) Wrong column resolved when joining twice with the same dataframe

2019-01-30 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755987#comment-16755987
 ] 

Marco Gaido commented on SPARK-26782:
-

This is a duplicate of many others. I also started a thread on the dev mailing 
list regarding this problem. Let me close this as a duplicate.

> Wrong column resolved when joining twice with the same dataframe
> 
>
> Key: SPARK-26782
> URL: https://issues.apache.org/jira/browse/SPARK-26782
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Vladimir Prus
>Priority: Major
>
> # Execute the following code:
>  
> {code:java}
> {
>  val events = Seq(("a", 0)).toDF("id", "ts")
>  val dim = Seq(("a", 0, 24), ("a", 24, 48)).toDF("id", "start", "end")
>  
>  val dimOriginal = dim.as("dim")
>  val dimShifted = dim.as("dimShifted")
> val r = events
>  .join(dimOriginal, "id")
>  .where(dimOriginal("start") <= $"ts" && $"ts" < dimOriginal("end"))
> val r2 = r 
>  .join(dimShifted, "id")
>  .where(dimShifted("start") <= $"ts" + 24 && $"ts" + 24 < dimShifted("end"))
>  
>  r2.show() 
>  r2.explain(true)
> }
> {code}
>  
>  # Expected effect:
>  ** One row is shown
>  ** Logical plan shows two independent joins with "dim" and "dimShifted"
>  # Observed effect:
>  ** No rows are printed.
>  ** Logical plan shows two filters are applied:
>  *** 'Filter ((start#17 <= ('ts + 24)) && (('ts + 24) < end#18))'
>  *** Filter ((start#17 <= ts#6) && (ts#6 < end#18))
>  ** Both these filters refer to the same start#17 and start#18 columns, so 
> they are applied to the same dataframe, not two different ones.
>  ** It appears that dimShifted("start") is resolved to be identical to 
> dimOriginal("start")
>  # I get the desired effect if I replace the second where with 
> {code:java}
> .where($"dimShifted.start" <= $"ts" + 24 && $"ts" + 24 < $"dimShifted.end")
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26782) Wrong column resolved when joining twice with the same dataframe

2019-01-30 Thread Vladimir Prus (JIRA)
Vladimir Prus created SPARK-26782:
-

 Summary: Wrong column resolved when joining twice with the same 
dataframe
 Key: SPARK-26782
 URL: https://issues.apache.org/jira/browse/SPARK-26782
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Vladimir Prus


# Execute the following code:

 
{code:java}
{
 val events = Seq(("a", 0)).toDF("id", "ts")
 val dim = Seq(("a", 0, 24), ("a", 24, 48)).toDF("id", "start", "end")
 
 val dimOriginal = dim.as("dim")
 val dimShifted = dim.as("dimShifted")
val r = events
 .join(dimOriginal, "id")
 .where(dimOriginal("start") <= $"ts" && $"ts" < dimOriginal("end"))
val r2 = r 
 .join(dimShifted, "id")
 .where(dimShifted("start") <= $"ts" + 24 && $"ts" + 24 < dimShifted("end"))
 
 r2.show() 
 r2.explain(true)
}
{code}
 

 # Expected effect:
 ** One row is shown
 ** Logical plan shows two independent joins with "dim" and "dimShifted"
 # Observed effect:
 ** No rows are printed.
 ** Logical plan shows two filters are applied:
 *** 'Filter ((start#17 <= ('ts + 24)) && (('ts + 24) < end#18))'
 *** Filter ((start#17 <= ts#6) && (ts#6 < end#18))
 ** Both these filters refer to the same start#17 and start#18 columns, so they 
are applied to the same dataframe, not two different ones.
 ** It appears that dimShifted("start") is resolved to be identical to 
dimOriginal("start")
 # I get the desired effect if I replace the second where with 

{code:java}
.where($"dimShifted.start" <= $"ts" + 24 && $"ts" + 24 < $"dimShifted.end")
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26777) SQL worked in 2.3.2 and fails in 2.4.0

2019-01-30 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755947#comment-16755947
 ] 

Hyukjin Kwon commented on SPARK-26777:
--

Please provide a _minimised_, _self-runnable_ reproducer if possible rather 
than making people put in duplicated effort. This just sounds like a request for 
investigation.
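
For illustration, an untested sketch of what such a minimised, self-runnable reproducer could look like in spark-shell: a tiny partitioned table plus the same nested scalar-subquery shape (table and column names are placeholders, and whether this actually triggers the same error has not been verified):

{code:scala}
import spark.implicits._

// Tiny partitioned table standing in for the original one.
Seq((2019, 1, 30, "a"), (2018, 12, 31, "b"))
  .toDF("partition_year_utc", "partition_month_utc", "partition_day_utc", "payload")
  .write
  .partitionBy("partition_year_utc", "partition_month_utc", "partition_day_utc")
  .saveAsTable("leads_mini")

// Same query shape: scalar subqueries over partition columns only.
spark.sql("""
  SELECT partition_year_utc, partition_month_utc, partition_day_utc
  FROM leads_mini
  WHERE partition_year_utc = (SELECT max(partition_year_utc) FROM leads_mini)
    AND partition_month_utc = (
      SELECT max(partition_month_utc) FROM leads_mini m
      WHERE m.partition_year_utc = (SELECT max(partition_year_utc) FROM leads_mini))
  ORDER BY 1 DESC, 2 DESC, 3 DESC LIMIT 1
""").show(1, false)
{code}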

> SQL worked in 2.3.2 and fails in 2.4.0
> --
>
> Key: SPARK-26777
> URL: https://issues.apache.org/jira/browse/SPARK-26777
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Yuri Budilov
>Priority: Major
>
> Following SQL worked in Spark 2.3.2 and now fails on 2.4.0 (AWS EMR Spark)
>  PySpark call below:
> spark.sql("select partition_year_utc,partition_month_utc,partition_day_utc \
> from datalake_reporting.copy_of_leads_notification \
> where partition_year_utc = (select max(partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification) \
> and partition_month_utc = \
>  (select max(partition_month_utc) from 
> datalake_reporting.copy_of_leads_notification as m \
>  where \
>  m.partition_year_utc = (select max(partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification)) \
>  and partition_day_utc = (select max(d.partition_day_utc) from 
> datalake_reporting.copy_of_leads_notification as d \
>  where d.partition_month_utc = \
>  (select max(m1.partition_month_utc) from 
> datalake_reporting.copy_of_leads_notification as m1 \
>  where m1.partition_year_utc = \
>  (select max(y.partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification as y) \
>  ) \
>  ) \
>  order by 1 desc, 2 desc, 3 desc limit 1 ").show(1,False)
> Error: (no need for data, this is syntax).
> py4j.protocol.Py4JJavaError: An error occurred while calling o1326.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#4495 []
>  
> Note: all 3 columns in query are Partitioned columns - see bottom of the 
> schema)
>  
> Hive EMR AWS Schema is:
>  
> CREATE EXTERNAL TABLE `copy_of_leads_notification`(
> `message.environment.siteorigin` string, `dcpheader.dcploaddateutc` string, 
> `message.id` int, `source.properties._country` string, `message.created` 
> string, `dcpheader.generatedmessageid` string, `message.tags` bigint, 
> `source.properties._enqueuedtimeutc` string, `source.properties._leadtype` 
> string, `message.itemid` string, `message.prospect.postcode` string, 
> `message.prospect.email` string, `message.referenceid` string, 
> `message.item.year` string, `message.identifier` string, 
> `dcpheader.dcploadmonthutc` string, `message.processed` string, 
> `source.properties._tenant` string, `message.item.price` string, 
> `message.subscription.confirmresponse` boolean, `message.itemtype` string, 
> `message.prospect.lastname` string, `message.subscription.insurancequote` 
> boolean, `source.exchangename` string, 
> `message.prospect.identificationnumbers` bigint, 
> `message.environment.ipaddress` string, `dcpheader.dcploaddayutc` string, 
> `source.properties._itemtype` string, `source.properties._requesttype` 
> string, `message.item.make` string, `message.prospect.firstname` string, 
> `message.subscription.survey` boolean, `message.prospect.homephone` string, 
> `message.extendedproperties` bigint, `message.subscription.financequote` 
> boolean, `message.uniqueidentifier` string, `source.properties._id` string, 
> `dcpheader.sourcemessageguid` string, `message.requesttype` string, 
> `source.routingkey` string, `message.service` string, `message.item.model` 
> string, `message.environment.pagesource` string, `source.source` string, 
> `message.sellerid` string, `partition_date_utc` string, 
> `message.selleridentifier` string, `message.subscription.newsletter` boolean, 
> `dcpheader.dcploadyearutc` string, `message.leadtype` string, 
> `message.history` bigint, `message.callconnect.calloutcome` string, 
> `message.callconnect.datecreatedutc` string, 
> `message.callconnect.callrecordingurl` string, 
> `message.callconnect.transferoutcome` string, 
> `message.callconnect.hiderecording` boolean, 
> `message.callconnect.callstartutc` string, `message.callconnect.code` string, 
> `message.callconnect.callduration` string, `message.fraudnetinfo` string, 
> `message.callconnect.answernumber` string, `message.environment.sourcedevice` 
> string, `message.comments` string, `message.fraudinfo.servervariables` 
> bigint, `message.callconnect.servicenumber` string, 
> `message.callconnect.callid` string, `message.callconnect.voicemailurl` 
> string, `message.item.stocknumber` string, 
> `message.callconnect.answerduration` string, `message.callconnect.callendutc` 
> string, `message.item.series` string, `message.item.detailsurl` string, 
> `message.item.pricetype` 

[jira] [Commented] (SPARK-26777) SQL worked in 2.3.2 and fails in 2.4.0

2019-01-30 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755949#comment-16755949
 ] 

Hyukjin Kwon commented on SPARK-26777:
--

Please don't reopen. Take a look https://spark.apache.org/contributing.html

> SQL worked in 2.3.2 and fails in 2.4.0
> --
>
> Key: SPARK-26777
> URL: https://issues.apache.org/jira/browse/SPARK-26777
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Yuri Budilov
>Priority: Major
>
> Following SQL worked in Spark 2.3.2 and now fails on 2.4.0 (AWS EMR Spark)
>  PySpark call below:
> spark.sql("select partition_year_utc,partition_month_utc,partition_day_utc \
> from datalake_reporting.copy_of_leads_notification \
> where partition_year_utc = (select max(partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification) \
> and partition_month_utc = \
>  (select max(partition_month_utc) from 
> datalake_reporting.copy_of_leads_notification as m \
>  where \
>  m.partition_year_utc = (select max(partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification)) \
>  and partition_day_utc = (select max(d.partition_day_utc) from 
> datalake_reporting.copy_of_leads_notification as d \
>  where d.partition_month_utc = \
>  (select max(m1.partition_month_utc) from 
> datalake_reporting.copy_of_leads_notification as m1 \
>  where m1.partition_year_utc = \
>  (select max(y.partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification as y) \
>  ) \
>  ) \
>  order by 1 desc, 2 desc, 3 desc limit 1 ").show(1,False)
> Error: (no need for data, this is syntax).
> py4j.protocol.Py4JJavaError: An error occurred while calling o1326.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#4495 []
>  
> Note: all 3 columns in query are Partitioned columns - see bottom of the 
> schema)
>  
> Hive EMR AWS Schema is:
>  
> CREATE EXTERNAL TABLE `copy_of_leads_notification`(
> `message.environment.siteorigin` string, `dcpheader.dcploaddateutc` string, 
> `message.id` int, `source.properties._country` string, `message.created` 
> string, `dcpheader.generatedmessageid` string, `message.tags` bigint, 
> `source.properties._enqueuedtimeutc` string, `source.properties._leadtype` 
> string, `message.itemid` string, `message.prospect.postcode` string, 
> `message.prospect.email` string, `message.referenceid` string, 
> `message.item.year` string, `message.identifier` string, 
> `dcpheader.dcploadmonthutc` string, `message.processed` string, 
> `source.properties._tenant` string, `message.item.price` string, 
> `message.subscription.confirmresponse` boolean, `message.itemtype` string, 
> `message.prospect.lastname` string, `message.subscription.insurancequote` 
> boolean, `source.exchangename` string, 
> `message.prospect.identificationnumbers` bigint, 
> `message.environment.ipaddress` string, `dcpheader.dcploaddayutc` string, 
> `source.properties._itemtype` string, `source.properties._requesttype` 
> string, `message.item.make` string, `message.prospect.firstname` string, 
> `message.subscription.survey` boolean, `message.prospect.homephone` string, 
> `message.extendedproperties` bigint, `message.subscription.financequote` 
> boolean, `message.uniqueidentifier` string, `source.properties._id` string, 
> `dcpheader.sourcemessageguid` string, `message.requesttype` string, 
> `source.routingkey` string, `message.service` string, `message.item.model` 
> string, `message.environment.pagesource` string, `source.source` string, 
> `message.sellerid` string, `partition_date_utc` string, 
> `message.selleridentifier` string, `message.subscription.newsletter` boolean, 
> `dcpheader.dcploadyearutc` string, `message.leadtype` string, 
> `message.history` bigint, `message.callconnect.calloutcome` string, 
> `message.callconnect.datecreatedutc` string, 
> `message.callconnect.callrecordingurl` string, 
> `message.callconnect.transferoutcome` string, 
> `message.callconnect.hiderecording` boolean, 
> `message.callconnect.callstartutc` string, `message.callconnect.code` string, 
> `message.callconnect.callduration` string, `message.fraudnetinfo` string, 
> `message.callconnect.answernumber` string, `message.environment.sourcedevice` 
> string, `message.comments` string, `message.fraudinfo.servervariables` 
> bigint, `message.callconnect.servicenumber` string, 
> `message.callconnect.callid` string, `message.callconnect.voicemailurl` 
> string, `message.item.stocknumber` string, 
> `message.callconnect.answerduration` string, `message.callconnect.callendutc` 
> string, `message.item.series` string, `message.item.detailsurl` string, 
> `message.item.pricetype` string, `message.item.description` string, 
> `message.item.colour` string, `message.item.badge` 

[jira] [Resolved] (SPARK-26777) SQL worked in 2.3.2 and fails in 2.4.0

2019-01-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26777.
--
Resolution: Incomplete

> SQL worked in 2.3.2 and fails in 2.4.0
> --
>
> Key: SPARK-26777
> URL: https://issues.apache.org/jira/browse/SPARK-26777
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Yuri Budilov
>Priority: Major
>
> Following SQL worked in Spark 2.3.2 and now fails on 2.4.0 (AWS EMR Spark)
>  PySpark call below:
> spark.sql("select partition_year_utc,partition_month_utc,partition_day_utc \
> from datalake_reporting.copy_of_leads_notification \
> where partition_year_utc = (select max(partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification) \
> and partition_month_utc = \
>  (select max(partition_month_utc) from 
> datalake_reporting.copy_of_leads_notification as m \
>  where \
>  m.partition_year_utc = (select max(partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification)) \
>  and partition_day_utc = (select max(d.partition_day_utc) from 
> datalake_reporting.copy_of_leads_notification as d \
>  where d.partition_month_utc = \
>  (select max(m1.partition_month_utc) from 
> datalake_reporting.copy_of_leads_notification as m1 \
>  where m1.partition_year_utc = \
>  (select max(y.partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification as y) \
>  ) \
>  ) \
>  order by 1 desc, 2 desc, 3 desc limit 1 ").show(1,False)
> Error: (no need for data, this is syntax).
> py4j.protocol.Py4JJavaError: An error occurred while calling o1326.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#4495 []
>  
> Note: all 3 columns in query are Partitioned columns - see bottom of the 
> schema)
>  
> Hive EMR AWS Schema is:
>  
> CREATE EXTERNAL TABLE `copy_of_leads_notification`(
> `message.environment.siteorigin` string, `dcpheader.dcploaddateutc` string, 
> `message.id` int, `source.properties._country` string, `message.created` 
> string, `dcpheader.generatedmessageid` string, `message.tags` bigint, 
> `source.properties._enqueuedtimeutc` string, `source.properties._leadtype` 
> string, `message.itemid` string, `message.prospect.postcode` string, 
> `message.prospect.email` string, `message.referenceid` string, 
> `message.item.year` string, `message.identifier` string, 
> `dcpheader.dcploadmonthutc` string, `message.processed` string, 
> `source.properties._tenant` string, `message.item.price` string, 
> `message.subscription.confirmresponse` boolean, `message.itemtype` string, 
> `message.prospect.lastname` string, `message.subscription.insurancequote` 
> boolean, `source.exchangename` string, 
> `message.prospect.identificationnumbers` bigint, 
> `message.environment.ipaddress` string, `dcpheader.dcploaddayutc` string, 
> `source.properties._itemtype` string, `source.properties._requesttype` 
> string, `message.item.make` string, `message.prospect.firstname` string, 
> `message.subscription.survey` boolean, `message.prospect.homephone` string, 
> `message.extendedproperties` bigint, `message.subscription.financequote` 
> boolean, `message.uniqueidentifier` string, `source.properties._id` string, 
> `dcpheader.sourcemessageguid` string, `message.requesttype` string, 
> `source.routingkey` string, `message.service` string, `message.item.model` 
> string, `message.environment.pagesource` string, `source.source` string, 
> `message.sellerid` string, `partition_date_utc` string, 
> `message.selleridentifier` string, `message.subscription.newsletter` boolean, 
> `dcpheader.dcploadyearutc` string, `message.leadtype` string, 
> `message.history` bigint, `message.callconnect.calloutcome` string, 
> `message.callconnect.datecreatedutc` string, 
> `message.callconnect.callrecordingurl` string, 
> `message.callconnect.transferoutcome` string, 
> `message.callconnect.hiderecording` boolean, 
> `message.callconnect.callstartutc` string, `message.callconnect.code` string, 
> `message.callconnect.callduration` string, `message.fraudnetinfo` string, 
> `message.callconnect.answernumber` string, `message.environment.sourcedevice` 
> string, `message.comments` string, `message.fraudinfo.servervariables` 
> bigint, `message.callconnect.servicenumber` string, 
> `message.callconnect.callid` string, `message.callconnect.voicemailurl` 
> string, `message.item.stocknumber` string, 
> `message.callconnect.answerduration` string, `message.callconnect.callendutc` 
> string, `message.item.series` string, `message.item.detailsurl` string, 
> `message.item.pricetype` string, `message.item.description` string, 
> `message.item.colour` string, `message.item.badge` string, 
> `message.item.odometer` string, `message.environment.requestheader` string, 
> 

[jira] [Commented] (SPARK-26766) Remove the list of filesystems from HadoopDelegationTokenProvider.obtainDelegationTokens

2019-01-30 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755934#comment-16755934
 ] 

Gabor Somogyi commented on SPARK-26766:
---

Considering the size of the hadoopFSsToAccess dependency, your reasons, and the 
discussion [here|https://github.com/apache/spark/pull/23686], I think it's 
better to move hadoopFSsToAccess. I have started creating a PR...

> Remove the list of filesystems from 
> HadoopDelegationTokenProvider.obtainDelegationTokens
> 
>
> Key: SPARK-26766
> URL: https://issues.apache.org/jira/browse/SPARK-26766
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Minor
>
> This was discussed in previous PR 
> [here|https://github.com/apache/spark/pull/23499/files#diff-406f99efa37915001b613de47815e25cR54].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25420) Dataset.count() every time is different.

2019-01-30 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755910#comment-16755910
 ] 

Jungtaek Lim commented on SPARK-25420:
--

[~jeffrey.mak]
Could you rerun your query against the master branch, or at least Spark 2.4? Also, 
could you provide the query plans (logical, optimized, physical) by replacing 
`.show()` with `.explain(true)`?

Your case seems suspicious - we just cannot reproduce it easily unless we have 
the full input data, or a subset of the data that is enough to reproduce it.
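
Concretely, using the variable name from the snippet in the description, that would be (a sketch):

{code:scala}
// Prints the parsed, analysed, optimised and physical plans instead of just counting.
filter.explain(true)
{code}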

> Dataset.count()  every time is different.
> -
>
> Key: SPARK-25420
> URL: https://issues.apache.org/jira/browse/SPARK-25420
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.3.0
> Environment: spark2.3
> standalone
>Reporter: huanghuai
>Priority: Major
>  Labels: SQL
>
> Dataset<Row> dataset = sparkSession.read().format("csv").option("sep", 
> ",").option("inferSchema", "true")
>  .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true")
>  .option("encoding", "UTF-8")
>  .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
> System.out.println("source count="+dataset.count());
> Dataset<Row> dropDuplicates = dataset.dropDuplicates(new 
> String[]{"DATE","TIME","VEL","COMPANY"});
> System.out.println("dropDuplicates count1="+dropDuplicates.count());
> System.out.println("dropDuplicates count2="+dropDuplicates.count());
> Dataset<Row> filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 
> and (status = 0 or status = 1)");
> System.out.println("filter count1="+filter.count());
> System.out.println("filter count2="+filter.count());
> System.out.println("filter count3="+filter.count());
> System.out.println("filter count4="+filter.count());
> System.out.println("filter count5="+filter.count());
>  
>  
> --The above is code 
> ---
>  
>  
> console output:
> source count=459275
> dropDuplicates count1=453987
> dropDuplicates count2=453987
> filter count1=445798
> filter count2=445797
> filter count3=445797
> filter count4=445798
> filter count5=445799
>  
> question:
>  
> Why is filter.count() different every time?
> If I remove dropDuplicates(), everything is OK.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26779) NullPointerException when disable wholestage codegen

2019-01-30 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755881#comment-16755881
 ] 

Marco Gaido commented on SPARK-26779:
-

I'd say this is most likely just a duplicate of SPARK-23731.
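
For reference, the codegen switch mentioned in the title can be toggled per session like this (a sketch, run before the failing query):

{code:scala}
// Disable whole-stage codegen for the current Spark session.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
{code}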

> NullPointerException when disable wholestage codegen
> 
>
> Key: SPARK-26779
> URL: https://issues.apache.org/jira/browse/SPARK-26779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiaoju Wu
>Priority: Trivial
>
> When running TPCDSSuite with wholestage codegen disabled, NPE is thrown in q9:
> java.lang.NullPointerException at
> org.apache.spark.sql.execution.FileSourceScanExec.<init>(DataSourceScanExec.scala:170)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:613)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:160)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.immutable.List.map(List.scala:285) at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.immutable.List.map(List.scala:285) at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.immutable.List.map(List.scala:285) at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.immutable.List.map(List.scala:285) at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> 

[jira] [Commented] (SPARK-25420) Dataset.count() every time is different.

2019-01-30 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755837#comment-16755837
 ] 

Marco Gaido commented on SPARK-25420:
-

[~jeffrey.mak] I cannot reproduce your issue on the current master branch. I 
created a test.csv file with the data you provided above and ran:

{code}
scala> val drkcard_0_df = spark.read.csv("test.csv")
drkcard_0_df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 4 
more fields]

scala> drkcard_0_df.show()
+---+---+---+---+++
|_c0|_c1|_c2|_c3| _c4| _c5|
+---+---+---+---+++
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...|John|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...| Tom|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...|Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...|   Mabel|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|   James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|Laurence|
+---+---+---+---+++

scala> val dropDup_0 = 
drkcard_0_df.dropDuplicates(Seq("_c0","_c1","_c2","_c3","_c4"))
dropDup_0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_c0: 
string, _c1: string ... 4 more fields]

scala> dropDup_0.show()
+---+---+---+---++-+
|_c0|_c1|_c2|_c3| _c4|  _c5|
+---+---+---+---++-+
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...| Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...| John|
+---+---+---+---++-+


scala> dropDup_0.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and 
_c2='83' and _c4='180919212732008200218'").show()
+---+---+---+---++-+
|_c0|_c1|_c2|_c3| _c4|  _c5|
+---+---+---+---++-+
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...| Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...| John|
+---+---+---+---++-+


scala> dropDup_0.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and 
_c2='83' and _c4='180919212732008200218'").show()
+---+---+---+---++-+
|_c0|_c1|_c2|_c3| _c4|  _c5|
+---+---+---+---++-+
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...| Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...| John|
+---+---+---+---++-+


scala> dropDup_0.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and 
_c2='83' and _c4='180919212732008200218'").show()
+---+---+---+---++-+
|_c0|_c1|_c2|_c3| _c4|  _c5|
+---+---+---+---++-+
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...| Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...| John|
+---+---+---+---++-+


scala> dropDup_0.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and 
_c2='83' and _c4='180919212732008200218'").show()
+---+---+---+---++-+
|_c0|_c1|_c2|_c3| _c4|  _c5|
+---+---+---+---++-+
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...| Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...| John|
+---+---+---+---++-+


scala> dropDup_0.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and 
_c2='83' and _c4='180919212732008200218'").show()
+---+---+---+---++-+
|_c0|_c1|_c2|_c3| _c4|  _c5|
+---+---+---+---++-+
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...| Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...| 
