[jira] [Commented] (SPARK-26945) Python streaming tests flaky while cleaning temp directories after StreamingQuery.stop

2019-02-22 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775815#comment-16775815
 ] 

Hyukjin Kwon commented on SPARK-26945:
--

Thanks for reporting this, [~abellina]

> Python streaming tests flaky while cleaning temp directories after 
> StreamingQuery.stop
> --
>
> Key: SPARK-26945
> URL: https://issues.apache.org/jira/browse/SPARK-26945
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Alessandro Bellina
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> From the test code, it seems like the `shutil.rmtree` function is trying to 
> delete a directory, but there's likely another thread still adding entries to 
> the directory, so when it gets to `os.rmdir(path)` it blows up. I think the 
> test (and other streaming tests) should call `q.awaitTermination` after 
> `q.stop`, before going on. I'll file a separate jira.
> {noformat}
> ERROR: test_query_manager_await_termination 
> (pyspark.sql.tests.test_streaming.StreamingTests)
> --
> Traceback (most recent call last):
>  File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests/test_streaming.py",
>  line 259, in test_query_manager_await_termination
>  shutil.rmtree(tmpPath)
>  File "/home/anaconda/lib/python2.7/shutil.py", line 256, in rmtree
>  onerror(os.rmdir, path, sys.exc_info())
>  File "/home/anaconda/lib/python2.7/shutil.py", line 254, in rmtree
>  os.rmdir(path)
> OSError: [Errno 39] Directory not empty: 
> '/home/jenkins/workspace/SparkPullRequestBuilder/python/target/072153bd-f981-47be-bda2-e2b657a16f65/tmp4WGp7n'{noformat}
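For reference, a minimal sketch of the stop-then-await ordering the description suggests (this is not the test change from the PR; the local session, "rate" source, and temp paths are assumptions for illustration, shown in Scala for brevity -- the PySpark StreamingQuery API has the same stop()/awaitTermination() calls):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("stop-then-await").getOrCreate()
val sinkPath = java.nio.file.Files.createTempDirectory("sink").toString
val chkPath  = java.nio.file.Files.createTempDirectory("chk").toString

val query = spark.readStream.format("rate").load()
  .writeStream.format("parquet")
  .option("path", sinkPath)
  .option("checkpointLocation", chkPath)
  .start()

query.stop()
query.awaitTermination()  // block until the query's threads have fully terminated
// Only after awaitTermination returns is it safe to delete sinkPath/chkPath;
// deleting earlier can race with the query still writing files, which is the
// "Directory not empty" failure in the traceback above.
{code}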






[jira] [Resolved] (SPARK-26945) Python streaming tests flaky while cleaning temp directories after StreamingQuery.stop

2019-02-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26945.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23870
[https://github.com/apache/spark/pull/23870]

> Python streaming tests flaky while cleaning temp directories after 
> StreamingQuery.stop
> --
>
> Key: SPARK-26945
> URL: https://issues.apache.org/jira/browse/SPARK-26945
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Alessandro Bellina
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> From the test code, it seems like the `shutil.rmtree` function is trying to 
> delete a directory, but there's likely another thread still adding entries to 
> the directory, so when it gets to `os.rmdir(path)` it blows up. I think the 
> test (and other streaming tests) should call `q.awaitTermination` after 
> `q.stop`, before going on. I'll file a separate jira.
> {noformat}
> ERROR: test_query_manager_await_termination 
> (pyspark.sql.tests.test_streaming.StreamingTests)
> --
> Traceback (most recent call last):
>  File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests/test_streaming.py",
>  line 259, in test_query_manager_await_termination
>  shutil.rmtree(tmpPath)
>  File "/home/anaconda/lib/python2.7/shutil.py", line 256, in rmtree
>  onerror(os.rmdir, path, sys.exc_info())
>  File "/home/anaconda/lib/python2.7/shutil.py", line 254, in rmtree
>  os.rmdir(path)
> OSError: [Errno 39] Directory not empty: 
> '/home/jenkins/workspace/SparkPullRequestBuilder/python/target/072153bd-f981-47be-bda2-e2b657a16f65/tmp4WGp7n'{noformat}






[jira] [Assigned] (SPARK-26945) Python streaming tests flaky while cleaning temp directories after StreamingQuery.stop

2019-02-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-26945:


Assignee: Hyukjin Kwon

> Python streaming tests flaky while cleaning temp directories after 
> StreamingQuery.stop
> --
>
> Key: SPARK-26945
> URL: https://issues.apache.org/jira/browse/SPARK-26945
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Alessandro Bellina
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> From the test code, it seems like the `shutil.rmtree` function is trying to 
> delete a directory, but there's likely another thread still adding entries to 
> the directory, so when it gets to `os.rmdir(path)` it blows up. I think the 
> test (and other streaming tests) should call `q.awaitTermination` after 
> `q.stop`, before going on. I'll file a separate jira.
> {noformat}
> ERROR: test_query_manager_await_termination 
> (pyspark.sql.tests.test_streaming.StreamingTests)
> --
> Traceback (most recent call last):
>  File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests/test_streaming.py",
>  line 259, in test_query_manager_await_termination
>  shutil.rmtree(tmpPath)
>  File "/home/anaconda/lib/python2.7/shutil.py", line 256, in rmtree
>  onerror(os.rmdir, path, sys.exc_info())
>  File "/home/anaconda/lib/python2.7/shutil.py", line 254, in rmtree
>  os.rmdir(path)
> OSError: [Errno 39] Directory not empty: 
> '/home/jenkins/workspace/SparkPullRequestBuilder/python/target/072153bd-f981-47be-bda2-e2b657a16f65/tmp4WGp7n'{noformat}






[jira] [Commented] (SPARK-16183) Large Spark SQL commands cause StackOverflowError in parser when using sqlContext.sql

2019-02-22 Thread Matt Saunders (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775788#comment-16775788
 ] 

Matt Saunders commented on SPARK-16183:
---

It appears that this problem still occurs as of Feb 2019 on Spark 2.4.0. 
As a workaround, you can use Dataset.checkpoint (or .localCheckpoint) to 
truncate the logical plan of the Dataset between transformations and avoid the 
stack overflow error.
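A minimal sketch of that workaround (assumptions: an existing SparkSession {{spark}} with a checkpoint directory configured, and made-up table names with compatible schemas; this is not code from the ticket):

{code:scala}
import org.apache.spark.sql.DataFrame

// Hypothetical stand-in for the auto-generated query: combine 200 input tables.
val inputs: Seq[DataFrame] = (1 to 200).map(i => spark.table(s"input_$i"))

// Truncate the logical plan every 20 inputs so no single plan (and no single
// parse/analysis pass) grows unboundedly deep.
val merged = inputs.grouped(20)
  .map(group => group.reduce(_ union _).localCheckpoint())  // or .checkpoint()
  .reduce(_ union _)
{code}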

> Large Spark SQL commands cause StackOverflowError in parser when using 
> sqlContext.sql
> -
>
> Key: SPARK-16183
> URL: https://issues.apache.org/jira/browse/SPARK-16183
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.6.1, 2.0.0
> Environment: Running on AWS EMR
>Reporter: Matthew Porter
>Priority: Major
>
> Hi,
> I have created a PySpark SQL-based tool which auto-generates a complex SQL 
> command to be run via sqlContext.sql(cmd) based on a large number of 
> parameters. As the number of input files to be filtered and joined in this 
> query grows, so does the length of the SQL query. The tool runs fine until 
> roughly 200 files are included in the join, at which point the SQL command 
> becomes very long (~100K characters). It is only on these longer queries that 
> Spark fails, throwing an exception due to what seems to be too much recursion 
> occurring within the SparkSQL parser:
> {code}
> Traceback (most recent call last):
> ...
> merged_df = sqlsc.sql(cmd)
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 
> 580, in sql
>   File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", 
> line 813, in __call__
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, 
> in deco
>   File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 
> 308, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o173.sql.
> : java.lang.StackOverflowError
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.sca

[jira] [Commented] (SPARK-26977) Warn against subclassing scala.App doesn't work

2019-02-22 Thread Manu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775783#comment-16775783
 ] 

Manu Zhang commented on SPARK-26977:


I'd love to

> Warn against subclassing scala.App doesn't work
> ---
>
> Key: SPARK-26977
> URL: https://issues.apache.org/jira/browse/SPARK-26977
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Manu Zhang
>Priority: Minor
>
> As per discussion in 
> [PR#3497|https://github.com/apache/spark/pull/3497#discussion_r258412735], 
> the warning against subclassing scala.App doesn't work. For example,
> {code:scala}
> object Test extends scala.App {
>// spark code
> }
> {code}
> Scala will compile {{object Test}} into two Java classes: {{Test}}, passed in 
> by the user, and {{Test$}}, which subclasses {{scala.App}}. The current code 
> checks against {{Test}}, so there is no warning when the user's application 
> subclasses {{scala.App}}.






[jira] [Resolved] (SPARK-25574) Add an option `keepQuotes` for parsing csv file

2019-02-22 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian resolved SPARK-25574.
-
Resolution: Invalid

> Add an option `keepQuotes` for parsing csv  file
> 
>
> Key: SPARK-25574
> URL: https://issues.apache.org/jira/browse/SPARK-25574
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> In our project, when we read the CSV file, we hope to keep quotes.
> For example:
> We have such a record in the CSV file:
> *ab,cc,,"c,ddd"*
> We hope it displays like this:
> |_c0|_c1|_c2|    _c3|
> |  ab|cc  |null|*"c,ddd"*|
>  
> Not like this:
> |_c0|_c1|_c2|  _c3|
> |  ab|cc  |null|c,ddd|






[jira] [Commented] (SPARK-26977) Warn against subclassing scala.App doesn't work

2019-02-22 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775776#comment-16775776
 ] 

Sean Owen commented on SPARK-26977:
---

Sure, would you like to make a pull request?

> Warn against subclassing scala.App doesn't work
> ---
>
> Key: SPARK-26977
> URL: https://issues.apache.org/jira/browse/SPARK-26977
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Manu Zhang
>Priority: Minor
>
> As per discussion in 
> [PR#3497|https://github.com/apache/spark/pull/3497#discussion_r258412735], 
> the warning against subclassing scala.App doesn't work. For example,
> {code:scala}
> object Test extends scala.App {
>// spark code
> }
> {code}
> Scala will compile {{object Test}} into two Java classes: {{Test}}, passed in 
> by the user, and {{Test$}}, which subclasses {{scala.App}}. The current code 
> checks against {{Test}}, so there is no warning when the user's application 
> subclasses {{scala.App}}.






[jira] [Comment Edited] (SPARK-26809) insert overwrite directory + concat function => error

2019-02-22 Thread Alessandro Bellina (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775772#comment-16775772
 ] 

Alessandro Bellina edited comment on SPARK-26809 at 2/23/19 3:25 AM:
-

This does it. Didn't need the limit to reproduce:

{noformat}
insert overwrite directory '/tmp/SPARK-26809' 
select concat(col1, col2) 
from ((select "foo" as col1, "bar" as col2));
{noformat}

This also triggers it:

{noformat}
insert overwrite directory '/tmp/SPARK-26809' 
select concat("foo", "bar") 
{noformat}

{noformat}
Caused by: org.apache.hadoop.hive.serde2.SerDeException: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
while columns.types has 1 elements!
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145)
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85)
at 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
at 
org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:121)
at 
org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:104)
at 
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:124)
at 
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:109)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$14(FileFormatWriter.scala:177)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:426)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1319)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:429)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}



was (Author: abellina):
This does it. Didn't need the limit to reproduce:

{noformat}
insert overwrite directory '/tmp/SPARK-26809' 
select concat(col1, col2) 
{noformat}

This also triggers it:

{noformat}
insert overwrite directory '/tmp/SPARK-26809' 
select concat(col1, col2) 
from ((select "foo" as col1, "bar" as col2));
{noformat}

{noformat}
Caused by: org.apache.hadoop.hive.serde2.SerDeException: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
while columns.types has 1 elements!
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145)
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85)
at 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
at 
org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:121)
at 
org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:104)
at 
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:124)
at 
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:109)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$14(FileFormatWriter.scala:177)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:426)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1319)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:429)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}


> insert overwrite directory + concat function => error
> -
>
> Key: SPARK-26809
> URL: https://issues.apache.org/jira/browse/SPARK-26809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: ant_nebula
>Priority: Critical
>
> insert 

[jira] [Comment Edited] (SPARK-26809) insert overwrite directory + concat function => error

2019-02-22 Thread Alessandro Bellina (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775772#comment-16775772
 ] 

Alessandro Bellina edited comment on SPARK-26809 at 2/23/19 3:25 AM:
-

This does it. Didn't need the limit to reproduce:

{noformat}
insert overwrite directory '/tmp/SPARK-26809' 
select concat(col1, col2) 
{noformat}

This also triggers it:

{noformat}
insert overwrite directory '/tmp/SPARK-26809' 
select concat(col1, col2) 
from ((select "foo" as col1, "bar" as col2));
{noformat}

{noformat}
Caused by: org.apache.hadoop.hive.serde2.SerDeException: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
while columns.types has 1 elements!
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145)
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85)
at 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
at 
org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:121)
at 
org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:104)
at 
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:124)
at 
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:109)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$14(FileFormatWriter.scala:177)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:426)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1319)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:429)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}



was (Author: abellina):
This does it. Didn't need the limit to reproduce:

{noformat}
insert overwrite directory '/tmp/SPARK-26809' 
select concat(col1, col2) 
from ((select "foo" as col1, "bar" as col2));
{noformat}

This also triggers it:

{noformat}
insert overwrite directory '/tmp/SPARK-26809' 
select concat(col1, col2) 
from ((select "foo" as col1, "bar" as col2));
{noformat}

{noformat}
Caused by: org.apache.hadoop.hive.serde2.SerDeException: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
while columns.types has 1 elements!
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145)
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85)
at 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
at 
org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:121)
at 
org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:104)
at 
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:124)
at 
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:109)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$14(FileFormatWriter.scala:177)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:426)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1319)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:429)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}


> insert overwrite directory + concat function => error
> -
>
> Key: SPARK-26809
> URL: https://issues.apache.org/jira/browse/SPARK-26809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: ant_nebula

[jira] [Comment Edited] (SPARK-26809) insert overwrite directory + concat function => error

2019-02-22 Thread Alessandro Bellina (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775772#comment-16775772
 ] 

Alessandro Bellina edited comment on SPARK-26809 at 2/23/19 3:25 AM:
-

This does it. Didn't need the limit to reproduce:

{noformat}
insert overwrite directory '/tmp/SPARK-26809' 
select concat(col1, col2) 
from ((select "foo" as col1, "bar" as col2));
{noformat}

This also triggers it:

{noformat}
insert overwrite directory '/tmp/SPARK-26809' 
select concat(col1, col2) 
from ((select "foo" as col1, "bar" as col2));
{noformat}

{noformat}
Caused by: org.apache.hadoop.hive.serde2.SerDeException: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
while columns.types has 1 elements!
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145)
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85)
at 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
at 
org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:121)
at 
org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:104)
at 
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:124)
at 
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:109)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$14(FileFormatWriter.scala:177)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:426)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1319)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:429)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}



was (Author: abellina):
This does it. Didn't need the limit to reproduce:

{noformat}
insert overwrite directory '/tmp/SPARK-26809' 
select concat(col1, col2) 
from ((select "foo" as col1, "bar" as col2));
{noformat}

{noformat}
Caused by: org.apache.hadoop.hive.serde2.SerDeException: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
while columns.types has 1 elements!
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145)
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85)
at 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
at 
org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:121)
at 
org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:104)
at 
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:124)
at 
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:109)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$14(FileFormatWriter.scala:177)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:426)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1319)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:429)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}


> insert overwrite directory + concat function => error
> -
>
> Key: SPARK-26809
> URL: https://issues.apache.org/jira/browse/SPARK-26809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: ant_nebula
>Priority: Critical
>
> insert overwrite directory '/tmp/xx'
> select concat(col1, col2)
> from tableXX
> li

[jira] [Created] (SPARK-26977) Warn against subclassing scala.App doesn't work

2019-02-22 Thread Manu Zhang (JIRA)
Manu Zhang created SPARK-26977:
--

 Summary: Warn against subclassing scala.App doesn't work
 Key: SPARK-26977
 URL: https://issues.apache.org/jira/browse/SPARK-26977
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.4.0
Reporter: Manu Zhang


As per discussion in 
[PR#3497|https://github.com/apache/spark/pull/3497#discussion_r258412735], the 
warning against subclassing scala.App doesn't work. For example,


{code:scala}
object Test extends scala.App {
   // spark code
}
{code}

Scala will compile {{object Test}} into two Java classes: {{Test}}, passed in by 
the user, and {{Test$}}, which subclasses {{scala.App}}. The current code checks 
against {{Test}}, so there is no warning when the user's application subclasses 
{{scala.App}}.
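A hypothetical sketch of a check that would also catch the companion module class (this is not the actual SparkSubmit code; the helper name and signature are made up for illustration):

{code:scala}
// `object Test extends scala.App` compiles to `Test` (static forwarders only)
// and `Test$` (the singleton that actually extends scala.App), so check both.
def extendsScalaApp(mainClass: String, loader: ClassLoader): Boolean = {
  def isApp(name: String): Boolean =
    try classOf[scala.App].isAssignableFrom(Class.forName(name, false, loader))
    catch { case _: ClassNotFoundException => false }
  isApp(mainClass) || isApp(mainClass + "$")
}
{code}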






[jira] [Commented] (SPARK-26809) insert overwrite directory + concat function => error

2019-02-22 Thread Alessandro Bellina (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775772#comment-16775772
 ] 

Alessandro Bellina commented on SPARK-26809:


This does it. Didn't need the limit to reproduce:

{noformat}
insert overwrite directory '/tmp/SPARK-26809' 
select concat(col1, col2) 
from ((select "foo" as col1, "bar" as col2));
{noformat}

{noformat}
Caused by: org.apache.hadoop.hive.serde2.SerDeException: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
while columns.types has 1 elements!
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145)
at 
org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85)
at 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
at 
org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:121)
at 
org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:104)
at 
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:124)
at 
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:109)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$14(FileFormatWriter.scala:177)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:426)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1319)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:429)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}


> insert overwrite directory + concat function => error
> -
>
> Key: SPARK-26809
> URL: https://issues.apache.org/jira/browse/SPARK-26809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: ant_nebula
>Priority: Critical
>
> insert overwrite directory '/tmp/xx'
> select concat(col1, col2)
> from tableXX
> limit 3
>  
> Caused by: org.apache.hadoop.hive.serde2.SerDeException: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 3 elements 
> while columns.types has 2 elements!
>  at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145)
>  at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85)
>  at 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
>  at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:119)
>  at 
> org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
>  at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
>  at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:108)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:233)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>  at org.apache.spark.scheduler.Task.run(Task.scala:121)
>  at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)






[jira] [Assigned] (SPARK-24615) Accelerator-aware task scheduling for Spark

2019-02-22 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-24615:
-

Assignee: Xingbo Jiang

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Xingbo Jiang
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work with 
> accelerator cards, Spark itself should understand the existence of accelerators 
> and know how to schedule tasks onto executors that are equipped with them.
> Spark's current scheduler assigns tasks based on data locality plus the 
> availability of CPUs. This introduces some problems when scheduling tasks that 
> require accelerators.
>  # CPU cores usually outnumber accelerators on a node, so using CPU cores to 
> schedule accelerator-required tasks creates a mismatch.
>  # In a cluster we can always assume that every node has CPUs, but this is not 
> true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) requires 
> the scheduler to schedule tasks in a smarter way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in the Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Updated] (SPARK-26939) Fix some outdated comments about task schedulers

2019-02-22 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-26939:
-
Description: 
Some comments about task schedulers are outdated. They should be fixed.

* YarnClusterScheduler comments: reference to ClusterScheduler which is not 
used anymore.
* TaskSetManager comments: method statusUpdate does not exist as of now.

  was:
Some comments about task schedulers are outdated. They should be fixed.

* TaskScheduler comments: currently implemented exclusively by 
  org.apache.spark.scheduler.TaskSchedulerImpl. This is not true as of now.
* YarnClusterScheduler comments: reference to ClusterScheduler which is not 
used anymore.
* TaskSetManager comments: method statusUpdate does not exist as of now.


> Fix some outdated comments about task schedulers
> 
>
> Key: SPARK-26939
> URL: https://issues.apache.org/jira/browse/SPARK-26939
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Minor
>
> Some comments about task schedulers are outdated. They should be fixed.
> * YarnClusterScheduler comments: reference to ClusterScheduler which is not 
> used anymore.
> * TaskSetManager comments: method statusUpdate does not exist as of now.






[jira] [Assigned] (SPARK-26975) Support nested-column pruning over limit/sample/repartition

2019-02-22 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-26975:
-

Assignee: Dongjoon Hyun

> Support nested-column pruning over limit/sample/repartition
> ---
>
> Key: SPARK-26975
> URL: https://issues.apache.org/jira/browse/SPARK-26975
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> As the benchmark in SPARK-26958 shows, nested-column pruning has limitations. 
> This issue aims to remove those limitations for `limit/repartition/sample`. In 
> this issue, repartition means `Repartition`, not `RepartitionByExpression`.






[jira] [Resolved] (SPARK-26215) define reserved keywords after SQL standard

2019-02-22 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-26215.
--
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 3.0.0

Resolved by [https://github.com/apache/spark/pull/23259]

> define reserved keywords after SQL standard
> ---
>
> Key: SPARK-26215
> URL: https://issues.apache.org/jira/browse/SPARK-26215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.0.0
>
>
> There are 2 kinds of SQL keywords: reserved and non-reserved. Reserved 
> keywords can't be used as identifiers.
> In Spark SQL, we are too tolerant about non-reserved keywords. A lot of 
> keywords are non-reserved and sometimes it causes ambiguity (IIRC we hit a 
> problem when improving the INTERVAL syntax).
> I think it will be better to just follow other databases or the SQL standard to 
> define reserved keywords, so that we don't need to think very hard about how 
> to avoid ambiguity.
> For reference: https://www.postgresql.org/docs/8.1/sql-keywords-appendix.html






[jira] [Created] (SPARK-26976) Forbid reserved keywords as identifiers when ANSI mode is on

2019-02-22 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-26976:


 Summary: Forbid reserved keywords as identifiers when ANSI mode is 
on
 Key: SPARK-26976
 URL: https://issues.apache.org/jira/browse/SPARK-26976
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Takeshi Yamamuro
Assignee: Takeshi Yamamuro


We need to throw an exception to forbid reserved keywords as identifiers when 
ANSI mode is on.

This is a follow-up of SPARK-26215.






[jira] [Assigned] (SPARK-26918) All .md should have ASF license header

2019-02-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26918:


Assignee: (was: Apache Spark)

> All .md should have ASF license header
> --
>
> Key: SPARK-26918
> URL: https://issues.apache.org/jira/browse/SPARK-26918
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Felix Cheung
>Priority: Major
>
> Per policy, all .md files should have the ASF license header, e.g. 
> [https://raw.githubusercontent.com/apache/arrow/master/docs/README.md]
>  or
> [https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md]
>  
> Currently this file does not:
> [https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] 






[jira] [Assigned] (SPARK-26918) All .md should have ASF license header

2019-02-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26918:


Assignee: Apache Spark

> All .md should have ASF license header
> --
>
> Key: SPARK-26918
> URL: https://issues.apache.org/jira/browse/SPARK-26918
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Felix Cheung
>Assignee: Apache Spark
>Priority: Major
>
> Per policy, all .md files should have the ASF license header, e.g. 
> [https://raw.githubusercontent.com/apache/arrow/master/docs/README.md]
>  or
> [https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md]
>  
> Currently this file does not:
> [https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] 






[jira] [Comment Edited] (SPARK-26918) All .md should have ASF license header

2019-02-22 Thread Mani M (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775673#comment-16775673
 ] 

Mani M edited comment on SPARK-26918 at 2/22/19 11:02 PM:
--

I can take up this change


was (Author: rmsm...@gmail.com):
I can take up this project

> All .md should have ASF license header
> --
>
> Key: SPARK-26918
> URL: https://issues.apache.org/jira/browse/SPARK-26918
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Felix Cheung
>Priority: Major
>
> Per policy, all .md files should have the ASF license header, e.g. 
> [https://raw.githubusercontent.com/apache/arrow/master/docs/README.md]
>  or
> [https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md]
>  
> Currently this file does not:
> [https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] 






[jira] [Commented] (SPARK-26918) All .md should have ASF license header

2019-02-22 Thread Mani M (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775673#comment-16775673
 ] 

Mani M commented on SPARK-26918:


I can take up this project

> All .md should have ASF license header
> --
>
> Key: SPARK-26918
> URL: https://issues.apache.org/jira/browse/SPARK-26918
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Felix Cheung
>Priority: Major
>
> Per policy, all .md files should have the ASF license header, e.g. 
> [https://raw.githubusercontent.com/apache/arrow/master/docs/README.md]
>  or
> [https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md]
>  
> Currently this file does not:
> [https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] 






[jira] [Resolved] (SPARK-26651) Use Proleptic Gregorian calendar

2019-02-22 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk resolved SPARK-26651.

Resolution: Done

> Use Proleptic Gregorian calendar
> 
>
> Key: SPARK-26651
> URL: https://issues.apache.org/jira/browse/SPARK-26651
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>  Labels: ReleaseNote
>
> Spark 2.4 and previous versions use a hybrid calendar (Julian + Gregorian) in 
> date/timestamp parsing, functions and expressions. This ticket aims to switch 
> Spark to the Proleptic Gregorian calendar and to use the java.time classes 
> introduced in Java 8 for timestamp/date manipulations. One purpose of switching 
> to the Proleptic Gregorian calendar is to conform to the SQL standard, which 
> assumes such a calendar.
> *Release note:*
> Spark 3.0 has switched to the Proleptic Gregorian calendar for parsing, 
> formatting, and converting dates and timestamps, as well as for extracting 
> sub-components such as years and days. It uses Java 8 API classes from the 
> java.time packages, which are based on [ISO chronology 
> |https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html].
>  Previous versions of Spark performed those operations using [the hybrid 
> calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html]
>  (Julian + Gregorian). The changes might impact the results for dates and 
> timestamps before October 15, 1582 (Gregorian).
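A small illustration (not from the ticket) of why results can differ for old dates, using plain JVM APIs: the ISO chronology applies the Gregorian leap-year rule to every year, while the hybrid calendar uses Julian rules before the 1582 cutover.

{code:scala}
import java.time.chrono.IsoChronology
import java.util.GregorianCalendar

// Proleptic Gregorian (java.time / ISO chronology): Gregorian rule for every year
IsoChronology.INSTANCE.isLeapYear(1500)     // false (divisible by 100 but not by 400)

// Hybrid Julian + Gregorian (java.util): Julian rule before the October 1582 cutover
new GregorianCalendar().isLeapYear(1500)    // true (every 4th year is a leap year)
{code}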






[jira] [Assigned] (SPARK-26774) Document threading concerns in TaskSchedulerImpl

2019-02-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26774:


Assignee: Apache Spark

> Document threading concerns in TaskSchedulerImpl
> 
>
> Key: SPARK-26774
> URL: https://issues.apache.org/jira/browse/SPARK-26774
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: Apache Spark
>Priority: Major
>
> TaskSchedulerImpl has a couple of places where threading concerns aren't 
> clearly documented, which could be improved a little.  There is also a race in 
> {{killTaskAttempt}} on {{taskIdToExecutorId}} (though I think nobody actually 
> uses this api).






[jira] [Assigned] (SPARK-26774) Document threading concerns in TaskSchedulerImpl

2019-02-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26774:


Assignee: (was: Apache Spark)

> Document threading concerns in TaskSchedulerImpl
> 
>
> Key: SPARK-26774
> URL: https://issues.apache.org/jira/browse/SPARK-26774
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> TaskSchedulerImpl has a couple of places where threading concerns aren't 
> clearly documented, which could be improved a little.  There is also a race in 
> {{killTaskAttempt}} on {{taskIdToExecutorId}} (though I think nobody actually 
> uses this api).






[jira] [Updated] (SPARK-26774) Document threading concerns in TaskSchedulerImpl

2019-02-22 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-26774:
-
Description: TaskSchedulerImpl has a couple of places where threading concerns 
aren't clearly documented, which could be improved a little.  There is also a race 
in {{killTaskAttempt}} on {{taskIdToExecutorId}} (though I think nobody 
actually uses this api).  (was: TaskSchedulerImpl has a bunch of threading 
concerns, which are not well documented -- in fact the docs it has are somewhat 
misleading.  In particular, some of the methods should only be called within 
the DAGScheduler event loop.

This suggests some potential refactoring to avoid so many mixed concerns inside 
TaskSchedulerImpl, but that's a lot harder to do safely, I just want to add 
some comments.)

> Document threading concerns in TaskSchedulerImpl
> 
>
> Key: SPARK-26774
> URL: https://issues.apache.org/jira/browse/SPARK-26774
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> TaskSchedulerImpl has a couple of places where threading concerns aren't 
> clearly documented, which could be improved a little.  There is also a race in 
> {{killTaskAttempt}} on {{taskIdToExecutorId}} (though I think nobody actually 
> uses this api).






[jira] [Updated] (SPARK-26950) Make RandomDataGenerator use Float.NaN or Double.NaN for all NaN values

2019-02-22 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26950:
--
Fix Version/s: 2.3.4

> Make RandomDataGenerator use Float.NaN or Double.NaN for all NaN values
> ---
>
> Key: SPARK-26950
> URL: https://issues.apache.org/jira/browse/SPARK-26950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.4, 2.4.2, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.4, 2.4.1, 3.0.0
>
>
> Apache Spark uses the predefined `Float.NaN` and `Double.NaN` for NaN values, 
> but there exist more NaN values with different binary representations.
> {code}
> scala> java.nio.ByteBuffer.allocate(4).putFloat(Float.NaN).array
> res1: Array[Byte] = Array(127, -64, 0, 0)
> scala> val x = java.lang.Float.intBitsToFloat(-6966608)
> x: Float = NaN
> scala> java.nio.ByteBuffer.allocate(4).putFloat(x).array
> res2: Array[Byte] = Array(-1, -107, -78, -80)
> {code}
> `RandomDataGenerator` generates these NaN values. That's good, but it causes 
> `checkEvaluationWithUnsafeProjection` failures due to differences in the 
> `UnsafeRow` binary representation. The following is a UT failure instance. 
> This issue aims to fix this flakiness.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102528/testReport/
> {code}
> Failed
> org.apache.spark.sql.avro.AvroCatalystDataConversionSuite.flat schema 
> struct
>  with seed -81044812370056695
> {code}
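A minimal sketch (not the actual RandomDataGenerator change) of normalizing any NaN to the canonical bit pattern so that binary comparisons see identical bytes:

{code:scala}
// All NaNs compare as NaN, but only the canonical Float.NaN / Double.NaN share one bit pattern.
def normalizeFloat(f: Float): Float = if (f.isNaN) Float.NaN else f
def normalizeDouble(d: Double): Double = if (d.isNaN) Double.NaN else d

val weirdNaN = java.lang.Float.intBitsToFloat(-6966608)   // the non-canonical NaN from the description
java.lang.Float.floatToRawIntBits(weirdNaN) == java.lang.Float.floatToRawIntBits(Float.NaN)                 // false
java.lang.Float.floatToRawIntBits(normalizeFloat(weirdNaN)) == java.lang.Float.floatToRawIntBits(Float.NaN) // true
{code}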






[jira] [Created] (SPARK-26975) Support nested-column pruning over limit/sample/repartition

2019-02-22 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-26975:
-

 Summary: Support nested-column pruning over 
limit/sample/repartition
 Key: SPARK-26975
 URL: https://issues.apache.org/jira/browse/SPARK-26975
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


As the benchmark in SPARK-26958 shows, nested-column pruning has limitations. This 
issue aims to remove those limitations for `limit/repartition/sample`. In this 
issue, repartition means `Repartition`, not `RepartitionByExpression`.
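A sketch of the kind of query this targets (the path and nested schema below are made up for illustration):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("nested-pruning-sketch").getOrCreate()
// Assume `person` is a struct column with a nested field `address.city`.
val df = spark.read.parquet("/path/to/events")

// With pruning pushed through limit/repartition/sample, only person.address.city
// would need to be read from the Parquet files for these queries.
df.limit(100).select("person.address.city").show()
df.repartition(10).select("person.address.city").show()
df.sample(0.1).select("person.address.city").show()
{code}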






[jira] [Commented] (SPARK-22860) Spark workers log ssl passwords passed to the executors

2019-02-22 Thread tooptoop4 (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775630#comment-16775630
 ] 

tooptoop4 commented on SPARK-22860:
---

[~kabhwan] spark.ssl.keyStorePassword and spark.ssl.keyPassword don't need to 
be passed to CoarseGrainedExecutorBackend. Only spark.ssl.trustStorePassword 
is used.

> Spark workers log ssl passwords passed to the executors
> ---
>
> Key: SPARK-22860
> URL: https://issues.apache.org/jira/browse/SPARK-22860
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Felix K.
>Priority: Major
>
> The workers log the spark.ssl.keyStorePassword and 
> spark.ssl.trustStorePassword passed via the CLI to the executor processes. The 
> ExecutorRunner should mask passwords so they do not appear in the worker's log 
> files at INFO level. In this example, you can see my 'SuperSecretPassword' in 
> a worker log:
> {code}
> 17/12/08 08:04:12 INFO ExecutorRunner: Launch command: 
> "/global/myapp/oem/jdk/bin/java" "-cp" 
> "/global/myapp/application/myapp_software/thing_loader_lib/core-repository-model-zzz-1.2.3-SNAPSHOT.jar
> [...]
> :/global/myapp/application/spark-2.1.1-bin-hadoop2.7/jars/*" "-Xmx16384M" 
> "-Dspark.authenticate.enableSaslEncryption=true" 
> "-Dspark.ssl.keyStorePassword=SuperSecretPassword" 
> "-Dspark.ssl.keyStore=/global/myapp/application/config/ssl/keystore.jks" 
> "-Dspark.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks" 
> "-Dspark.ssl.enabled=true" "-Dspark.driver.port=39927" 
> "-Dspark.ssl.protocol=TLS" 
> "-Dspark.ssl.trustStorePassword=SuperSecretPassword" 
> "-Dspark.authenticate=true" "-Dmyapp_IMPORT_DATE=2017-10-30" 
> "-Dmyapp.config.directory=/global/myapp/application/config" 
> "-Dsolr.httpclient.builder.factory=com.company.myapp.loader.auth.LoaderConfigSparkSolrBasicAuthConfigurer"
>  
> "-Djavax.net.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks"
>  "-XX:+UseG1GC" "-XX:+UseStringDeduplication" 
> "-Dthings.loader.export.zzz_files=false" 
> "-Dlog4j.configuration=file:/global/myapp/application/config/spark-executor-log4j.properties"
>  "-XX:+HeapDumpOnOutOfMemoryError" "-XX:+UseStringDeduplication" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "spark://CoarseGrainedScheduler@192.168.0.1:39927" "--executor-id" "2" 
> "--hostname" "192.168.0.1" "--cores" "4" "--app-id" "app-20171208080412-" 
> "--worker-url" "spark://Worker@192.168.0.1:59530"
> {code}
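
(Editorial note, not from the report: a hedged Scala sketch of one way a launcher could mask 
password-bearing -D options before logging a command. The key list and helper below are 
illustrative and are not Spark's actual ExecutorRunner code.)
{code:scala}
// Hypothetical redaction of sensitive system properties in a launch command.
val sensitiveKeys = Seq(
  "spark.ssl.keyStorePassword", "spark.ssl.keyPassword", "spark.ssl.trustStorePassword")

def redact(arg: String): String =
  sensitiveKeys.foldLeft(arg) { (a, key) =>
    a.replaceAll(s"-D$key=\\S+", s"-D$key=*********(redacted)")
  }

val command = Seq("java", "-Dspark.ssl.keyStorePassword=SuperSecretPassword",
  "-Dspark.ssl.enabled=true", "org.apache.spark.executor.CoarseGrainedExecutorBackend")
println("Launch command: " + command.map(redact).map(a => "\"" + a + "\"").mkString(" "))
{code}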



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26975) Support nested-column pruning over limit/sample/repartition

2019-02-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26975:


Assignee: Apache Spark

> Support nested-column pruning over limit/sample/repartition
> ---
>
> Key: SPARK-26975
> URL: https://issues.apache.org/jira/browse/SPARK-26975
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> As the benchmark in SPARK-26958 shows, nested-column pruning has limitations. 
> This issue aims to remove those limitations for `limit/repartition/sample`. In 
> this issue, repartition means `Repartition`, not `RepartitionByExpression`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26975) Support nested-column pruning over limit/sample/repartition

2019-02-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26975:


Assignee: (was: Apache Spark)

> Support nested-column pruning over limit/sample/repartition
> ---
>
> Key: SPARK-26975
> URL: https://issues.apache.org/jira/browse/SPARK-26975
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> As the benchmark in SPARK-26958 shows, nested-column pruning has limitations. 
> This issue aims to remove those limitations for `limit/repartition/sample`. In 
> this issue, repartition means `Repartition`, not `RepartitionByExpression`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26895) When running spark 2.3 as a proxy user (--proxy-user), SparkSubmit fails to resolve globs owned by target user

2019-02-22 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-26895:
--

Assignee: Alessandro Bellina

> When running spark 2.3 as a proxy user (--proxy-user), SparkSubmit fails to 
> resolve globs owned by target user
> --
>
> Key: SPARK-26895
> URL: https://issues.apache.org/jira/browse/SPARK-26895
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Alessandro Bellina
>Assignee: Alessandro Bellina
>Priority: Critical
>
> We are resolving globs in SparkSubmit here (by way of 
> prepareSubmitEnvironment) without first entering a doAs:
> https://github.com/apache/spark/blob/6c18d8d8079ac4d2d6dc7539601ab83fc5b51760/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L143
> unlike the doAs that is entered here:
> [https://github.com/apache/spark/blob/6c18d8d8079ac4d2d6dc7539601ab83fc5b51760/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L151]
> So when running spark-submit with --proxy-user and, for example, --archives, 
> it will fail to launch unless the location of the archive is accessible to the user 
> that executed spark-submit.
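
(Editorial sketch, not the actual patch: what wrapping the glob resolution in the proxy 
user's doAs could look like. `resolveGlobPaths` and `archives` are hypothetical stand-ins 
for the real SparkSubmit helpers.)
{code:scala}
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

// Run the glob expansion as the proxy user, so the target user's permissions are
// the ones checked when the archive locations are listed.
val proxyUser = UserGroupInformation.createProxyUser(
  "targetUser", UserGroupInformation.getCurrentUser)

val resolvedArchives: Seq[String] =
  proxyUser.doAs(new PrivilegedExceptionAction[Seq[String]] {
    override def run(): Seq[String] = resolveGlobPaths(archives) // hypothetical helper
  })
{code}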



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26895) When running spark 2.3 as a proxy user (--proxy-user), SparkSubmit fails to resolve globs owned by target user

2019-02-22 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26895.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23806
[https://github.com/apache/spark/pull/23806]

> When running spark 2.3 as a proxy user (--proxy-user), SparkSubmit fails to 
> resolve globs owned by target user
> --
>
> Key: SPARK-26895
> URL: https://issues.apache.org/jira/browse/SPARK-26895
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Alessandro Bellina
>Assignee: Alessandro Bellina
>Priority: Critical
> Fix For: 3.0.0
>
>
> We are resolving globs in SparkSubmit here (by way of 
> prepareSubmitEnvironment) without first entering a doAs:
> https://github.com/apache/spark/blob/6c18d8d8079ac4d2d6dc7539601ab83fc5b51760/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L143
> unlike the doAs that is entered here:
> [https://github.com/apache/spark/blob/6c18d8d8079ac4d2d6dc7539601ab83fc5b51760/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L151]
> So when running spark-submit with --proxy-user and, for example, --archives, 
> it will fail to launch unless the location of the archive is accessible to the user 
> that executed spark-submit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26972) Issue with CSV import and inferSchema set to true

2019-02-22 Thread Jean Georges Perrin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean Georges Perrin updated SPARK-26972:

Description: 
 

I found a few discrepancies while working with inferSchema set to true in CSV 
ingestion.

Given the following CSV in the attached books.csv:
{noformat}
id;authorId;title;releaseDate;link
1;1;Fantastic Beasts and Where to Find Them: The Original 
Screenplay;11/18/16;http://amzn.to/2kup94P
2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry 
Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP
3;1;*The Tales of Beedle the Bard, Standard Edition (Harry 
Potter)*;12/4/08;http://amzn.to/2kYezqr
4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry 
Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n
5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; 
the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT
6;2;*Development Tools in 2006: any Room for a 4GL-style Language?
An independent study by Jean Georges Perrin, IIUG Board 
Member*;12/28/16;http://amzn.to/2vBxOe1
7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav
8;3;A Connecticut Yankee in King Arthur's Court;6/17/17;http://amzn.to/2x1NuoD
10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA
11;4;Diderot Encyclopedia: The Complete Illustrations 
1762-1777;;http://amzn.to/2i2zo3I
12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ
13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW
14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk
15;7;Soft Skills: The software developer's life 
manual;12/29/14;http://amzn.to/2zNnSyn
16;8;Of Mice and Men;;http://amzn.to/2zJjXoc
17;9;*Java 8 in Action: Lambdas; Streams; and functional-style 
programming*;8/28/14;http://amzn.to/2isdqoL
18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY
19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG
20;14;*Fables choisies; mises en vers par M. de La 
Fontaine*;9/1/1999;http://amzn.to/2yRH10W
21;15;Discourse on Method and Meditations on First 
Philosophy;6/15/1999;http://amzn.to/2hwB8zc
22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo
23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo{noformat}
And this Java code:
{code:java}
Dataset<Row> df = spark.read().format("csv")
 .option("header", "true")
 .option("multiline", true)
 .option("sep", ";")
 .option("quote", "*")
 .option("dateFormat", "M/d/y")
 .option("inferSchema", true)
 .load("data/books.csv");
df.show(7);
df.printSchema();
{code}
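
(Editorial aside, not from the original report: a hedged spark-shell (Scala) sketch of the 
same read with an explicit schema instead of inferSchema, which can help separate 
multiline-parsing problems from schema-inference problems. Column names are taken from the 
CSV header above.)
{code:scala}
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("authorId", IntegerType),
  StructField("title", StringType),
  StructField("releaseDate", StringType),
  StructField("link", StringType)))

val df = spark.read.format("csv")
  .option("header", "true")
  .option("multiline", true)
  .option("sep", ";")
  .option("quote", "*")
  .schema(schema)           // bypass schema inference entirely
  .load("data/books.csv")
df.show(7)
df.printSchema()
{code}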
h1. In Spark v2.0.1

Output: 
{noformat}
+---+--------+--------------------+-----------+--------------------+
| id|authorId|               title|releaseDate|                link|
+---+--------+--------------------+-----------+--------------------+
|  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
|  2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|
|  3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|
|  4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|
|  5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|
|  6|       2|Development Tools...|   12/28/16|http://amzn.to/2v...|
|  7|       3|Adventures of Huc...|    5/26/94|http://amzn.to/2w...|
+---+--------+--------------------+-----------+--------------------+
only showing top 7 rows

Dataframe's schema:
root
|-- id: integer (nullable = true)
|-- authorId: integer (nullable = true)
|-- title: string (nullable = true)
|-- releaseDate: string (nullable = true)
|-- link: string (nullable = true)
{noformat}
*This is fine and the expected output*.
h1. Using Apache Spark v2.1.3

Excerpt of the dataframe content: 
{noformat}
++++---++
| id|authorId| title|releaseDate| link|
++++---++
| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|
| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|
| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|
| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|
| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|
| 6| 2|Development Tools...| null| null|
|An independent st...|12/28/16|http://amzn.to/2v...| null| null|
++++---++
only showing top 7 rows

Dataframe's schema:
root
|-- id: string (nullable = true)
|-- authorId: string (nullable = true)
|-- title: string (nullable = true)
|-- releaseDate: string (nullable = true)
|-- link: string (nullable = true){noformat}
 The *multiline* option is *not recognized*. And, of course, the schema is 
wrong.
h1. Using Apache Spark v2.2.3

Excerpt of the dataframe content: 
{noformat}
+---+++---++
| id|authorId| title|releaseDate| link
|
+---+++---

[jira] [Updated] (SPARK-26972) Issue with CSV import and inferSchema set to true

2019-02-22 Thread Jean Georges Perrin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean Georges Perrin updated SPARK-26972:

Description: 
 

I found a few discrepancies while working with inferSchema set to true in CSV 
ingestion.

Given the following CSV in the attached books.csv:
{noformat}
id;authorId;title;releaseDate;link
1;1;Fantastic Beasts and Where to Find Them: The Original 
Screenplay;11/18/16;http://amzn.to/2kup94P
2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry 
Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP
3;1;*The Tales of Beedle the Bard, Standard Edition (Harry 
Potter)*;12/4/08;http://amzn.to/2kYezqr
4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry 
Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n
5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; 
the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT
6;2;*Development Tools in 2006: any Room for a 4GL-style Language?
An independent study by Jean Georges Perrin, IIUG Board 
Member*;12/28/16;http://amzn.to/2vBxOe1
7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav
8;3;A Connecticut Yankee in King Arthur's Court;6/17/17;http://amzn.to/2x1NuoD
10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA
11;4;Diderot Encyclopedia: The Complete Illustrations 
1762-1777;;http://amzn.to/2i2zo3I
12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ
13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW
14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk
15;7;Soft Skills: The software developer's life 
manual;12/29/14;http://amzn.to/2zNnSyn
16;8;Of Mice and Men;;http://amzn.to/2zJjXoc
17;9;*Java 8 in Action: Lambdas; Streams; and functional-style 
programming*;8/28/14;http://amzn.to/2isdqoL
18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY
19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG
20;14;*Fables choisies; mises en vers par M. de La 
Fontaine*;9/1/1999;http://amzn.to/2yRH10W
21;15;Discourse on Method and Meditations on First 
Philosophy;6/15/1999;http://amzn.to/2hwB8zc
22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo
23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo{noformat}
And this Java code:
{code:java}
Dataset<Row> df = spark.read().format("csv")
 .option("header", "true")
 .option("multiline", true)
 .option("sep", ";")
 .option("quote", "*")
 .option("dateFormat", "M/d/y")
 .option("inferSchema", true)
 .load("data/books.csv");
df.show(7);
df.printSchema();
{code}
h1. In Spark v2.0.1

Output: 
{noformat}
+---+++---++
| id|authorId| title|releaseDate| link|
+---+++---++
| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|
| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|
| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|
| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|
| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|
| 6| 2|Development Tools...| 12/28/16|http://amzn.to/2v...|
| 7| 3|Adventures of Huc...| 5/26/94|http://amzn.to/2w...|
+---+++---++
only showing top 7 rows

Dataframe's schema:
root
|-- id: integer (nullable = true)
|-- authorId: integer (nullable = true)
|-- title: string (nullable = true)
|-- releaseDate: string (nullable = true)
|-- link: string (nullable = true)
{noformat}
*This is fine and the expected output*.
h1. Using Apache Spark v2.1.3

Excerpt of the dataframe content: 
{noformat}
++++---++
| id|authorId| title|releaseDate| link|
++++---++
| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|
| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|
| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|
| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|
| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|
| 6| 2|Development Tools...| null| null|
|An independent st...|12/28/16|http://amzn.to/2v...| null| null|
++++---++
only showing top 7 rows

Dataframe's schema:
root
|-- id: string (nullable = true)
|-- authorId: string (nullable = true)
|-- title: string (nullable = true)
|-- releaseDate: string (nullable = true)
|-- link: string (nullable = true){noformat}
 The *multiline* option is *not recognized*. And, of course, the schema is 
wrong.
h1. Using Apache Spark v2.2.3

Excerpt of the dataframe content: 
{noformat}
+---+++---++
| id|authorId| title|releaseDate| link
|
+---+++---++
| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|
| 2| 1|Harry P

[jira] [Updated] (SPARK-26972) Issue with CSV import and inferSchema set to true

2019-02-22 Thread Jean Georges Perrin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean Georges Perrin updated SPARK-26972:

Description: 
 

 

Issue with CSV import and inferSchema set to true.

I found a few discrepancies while working with inferSchema set to true in CSV 
ingestion.

Given the following CSV in the attached books.csv:
{noformat}
id;authorId;title;releaseDate;link
1;1;Fantastic Beasts and Where to Find Them: The Original 
Screenplay;11/18/16;http://amzn.to/2kup94P
2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry 
Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP
3;1;*The Tales of Beedle the Bard, Standard Edition (Harry 
Potter)*;12/4/08;http://amzn.to/2kYezqr
4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry 
Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n
5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; 
the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT
6;2;*Development Tools in 2006: any Room for a 4GL-style Language?
An independent study by Jean Georges Perrin, IIUG Board 
Member*;12/28/16;http://amzn.to/2vBxOe1
7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav
8;3;A Connecticut Yankee in King Arthur's Court;6/17/17;http://amzn.to/2x1NuoD
10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA
11;4;Diderot Encyclopedia: The Complete Illustrations 
1762-1777;;http://amzn.to/2i2zo3I
12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ
13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW
14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk
15;7;Soft Skills: The software developer's life 
manual;12/29/14;http://amzn.to/2zNnSyn
16;8;Of Mice and Men;;http://amzn.to/2zJjXoc
17;9;*Java 8 in Action: Lambdas; Streams; and functional-style 
programming*;8/28/14;http://amzn.to/2isdqoL
18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY
19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG
20;14;*Fables choisies; mises en vers par M. de La 
Fontaine*;9/1/1999;http://amzn.to/2yRH10W
21;15;Discourse on Method and Meditations on First 
Philosophy;6/15/1999;http://amzn.to/2hwB8zc
22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo
23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo{noformat}
And this Java code:
{code:java}
Dataset<Row> df = spark.read().format("csv")
 .option("header", "true")
 .option("multiline", true)
 .option("sep", ";")
 .option("quote", "*")
 .option("dateFormat", "M/d/y")
 .option("inferSchema", true)
 .load("data/books.csv");
df.show(7);
df.printSchema();
{code}
h1. In Spark v2.0.1

Output:

 
{noformat}
+---+++---++
| id|authorId| title|releaseDate| link|
+---+++---++
| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|
| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|
| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|
| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|
| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|
| 6| 2|Development Tools...| 12/28/16|http://amzn.to/2v...|
| 7| 3|Adventures of Huc...| 5/26/94|http://amzn.to/2w...|
+---+++---++
only showing top 7 rows

Dataframe's schema:
root
|-- id: integer (nullable = true)
|-- authorId: integer (nullable = true)
|-- title: string (nullable = true)
|-- releaseDate: string (nullable = true)
|-- link: string (nullable = true)
{noformat}
 

 

*This is fine and the expected output*.
h1. Using Apache Spark v2.1.3

Excerpt of the dataframe content:

 
{noformat}
++++---++
| id|authorId| title|releaseDate| link|
++++---++
| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|
| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|
| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|
| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|
| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|
| 6| 2|Development Tools...| null| null|
|An independent st...|12/28/16|http://amzn.to/2v...| null| null|
++++---++
only showing top 7 rows

Dataframe's schema:
root
|-- id: string (nullable = true)
|-- authorId: string (nullable = true)
|-- title: string (nullable = true)
|-- releaseDate: string (nullable = true)
|-- link: string (nullable = true){noformat}
 

 

The *multiline* option is *not recognized*. And, of course, the schema is wrong.
h1. Using Apache Spark v2.2.3

Excerpt of the dataframe content:

 
{noformat}
+---+++---++
| id|authorId| title|releaseDate| link
|
+---+++---++
| 1

[jira] [Updated] (SPARK-26972) Issue with CSV import and inferSchema set to true

2019-02-22 Thread Jean Georges Perrin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean Georges Perrin updated SPARK-26972:

Description: 
 

 

Issue with CSV import and inferSchema set to true.

I found a few discrepancies while working with inferSchema set to true in CSV 
ingestion.

Given the following CSV in the attached books.csv:
{noformat}
id;authorId;title;releaseDate;link
1;1;Fantastic Beasts and Where to Find Them: The Original 
Screenplay;11/18/16;http://amzn.to/2kup94P
2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry 
Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP
3;1;*The Tales of Beedle the Bard, Standard Edition (Harry 
Potter)*;12/4/08;http://amzn.to/2kYezqr
4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry 
Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n
5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; 
the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT
6;2;*Development Tools in 2006: any Room for a 4GL-style Language?
An independent study by Jean Georges Perrin, IIUG Board 
Member*;12/28/16;http://amzn.to/2vBxOe1
7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav
8;3;A Connecticut Yankee in King Arthur's Court;6/17/17;http://amzn.to/2x1NuoD
10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA
11;4;Diderot Encyclopedia: The Complete Illustrations 
1762-1777;;http://amzn.to/2i2zo3I
12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ
13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW
14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk
15;7;Soft Skills: The software developer's life 
manual;12/29/14;http://amzn.to/2zNnSyn
16;8;Of Mice and Men;;http://amzn.to/2zJjXoc
17;9;*Java 8 in Action: Lambdas; Streams; and functional-style 
programming*;8/28/14;http://amzn.to/2isdqoL
18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY
19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG
20;14;*Fables choisies; mises en vers par M. de La 
Fontaine*;9/1/1999;http://amzn.to/2yRH10W
21;15;Discourse on Method and Meditations on First 
Philosophy;6/15/1999;http://amzn.to/2hwB8zc
22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo
23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo{noformat}
 

And this Java code:
{code:java}
Dataset<Row> df = spark.read().format("csv")
 .option("header", "true")
 .option("multiline", true)
 .option("sep", ";")
 .option("quote", "*")
 .option("dateFormat", "M/d/y")
 .option("inferSchema", true)
 .load("data/books.csv");
df.show(7);
df.printSchema();
{code}
h1. In Spark v2.0.1
{code:java}
+---+++---++
| id|authorId| title|releaseDate| link|
+---+++---++
| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|
| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|
| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|
| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|
| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|
| 6| 2|Development Tools...| 12/28/16|http://amzn.to/2v...|
| 7| 3|Adventures of Huc...| 5/26/94|http://amzn.to/2w...|
+---+++---++
only showing top 7 rows

Dataframe's schema:
root
|-- id: integer (nullable = true)
|-- authorId: integer (nullable = true)
|-- title: string (nullable = true)
|-- releaseDate: string (nullable = true)
|-- link: string (nullable = true)

{code}
*This is fine and the expected output*.
h1. Using Apache Spark v2.1.3

Excerpt of the dataframe content:

{noformat}
++++---++
| id|authorId| title|releaseDate| link|
++++---++
| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|
| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|
| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|
| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|
| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|
| 6| 2|Development Tools...| null| null|
|An independent st...|12/28/16|http://amzn.to/2v...| null| null|
++++---++
only showing top 7 rows

Dataframe's schema:
root
|-- id: string (nullable = true)
|-- authorId: string (nullable = true)
|-- title: string (nullable = true)
|-- releaseDate: string (nullable = true)
|-- link: string (nullable = true)
{noformat}

The *multiline* option is *not recognized*. And, of course, the schema is wrong.
h1. Using Apache Spark v2.2.3

Excerpt of the dataframe content:

+--+----
| id|authorId| title|releaseDate| link

[jira] [Updated] (SPARK-26972) Issue with CSV import and inferSchema set to true

2019-02-22 Thread Jean Georges Perrin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean Georges Perrin updated SPARK-26972:

Description: 
 

 

Issue with CSV import and inferSchema set to true.

I found a few discrepancies while working with inferSchema set to true in CSV 
ingestion.

Given the following CSV in the attached books.csv:
{noformat}
id;authorId;title;releaseDate;link
1;1;Fantastic Beasts and Where to Find Them: The Original 
Screenplay;11/18/16;http://amzn.to/2kup94P
2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry 
Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP
3;1;*The Tales of Beedle the Bard, Standard Edition (Harry 
Potter)*;12/4/08;http://amzn.to/2kYezqr
4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry 
Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n
5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; 
the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT
6;2;*Development Tools in 2006: any Room for a 4GL-style Language?
An independent study by Jean Georges Perrin, IIUG Board 
Member*;12/28/16;http://amzn.to/2vBxOe1
7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav
8;3;A Connecticut Yankee in King Arthur's Court;6/17/17;http://amzn.to/2x1NuoD
10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA
11;4;Diderot Encyclopedia: The Complete Illustrations 
1762-1777;;http://amzn.to/2i2zo3I
12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ
13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW
14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk
15;7;Soft Skills: The software developer's life 
manual;12/29/14;http://amzn.to/2zNnSyn
16;8;Of Mice and Men;;http://amzn.to/2zJjXoc
17;9;*Java 8 in Action: Lambdas; Streams; and functional-style 
programming*;8/28/14;http://amzn.to/2isdqoL
18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY
19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG
20;14;*Fables choisies; mises en vers par M. de La 
Fontaine*;9/1/1999;http://amzn.to/2yRH10W
21;15;Discourse on Method and Meditations on First 
Philosophy;6/15/1999;http://amzn.to/2hwB8zc
22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo
23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo{noformat}
 

{{id;authorId;title;releaseDate;link}}
 {{1;1;Fantastic Beasts and Where to Find Them: The Original 
Scree}}{{nplay;11/18/16;[http://amzn.to/2kup94P]}}
 {{2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry 
Potter; Book 1)*;10/6/15;[http://amzn.to/2l2lSwP]}}
 {{3;1;The Tales of Beedle the B}}{{ard, Standard Edition (Harry 
Potter);12/4/08;[http://amzn.to/2kYezqr]}}
 {{4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry 
Potter; Book 2)*;10/4/16;[http://amzn.to/2kYhL5n]}}
 {{5;2;Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the 
Apple; the Coffee; and a Great Database;4/23/17;[http://amzn.to/2i3mthT]}}
 {{6;2;*Development Tools in 2006: any Room for a 4GL-style Language?}}

{{An independent study by Jean Georges Perrin, IIUG Board 
Member*;12/28/16;[http://amzn.to/2vBxOe1]}}
 {{7;3;Adventures of Huckleberry Finn;5/26/94;[http://amzn.to/2wOeOav]}}
 {{8;3;A Connecticut Yankee in King Arthur's 
Court;6/17/17;[http://amzn.to/2x1NuoD]}}
 {{10;4;Jacques le Fataliste;3/1/00;[http://amzn.to/2uZj2KA]}}
 {{11;4;Diderot Encyclopedia: The Complete Illustrations 
1762-1777;;[http://amzn.to/2i2zo3I]}}
 {{12;;A Woman in Berlin;7/11/06;[http://amzn.to/2i472WZ]}}
 {{13;6;Spring Boot in Action;1/3/16;[http://amzn.to/2hCPktW]}}
 {{14;6;Spring in Action: Covers Spring 4;11/28/14;[http://amzn.to/2yJLyCk]}}
 {{15;7;Soft Skills: The software developer's life 
manual;12/29/14;[http://amzn.to/2zNnSyn]}}
 {{16;8;Of Mice and Men;;[http://amzn.to/2zJjXoc]}}
 {{17;9;*Java 8 in Action: Lambdas; Streams; and functional-style 
programming*;8/28/14;[http://amzn.to/2isdqoL]}}
 {{18;12;Hamlet;6/8/12;[http://amzn.to/2yRbewY]}}
 {{19;13;Pensées;12/31/1670;[http://amzn.to/2jweHOG]}}
 {{20;14;*Fables choisies; mises en vers par M. de La 
Fontaine*;9/1/1999;[http://amzn.to/2yRH10W]}}
 {{21;15;Discourse on Method and Meditations on First 
Philosophy;6/15/1999;[http://amzn.to/2hwB8zc]}}
 {{22;12;Twelfth Night;7/1/4;[http://amzn.to/2zPYnwo]}}
 {{23;12;Macbeth;7/1/3;[http://amzn.to/2zPYnwo]}}

And this code:

{code:java}
Dataset<Row> df = spark.read().format("csv")
 .option("header", "true")
 .option("multiline", true)
 .option("sep", ";")
 .option("quote", "*")
 .option("dateFormat", "M/d/y")
 .option("inferSchema", true)
 .load("data/books.csv");
df.show(7);
df.printSchema();
{code}
h1. In Spark v2.0.1

Excerpt of the dataframe content:
+-+--+++---+
| id|authorId| title|releaseDate| link|
+-+--+++---+
| 1| 1|Fantastic Beasts ...| 11/18/16

[jira] [Updated] (SPARK-26972) Issue with CSV import and inferSchema set to true

2019-02-22 Thread Jean Georges Perrin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean Georges Perrin updated SPARK-26972:

Description: 
 

 

Issue with CSV import and inferSchema set to true.

I found a few discrepancies while working with inferSchema set to true in CSV 
ingestion.

Given the following CSV:

{{id;authorId;title;releaseDate;link}}
{{1;1;Fantastic Beasts and Where to Find Them: The Original 
Scree}}{{nplay;11/18/16;[http://amzn.to/2kup94P]}}
{{2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry 
Potter; Book 1)*;10/6/15;[http://amzn.to/2l2lSwP]}}
{{3;1;The Tales of Beedle the B}}{{ard, Standard Edition (Harry 
Potter);12/4/08;[http://amzn.to/2kYezqr]}}
{{4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry 
Potter; Book 2)*;10/4/16;[http://amzn.to/2kYhL5n]}}
{{5;2;Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; 
the Coffee; and a Great Database;4/23/17;[http://amzn.to/2i3mthT]}}
{{6;2;*Development Tools in 2006: any Room for a 4GL-style Language?}}

{{An independent study by Jean Georges Perrin, IIUG Board 
Member*;12/28/16;[http://amzn.to/2vBxOe1]}}
{{7;3;Adventures of Huckleberry Finn;5/26/94;[http://amzn.to/2wOeOav]}}
{{8;3;A Connecticut Yankee in King Arthur's 
Court;6/17/17;[http://amzn.to/2x1NuoD]}}
{{10;4;Jacques le Fataliste;3/1/00;[http://amzn.to/2uZj2KA]}}
{{11;4;Diderot Encyclopedia: The Complete Illustrations 
1762-1777;;[http://amzn.to/2i2zo3I]}}
{{12;;A Woman in Berlin;7/11/06;[http://amzn.to/2i472WZ]}}
{{13;6;Spring Boot in Action;1/3/16;[http://amzn.to/2hCPktW]}}
{{14;6;Spring in Action: Covers Spring 4;11/28/14;[http://amzn.to/2yJLyCk]}}
{{15;7;Soft Skills: The software developer's life 
manual;12/29/14;[http://amzn.to/2zNnSyn]}}
{{16;8;Of Mice and Men;;[http://amzn.to/2zJjXoc]}}
{{17;9;*Java 8 in Action: Lambdas; Streams; and functional-style 
programming*;8/28/14;[http://amzn.to/2isdqoL]}}
{{18;12;Hamlet;6/8/12;[http://amzn.to/2yRbewY]}}
{{19;13;Pensées;12/31/1670;[http://amzn.to/2jweHOG]}}
{{20;14;*Fables choisies; mises en vers par M. de La 
Fontaine*;9/1/1999;[http://amzn.to/2yRH10W]}}
{{21;15;Discourse on Method and Meditations on First 
Philosophy;6/15/1999;[http://amzn.to/2hwB8zc]}}
{{22;12;Twelfth Night;7/1/4;[http://amzn.to/2zPYnwo]}}
{{23;12;Macbeth;7/1/3;[http://amzn.to/2zPYnwo]}}

And this code:

{code:java}
Dataset<Row> df = spark.read().format("csv")
 .option("header", "true")
 .option("multiline", true)
 .option("sep", ";")
 .option("quote", "*")
 .option("dateFormat", "M/d/y")
 .option("inferSchema", true)
 .load("data/books.csv");
df.show(7);
df.printSchema();
{code}
h1. In Spark v2.0.1

Excerpt of the dataframe content:
{noformat}
++---++---++
| id|authorId| title|releaseDate| link|
++---++---++
| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|
| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|
| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|
| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|
| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|
| 6| 2|Development Tools...| 12/28/16|http://amzn.to/2v...|
| 7| 3|Adventures of Huc...| 5/26/94|http://amzn.to/2w...|
++---++---++
only showing top 7 rows

Dataframe's schema:
root
|-- id: integer (nullable = true)
|-- authorId: integer (nullable = true)
|-- title: string (nullable = true)
|-- releaseDate: string (nullable = true)
|-- link: string (nullable = true)
{noformat}

*This is fine and the expected output*.
h1. Using Apache Spark v2.1.3

Excerpt of the dataframe content:

{noformat}
+-+---++---++
| id|authorId| title|releaseDate| link|
+-+---++---++
| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|
| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|
| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|
| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|
| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|
| 6| 2|Development Tools...| null| null|
|An independent st...|12/28/16|http://amzn.to/2v...| null| null|
+-+---++---++
only showing top 7 rows

Dataframe's schema:
root
|-- id: string (nullable = true)
|-- authorId: string (nullable = true)
|-- title: string (nullable = true)
|-- releaseDate: string (nullable = true)
|-- link: string (nullable = true)
{noformat}

The *multiline* 

[jira] [Commented] (SPARK-26973) Kubernetes version support strategy on test nodes / backend

2019-02-22 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775435#comment-16775435
 ] 

Erik Erlandson commented on SPARK-26973:


A couple other points:
 * Currently, k8s is evolving in a manner where breakage of existing 
functionality is low probability, and so testing against the earliest version 
we wish to support is probably optimal in a scenario where we are choosing one 
version to test against. (This heuristic might change in the future, for 
example if k8s goes to a 2.x series where backward compatibility may be broken)
 * The integration testing was designed to support running against external 
clusters (GCP, etc) - this might provide an approach to supporting testing 
against multiple k8s versions. However, it would come with additional op-ex 
costs and decreased control over the environment. I mention it mostly because 
it's a plausible path to outsourcing some of the combinatorics that 
[~shaneknapp] discussed above

> Kubernetes version support strategy on test nodes / backend
> ---
>
> Key: SPARK-26973
> URL: https://issues.apache.org/jira/browse/SPARK-26973
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Kubernetes has a policy for supporting three minor releases and the current 
> ones are defined here: 
> [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]
> Moving from release 1.x to 1.(x+1) happens roughly every 100 
> days:[https://gravitational.com/blog/kubernetes-release-cycle.]
> This has an effect on dependencies upgrade at the Spark on K8s backend and 
> the version of Minikube required to be supported for testing. One other issue 
> is what the users actually want at the given time of a release. Some popular 
> vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap 
> for releases and may not catch up fast (what is our view on this).
> Follow the comments for a recent discussion on the topic: 
> [https://github.com/apache/spark/pull/23814.]
> Clearly we need a strategy for this.
> A couple of options for the current state of things:
> a) Support only the last two versions, but that leaves out a version that 
> still receives patches.
> b) Support only the latest, which makes testing easier, but leaves out other 
> currently maintained version.
> A good strategy will optimize at least the following:
> 1) percentage of users satisfied at release time.
> 2) how long it takes to support the latest K8s version
> 3) testing requirements, e.g. minikube versions used
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26974) Invalid data in grouped cached dataset, formed by joining a large cached dataset with a small dataset

2019-02-22 Thread Utkarsh Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Utkarsh Sharma updated SPARK-26974:
---
Description: 
The initial datasets are derived from hive tables using the spark.table() 
functions.

Dataset descriptions:

*+Sales+* dataset (close to 10 billion rows) with the following columns (and 
sample rows) : 
||ItemId (bigint)||CustomerId (bigint)||qty_sold (bigint)||
|1|1|20|
|1|2|30|
|2|1|40|

 

+*Customer*+ Dataset (close to 5 rows) with the following columns (and 
sample rows):
||CustomerId (bigint)||CustomerGrpNbr (smallint)||
|1|1|
|2|2|
|3|1|

 

I am doing the following steps:
 # Caching sales dataset with close to 10 billion rows.
 # Doing an inner join of 'sales' with 'customer' dataset
 
 # Doing group by on the resultant dataset, based on CustomerGrpNbr column to 
get sum(qty_sold) and stddev(qty_sold) values in the customer groups.
 # Caching the resultant grouped dataset.
 # Doing a .count() on the grouped dataset.

The step 5 count is supposed to return only 20, because when you do a 
customer.select("CustomerGroupNbr").distinct().count you get 20 values. 
However, you get a value of around 65,000 in step 5.

Following are the commands I am running in spark-shell:
{code:java}
var sales = spark.table("sales_table")
var customer = spark.table("customer_table")
var finalDf = sales.join(customer, 
"CustomerId").groupBy("CustomerGrpNbr").agg(sum("qty_sold"), stddev("qty_sold"))
sales.cache()
finalDf.cache()
finalDf.count() // returns around 65k rows and the count keeps on varying each 
run
customer.select("CustomerGrpNbr").distinct().count() //returns 20{code}
I have been able to replicate the same behavior using the Java API as well. 
This anomalous behavior disappears, however, when I remove the caching 
statements, i.e. if I run the following in spark-shell, it works as expected:
{code:java}
var sales = spark.table("sales_table")
var customer = spark.table("customer_table")
var finalDf = sales.join(customer, 
"CustomerId").groupBy("CustomerGrpNbr").agg(sum("qty_sold"), 
stddev("qty_sold")) 
finalDf.count() // returns 20 
customer.select("CustomerGrpNbr").distinct().count() //returns 20
{code}
The tables in hive from which the datasets are built do not change during this 
entire process. So why does the caching cause this problem?

  was:
The initial datasets are derived from hive tables using the spark.table() 
functions.

Dataset descriptions:

*+Sales+* dataset (close to 10 billion rows) with the following columns (and 
sample rows) : 
||ItemId (bigint)||CustomerId (bigint)||qty_sold (bigint)||
|1|1|20|
|1|2|30|
|2|1|40|

 

+*Customer*+ Dataset (close to 5 rows) with the following columns (and 
sample rows):
||CustomerId (bigint)||CustomerGrpNbr (smallint)||
|1|1|
|2|2|
|3|1|

 

I am doing the following steps:
 # Caching sales dataset with close to 10 billion rows.
 # Doing an inner join of 'sales' with 'customer' dataset
 
 # Doing group by on the resultant dataset, based on CustomerGrpNbr column to 
get sum(qty_sold) and stddev(qty_sold) values in the customer groups.
 # Caching the resultant grouped dataset.
 # Doing a .count() on the grouped dataset.

The step 5 count is supposed to return only 20, because when you do a 
customer.select("CustomerGroupNbr").distinct().count you get 20 values. 
However, you get a value of around 65,000 in step 5.

Following are the commands I am running in spark-shell:
{code:java}
var sales = spark.table("sales_table")
var customer = spark.table("customer_table")
var finalDf = sales.join(customer, 
"CustomerId").groupBy("CustomerGrpNbr").agg(sum("qty_sold"), stddev("qty_sold"))
sales.cache()
finalDf.cache()
finalDf.count() // returns around 65k rows and the count keeps on varying each 
// run
customer.select("CustomerGrpNbr").distinct().count() //returns 20{code}
I have been able to replicate the same behavior using the Java API as well. 
This anomalous behavior disappears, however, when I remove the caching 
statements, i.e. if I run the following in spark-shell, it works as expected:
{code:java}
var sales = spark.table("sales_table")
var customer = spark.table("customer_table")
var finalDf = sales.join(customer, 
"CustomerId").groupBy("CustomerGrpNbr").agg(sum("qty_sold"), 
stddev("qty_sold")) 
finalDf.count() // returns 20 
customer.select("CustomerGrpNbr").distinct().count() //returns 20
{code}
The tables in hive from which the datasets are built do not change during this 
entire process. So why does the caching cause this problem?


> Invalid data in grouped cached dataset, formed by joining a large cached 
> dataset with a small dataset
> -
>
> Key: SPARK-26974
> URL: https://issues.apache.org/jira/browse/SPARK-26974
> Project: Spark
>  Issue Type: Bug
>  Components: Java API

[jira] [Created] (SPARK-26974) Invalid data in grouped cached dataset, formed by joining a large cached dataset with a small dataset

2019-02-22 Thread Utkarsh Sharma (JIRA)
Utkarsh Sharma created SPARK-26974:
--

 Summary: Invalid data in grouped cached dataset, formed by joining 
a large cached dataset with a small dataset
 Key: SPARK-26974
 URL: https://issues.apache.org/jira/browse/SPARK-26974
 Project: Spark
  Issue Type: Bug
  Components: Java API, Spark Core, SQL
Affects Versions: 2.2.0
Reporter: Utkarsh Sharma


The initial datasets are derived from hive tables using the spark.table() 
functions.

Dataset descriptions:

*+Sales+* dataset (close to 10 billion rows) with the following columns (and 
sample rows) : 
||ItemId (bigint)||CustomerId (bigint)||qty_sold (bigint)||
|1|1|20|
|1|2|30|
|2|1|40|

 

+*Customer*+ Dataset (close to 5 rows) with the following columns (and 
sample rows):
||CustomerId (bigint)||CustomerGrpNbr (smallint)||
|1|1|
|2|2|
|3|1|

 

I am doing the following steps:
 # Caching sales dataset with close to 10 billion rows.
 # Doing an inner join of 'sales' with 'customer' dataset
 
 # Doing group by on the resultant dataset, based on CustomerGrpNbr column to 
get sum(qty_sold) and stddev(qty_sold) values in the customer groups.
 # Caching the resultant grouped dataset.
 # Doing a .count() on the grouped dataset.

The step 5 count is supposed to return only 20, because when you do a 
customer.select("CustomerGroupNbr").distinct().count you get 20 values. 
However, you get a value of around 65,000 in step 5.

Following are the commands I am running in spark-shell:
{code:java}
var sales = spark.table("sales_table")
var customer = spark.table("customer_table")
var finalDf = sales.join(customer, 
"CustomerId").groupBy("CustomerGrpNbr").agg(sum("qty_sold"), stddev("qty_sold"))
sales.cache()
finalDf.cache()
finalDf.count() // returns around 65k rows and the count keeps on varying each 
// run
customer.select("CustomerGrpNbr").distinct().count() //returns 20{code}
I have been able to replicate the same behavior using the Java API as well. 
This anomalous behavior disappears, however, when I remove the caching 
statements, i.e. if I run the following in spark-shell, it works as expected:
{code:java}
var sales = spark.table("sales_table")
var customer = spark.table("customer_table")
var finalDf = sales.join(customer, 
"CustomerId").groupBy("CustomerGrpNbr").agg(sum("qty_sold"), 
stddev("qty_sold")) 
finalDf.count() // returns 20 
customer.select("CustomerGrpNbr").distinct().count() //returns 20
{code}
The tables in hive from which the datasets are built do not change during this 
entire process. So why does the caching cause this problem?
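
(Editorial debugging sketch, not from the report: one way to check whether the varying 
count is tied to the cached data/plan rather than the underlying Hive tables, assuming the 
same session and tables as above.)
{code:scala}
// Drop everything from the cache and recount; if the result stabilizes at 20,
// the inconsistency only appears when the join/aggregate is served from cache.
finalDf.unpersist()
sales.unpersist()
spark.catalog.clearCache()

val recount = finalDf.count()
println(s"group count without cache: $recount")   // expected: 20
{code}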



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26973) Kubernetes version support strategy on test nodes / backend

2019-02-22 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775394#comment-16775394
 ] 

shane knapp commented on SPARK-26973:
-

i was chatting over email w/[~eje] about this yesterday, and the TL;DR is:  
only one version to test against, please!

here are some bullet points, in no particular order, to summarize what [~eje] 
and i discussed:
* we can easily test against any version of k8s via the 
{{--kubernetes-version}} flag passed to {{minikube start}}, so testing against 
N versions shouldn't be hard. 
* there is a moving range of k8s versions that a specific minikube release can 
support (ie:  minikube v.0.23.0 only supports up to k8s 1.13.1).  
* we are limited to *one* k8s/minikube build per node at any time, so adding 
tests for more than one k8s version to the suite will definitely increase 
resource contention.  currently spark is the only minikube consumer, but some 
upcoming lab projects will need their own k8s integration tests.
* the operational overhead of managing minikube, k8s and all of the VM-layer 
drivers is highly non-trivial.



> Kubernetes version support strategy on test nodes / backend
> ---
>
> Key: SPARK-26973
> URL: https://issues.apache.org/jira/browse/SPARK-26973
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Kubernetes has a policy for supporting three minor releases and the current 
> ones are defined here: 
> [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]
> Moving from release 1.x to 1.(x+1) happens roughly every 100 
> days:[https://gravitational.com/blog/kubernetes-release-cycle.]
> This has an effect on dependencies upgrade at the Spark on K8s backend and 
> the version of Minikube required to be supported for testing. One other issue 
> is what the users actually want at the given time of a release. Some popular 
> vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap 
> for releases and may not catch up fast (what is our view on this).
> Follow the comments for a recent discussion on the topic: 
> [https://github.com/apache/spark/pull/23814.]
> Clearly we need a strategy for this.
> A couple of options for the current state of things:
> a) Support only the last two versions, but that leaves out a version that 
> still receives patches.
> b) Support only the latest, which makes testing easier, but leaves out other 
> currently maintained version.
> A good strategy will optimize at least the following:
> 1) percentage of users satisfied at release time.
> 2) how long it takes to support the latest K8s version
> 3) testing requirements, e.g. minikube versions used
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26974) Invalid data in grouped cached dataset, formed by joining a large cached dataset with a small dataset

2019-02-22 Thread Utkarsh Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Utkarsh Sharma updated SPARK-26974:
---
Description: 
The initial datasets are derived from hive tables using the spark.table() 
functions.

Dataset descriptions:

*+Sales+* dataset (close to 10 billion rows) with the following columns (and 
sample rows) : 
||ItemId (bigint)||CustomerId (bigint)||qty_sold (bigint)||
|1|1|20|
|1|2|30|
|2|1|40|

 

+*Customer*+ Dataset (close to 5 rows) with the following columns (and 
sample rows):
||CustomerId (bigint)||CustomerGrpNbr (smallint)||
|1|1|
|2|2|
|3|1|

 

I am doing the following steps:
 # Caching sales dataset with close to 10 billion rows.
 # Doing an inner join of 'sales' with 'customer' dataset
 
 # Doing group by on the resultant dataset, based on CustomerGrpNbr column to 
get sum(qty_sold) and stddev(qty_sold) values in the customer groups.
 # Caching the resultant grouped dataset.
 # Doing a .count() on the grouped dataset.

The step 5 count is supposed to return only 20, because when you do a 
customer.select("CustomerGroupNbr").distinct().count you get 20 values. 
However, you get a value of around 65,000 in step 5.

Following are the commands I am running in spark-shell:
{code:java}
var sales = spark.table("sales_table")
var customer = spark.table("customer_table")
var finalDf = sales.join(customer, 
"CustomerId").groupBy("CustomerGrpNbr").agg(sum("qty_sold"), stddev("qty_sold"))
sales.cache()
finalDf.cache()
finalDf.count() // returns around 65k rows and the count keeps on varying each 
// run
customer.select("CustomerGrpNbr").distinct().count() //returns 20{code}
I have been able to replicate the same behavior using the Java API as well. 
This anomalous behavior disappears, however, when I remove the caching 
statements, i.e. if I run the following in spark-shell, it works as expected:
{code:java}
var sales = spark.table("sales_table")
var customer = spark.table("customer_table")
var finalDf = sales.join(customer, 
"CustomerId").groupBy("CustomerGrpNbr").agg(sum("qty_sold"), 
stddev("qty_sold")) 
finalDf.count() // returns 20 
customer.select("CustomerGrpNbr").distinct().count() //returns 20
{code}
The tables in hive from which the datasets are built do not change during this 
entire process. So why does the caching cause this problem?

  was:
The initial datasets are derived from hive tables using the spark.table() 
functions.

Dataset descriptions:

*+Sales+* dataset (close to 10 billion rows) with the following columns (and 
sample rows) : 
||ItemId (bigint)||CustomerId (bigint)||qty_sold (bigint)||
|1|1|20|
|1|2|30|
|2|1|40|

 

+*Customer*+ Dataset (close to 5 rows) with the following columns (and 
sample rows):
||CustomerId (bigint)||CustomerGrpNbr (smallint)||
|1|1|
|2|2|
|3|1|

 

I am doing the following steps:
 # Caching sales dataset with close to 10 billion rows.
 # Doing an inner join of 'sales' with 'customer' dataset
 
 # Doing group by on the resultant dataset, based on CustomerGrpNbr column to 
get sum(qty_sold) and stddev(qty_sold) values in the customer groups.
 # Caching the resultant grouped dataset.
 # Doing a .count() on the grouped dataset.

The step 5 count is supposed to return only 20, because when you do a 
customer.select("CustomerGroupNbr").distinct().count you get 20 values. 
However, you get a value of around 65,000 in step 5.

Following are the commands I am running in spark-shell:
{code:java}
var sales = spark.table("sales_table")
var customer = spark.table("customer_table")
var finalDf = sales.join(customer, 
"CustomerId").groupBy("CustomerGrpNbr").agg(sum("qty_sold"), stddev("qty_sold"))
sales.cache()
finalDf.cache()
finalDf.count() // returns around 65k rows and the count keeps on varying each 
// run
customer.select("CustomerGrpNbr").distinct().count() //returns 20{code}
I have been able to replicate the same behavior using the Java API as well. 
This anomalous behavior disappears, however, when I remove the caching 
statements, i.e. if I run the following in spark-shell, it works as expected:
{code:java}
var sales = spark.table("sales_table")
var customer = spark.table("customer_table")
var finalDf = sales.join(customer, 
"CustomerId").groupBy("CustomerGrpNbr").agg(sum("qty_sold"), 
stddev("qty_sold")) 
finalDf.count() // returns 20 
customer.select("CustomerGrpNbr").distinct().count() //returns 20
{code}
The tables in hive from which the datasets are built do not change during this 
entire process. So why does the caching cause this problem?


> Invalid data in grouped cached dataset, formed by joining a large cached 
> dataset with a small dataset
> -
>
> Key: SPARK-26974
> URL: https://issues.apache.org/jira/browse/SPARK-26974
> Project: Spark
>  Issue Type: Bug
>  Components: Java 

[jira] [Commented] (SPARK-26973) Kubernetes version support strategy on test nodes / backend

2019-02-22 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775355#comment-16775355
 ] 

Sean Owen commented on SPARK-26973:
---

I think I'd suggest testing against one version or else this could get 
complicated fast. The latest version we support is a good place to start. How 
about that until something tells us we miss too many big problems without more 
tests?

> Kubernetes version support strategy on test nodes / backend
> ---
>
> Key: SPARK-26973
> URL: https://issues.apache.org/jira/browse/SPARK-26973
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Kubernetes has a policy for supporting three minor releases and the current 
> ones are defined here: 
> [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]
> Moving from release 1.x to 1.(x+1) happens roughly every 100 
> days:[https://gravitational.com/blog/kubernetes-release-cycle.]
> This has an effect on dependencies upgrade at the Spark on K8s backend and 
> the version of Minikube required to be supported for testing. One other issue 
> is what the users actually want at the given time of a release. Some popular 
> vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap 
> for releases and may not catch up fast (what is our view on this).
> Follow the comments for a recent discussion on the topic: 
> [https://github.com/apache/spark/pull/23814.]
> Clearly we need a strategy for this.
> A couple of options for the current state of things:
> a) Support only the last two versions, but that leaves out a version that 
> still receives patches.
> b) Support only the latest, which makes testing easier, but leaves out other 
> currently maintained version.
> A good strategy will optimize at least the following:
> 1) percentage of users satisfied at release time.
> 2) how long it takes to support the latest K8s version
> 3) testing requirements, e.g. minikube versions used
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26973) Kubernetes version support strategy on test nodes / backend

2019-02-22 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-26973:

Description: 
Kubernetes has a policy for supporting three minor releases and the current 
ones are defined here: 
[https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]

Moving from release 1.x to 1.(x+1) happens roughly every 100 
days:[https://gravitational.com/blog/kubernetes-release-cycle.]

This has an effect on dependencies upgrade at the Spark on K8s backend and the 
version of Minikube required to be supported for testing. One other issue is 
what the users actually want at the given time of a release. Some popular 
vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap for 
releases and may not catch up fast (what is our view on this).

Follow the comments for a recent discussion on the topic: 
[https://github.com/apache/spark/pull/23814.]

Clearly we need a strategy for this.

A couple of options for the current state of things:

a) Support only the last two versions, but that leaves out a version that still 
receives patches.

b) Support only the latest, which makes testing easier, but leaves out other 
currently maintained version.

A good strategy will optimize at least the following:

1) percentage of users satisfied at release time.

2) how long it takes to support the latest K8s version

3) testing requirements, e.g. minikube versions used

 

  was:
Kubernetes has a policy for supporting three minor releases and the current 
ones are defined here: 
[https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]

Moving from release 1.x to 1.(x+1) happens roughly every 100 
days:[https://gravitational.com/blog/kubernetes-release-cycle.]

This has an effect on dependencies upgrade at the Spark on K8s backend and the 
version of Minikube required to be supported for testing. One other issue is 
what the users actually want at the given time of a release. Some popular 
vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap for 
releases and may not catch up fast (what is our view for this).

Follow the comments for a recent discussion on the topic: 
[https://github.com/apache/spark/pull/23814.]

Clearly we need a strategy for this.

A couple of options for the current state of things:

a) Support only the last two versions, but that leaves out a version that still 
receives patches.

b) Support only the latest, which makes testing easier, but leaves out other 
currently maintained version.

A good strategy will optimize at least the following:

1) percentage of users satisfied at release time.

2) how long it takes to support the latest K8s version

3) testing requirements eg. minikube versions used

 


> Kubernetes version support strategy on test nodes / backend
> ---
>
> Key: SPARK-26973
> URL: https://issues.apache.org/jira/browse/SPARK-26973
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Kubernetes has a policy of supporting three minor releases, and the currently 
> supported ones are listed here: 
> [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]
> Moving from release 1.x to 1.(x+1) happens roughly every 100 days: 
> [https://gravitational.com/blog/kubernetes-release-cycle]
> This affects dependency upgrades in the Spark on K8s backend and the Minikube 
> version that has to be supported for testing. Another issue is what users 
> actually want at the time of a release. Some popular vendors such as EKS 
> ([https://aws.amazon.com/eks/faqs/]) have their own release roadmap and may not 
> catch up quickly (what is our view on this?).
> Follow the comments in a recent discussion on the topic: 
> [https://github.com/apache/spark/pull/23814]
> Clearly we need a strategy for this.
> A couple of options for the current state of things:
> a) Support only the last two versions, but that leaves out a version that 
> still receives patches.
> b) Support only the latest, which makes testing easier, but leaves out other 
> currently maintained versions.
> A good strategy will optimize at least the following:
> 1) the percentage of users satisfied at release time,
> 2) how long it takes to support the latest K8s version, and
> 3) the testing requirements, e.g. the Minikube versions used.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-

[jira] [Updated] (SPARK-26973) Kubernetes version support strategy on test nodes / backend

2019-02-22 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-26973:

Description: 
Kubernetes has a policy of supporting three minor releases, and the currently 
supported ones are listed here: 
[https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]

Moving from release 1.x to 1.(x+1) happens roughly every 100 days: 
[https://gravitational.com/blog/kubernetes-release-cycle]

This affects dependency upgrades in the Spark on K8s backend and the Minikube 
version that has to be supported for testing. Another issue is what users 
actually want at the time of a release. Some popular vendors such as EKS 
([https://aws.amazon.com/eks/faqs/]) have their own release roadmap and may not 
catch up quickly (what is our view on this?).

Follow the comments in a recent discussion on the topic: 
[https://github.com/apache/spark/pull/23814]

Clearly we need a strategy for this.

A couple of options for the current state of things:

a) Support only the last two versions, but that leaves out a version that still 
receives patches.

b) Support only the latest, which makes testing easier, but leaves out other 
currently maintained versions.

A good strategy will optimize at least the following:

1) the percentage of users satisfied at release time,

2) how long it takes to support the latest K8s version, and

3) the testing requirements, e.g. the Minikube versions used.

 

  was:
Kubernetes has a policy for supporting three minor releases and the current 
ones are defined here: 
[https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]

Moving from release 1.x to 1.(x+1) happens roughly every 100 
days:[https://gravitational.com/blog/kubernetes-release-cycle.]

This has an effect on dependencies upgrade at the Spark on K8s backend and the 
version of Minikube required to be supported for testing. One other issue is 
what the users actually want at the given time of a release. Some popular 
vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap for 
releases and may not catch up fast.

Follow the comments for a recent discussion on the topic: 
[https://github.com/apache/spark/pull/23814.]

Clearly we need a strategy for this.

A couple of options for the current state of things:

a) Support only the last two versions, but that leaves out a version that still 
receives patches.

b) Support only the latest, which makes testing easier, but leaves out other 
currently maintained version.

A good strategy will optimize at least the following:

1) percentage of users satisfied at release time.

2) how long it takes to support the latest K8s version

3) testing requirements eg. minikube versions used

 


> Kubernetes version support strategy on test nodes / backend
> ---
>
> Key: SPARK-26973
> URL: https://issues.apache.org/jira/browse/SPARK-26973
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Kubernetes has a policy of supporting three minor releases, and the currently 
> supported ones are listed here: 
> [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]
> Moving from release 1.x to 1.(x+1) happens roughly every 100 days: 
> [https://gravitational.com/blog/kubernetes-release-cycle]
> This affects dependency upgrades in the Spark on K8s backend and the Minikube 
> version that has to be supported for testing. Another issue is what users 
> actually want at the time of a release. Some popular vendors such as EKS 
> ([https://aws.amazon.com/eks/faqs/]) have their own release roadmap and may not 
> catch up quickly (what is our view on this?).
> Follow the comments in a recent discussion on the topic: 
> [https://github.com/apache/spark/pull/23814]
> Clearly we need a strategy for this.
> A couple of options for the current state of things:
> a) Support only the last two versions, but that leaves out a version that 
> still receives patches.
> b) Support only the latest, which makes testing easier, but leaves out other 
> currently maintained versions.
> A good strategy will optimize at least the following:
> 1) the percentage of users satisfied at release time,
> 2) how long it takes to support the latest K8s version, and
> 3) the testing requirements, e.g. the Minikube versions used.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org

[jira] [Updated] (SPARK-26973) Kubernetes version support strategy on test nodes / backend

2019-02-22 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-26973:

Description: 
Kubernetes has a policy of supporting three minor releases, and the currently 
supported ones are listed here: 
[https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]

Moving from release 1.x to 1.(x+1) happens roughly every 100 days: 
[https://gravitational.com/blog/kubernetes-release-cycle]

This affects dependency upgrades in the Spark on K8s backend and the Minikube 
version that has to be supported for testing. Another issue is what users 
actually want at the time of a release. Some popular vendors such as EKS 
([https://aws.amazon.com/eks/faqs/]) have their own release roadmap and may not 
catch up quickly.

Follow the comments in a recent discussion on the topic: 
[https://github.com/apache/spark/pull/23814]

Clearly we need a strategy for this.

A couple of options for the current state of things:

a) Support only the last two versions, but that leaves out a version that still 
receives patches.

b) Support only the latest, which makes testing easier, but leaves out other 
currently maintained versions.

A good strategy will optimize the following:

1) the percentage of users satisfied at release time,

2) how long it takes to support the latest K8s version, and

3) the testing requirements, e.g. the Minikube versions used.

 

  was:
Kubernetes has a policy for supporting three minor releases and the current 
ones are defined here: 
[https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]

Moving from release 1.x to 1.(x+1) happens roughly every 100 
days:[https://gravitational.com/blog/kubernetes-release-cycle.]

This has an effect on dependencies upgrade at the Spark on K8s backend and the 
version of Minikube required to be supported for testing. One other issue is 
what the users actually want at the given time of a release. Some popular 
vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap for 
releases and may not catch up fast.

Follow the comments a recent discussion on the topic: 
[https://github.com/apache/spark/pull/23814.]

Clearly we need a strategy for this.

A couple of options for the current state of things:

a) Support only the last two version, but that leaves out a version that still 
receives patches.

b) Support only the latest, which makes testing easier, but leaves out other 
currently maintained version.

A good strategy will optimize the following:

1) percentage of users satisfied at release time.

2) how long it takes to support the latest K8s version

3) testing requirements eg. minikube versions used

 


> Kubernetes version support strategy on test nodes / backend
> ---
>
> Key: SPARK-26973
> URL: https://issues.apache.org/jira/browse/SPARK-26973
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Kubernetes has a policy of supporting three minor releases, and the currently 
> supported ones are listed here: 
> [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]
> Moving from release 1.x to 1.(x+1) happens roughly every 100 days: 
> [https://gravitational.com/blog/kubernetes-release-cycle]
> This affects dependency upgrades in the Spark on K8s backend and the Minikube 
> version that has to be supported for testing. Another issue is what users 
> actually want at the time of a release. Some popular vendors such as EKS 
> ([https://aws.amazon.com/eks/faqs/]) have their own release roadmap and may not 
> catch up quickly.
> Follow the comments in a recent discussion on the topic: 
> [https://github.com/apache/spark/pull/23814]
> Clearly we need a strategy for this.
> A couple of options for the current state of things:
> a) Support only the last two versions, but that leaves out a version that 
> still receives patches.
> b) Support only the latest, which makes testing easier, but leaves out other 
> currently maintained versions.
> A good strategy will optimize the following:
> 1) the percentage of users satisfied at release time,
> 2) how long it takes to support the latest K8s version, and
> 3) the testing requirements, e.g. the Minikube versions used.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-26973) Kubernetes version support strategy on test nodes / backend

2019-02-22 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-26973:

Description: 
Kubernetes has a policy of supporting three minor releases, and the currently 
supported ones are listed here: 
[https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]

Moving from release 1.x to 1.(x+1) happens roughly every 100 days: 
[https://gravitational.com/blog/kubernetes-release-cycle]

This affects dependency upgrades in the Spark on K8s backend and the Minikube 
version that has to be supported for testing. Another issue is what users 
actually want at the time of a release. Some popular vendors such as EKS 
([https://aws.amazon.com/eks/faqs/]) have their own release roadmap and may not 
catch up quickly.

Follow the comments in a recent discussion on the topic: 
[https://github.com/apache/spark/pull/23814]

Clearly we need a strategy for this.

A couple of options for the current state of things:

a) Support only the last two versions, but that leaves out a version that still 
receives patches.

b) Support only the latest, which makes testing easier, but leaves out other 
currently maintained versions.

A good strategy will optimize at least the following:

1) the percentage of users satisfied at release time,

2) how long it takes to support the latest K8s version, and

3) the testing requirements, e.g. the Minikube versions used.

 

  was:
Kubernetes has a policy for supporting three minor releases and the current 
ones are defined here: 
[https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]

Moving from release 1.x to 1.(x+1) happens roughly every 100 
days:[https://gravitational.com/blog/kubernetes-release-cycle.]

This has an effect on dependencies upgrade at the Spark on K8s backend and the 
version of Minikube required to be supported for testing. One other issue is 
what the users actually want at the given time of a release. Some popular 
vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap for 
releases and may not catch up fast.

Follow the comments for a recent discussion on the topic: 
[https://github.com/apache/spark/pull/23814.]

Clearly we need a strategy for this.

A couple of options for the current state of things:

a) Support only the last two versions, but that leaves out a version that still 
receives patches.

b) Support only the latest, which makes testing easier, but leaves out other 
currently maintained version.

A good strategy will optimize the following:

1) percentage of users satisfied at release time.

2) how long it takes to support the latest K8s version

3) testing requirements eg. minikube versions used

 


> Kubernetes version support strategy on test nodes / backend
> ---
>
> Key: SPARK-26973
> URL: https://issues.apache.org/jira/browse/SPARK-26973
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Kubernetes has a policy of supporting three minor releases, and the currently 
> supported ones are listed here: 
> [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]
> Moving from release 1.x to 1.(x+1) happens roughly every 100 days: 
> [https://gravitational.com/blog/kubernetes-release-cycle]
> This affects dependency upgrades in the Spark on K8s backend and the Minikube 
> version that has to be supported for testing. Another issue is what users 
> actually want at the time of a release. Some popular vendors such as EKS 
> ([https://aws.amazon.com/eks/faqs/]) have their own release roadmap and may not 
> catch up quickly.
> Follow the comments in a recent discussion on the topic: 
> [https://github.com/apache/spark/pull/23814]
> Clearly we need a strategy for this.
> A couple of options for the current state of things:
> a) Support only the last two versions, but that leaves out a version that 
> still receives patches.
> b) Support only the latest, which makes testing easier, but leaves out other 
> currently maintained versions.
> A good strategy will optimize at least the following:
> 1) the percentage of users satisfied at release time,
> 2) how long it takes to support the latest K8s version, and
> 3) the testing requirements, e.g. the Minikube versions used.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-26973) Kubernetes version support strategy on test nodes / backend

2019-02-22 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-26973:

Description: 
Kubernetes has a policy of supporting three minor releases, and the currently 
supported ones are listed here: 
[https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]

Moving from release 1.x to 1.(x+1) happens roughly every 100 days: 
[https://gravitational.com/blog/kubernetes-release-cycle]

This affects dependency upgrades in the Spark on K8s backend and the Minikube 
version that has to be supported for testing. Another issue is what users 
actually want at the time of a release. Some popular vendors such as EKS 
([https://aws.amazon.com/eks/faqs/]) have their own release roadmap.

Follow the comments in a recent discussion on the topic: 
[https://github.com/apache/spark/pull/23814]

Clearly we need a strategy for this.

A couple of options for the current state of things:

a) Support only the last two versions, but that leaves out a version that still 
receives patches.

b) Support only the latest, which makes testing easier, but leaves out other 
currently maintained versions.

A good strategy will optimize the following:

1) the percentage of users satisfied at release time,

2) how long it takes to support the latest K8s version, and

3) the testing requirements, e.g. the Minikube versions used.

 

  was:
Kubernetes has a policy for supporting three minor releases and the current are 
defined here: 
[https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]

Moving from release 1.x to 1.(x+1) happens roughly every 100 
days:[https://gravitational.com/blog/kubernetes-release-cycle.]

This has an effect on dependencies upgrade at the Spark on K8s backend and the 
version of Minikube required to be supported for testing. One other issue is 
what the users actually want at the given time of a release. Some popular 
vendors like EKS([https://aws.amazon.com/eks/faqs/] have their own roadmap for 
releases.

Follow the comments a recent discussion on the topic: 
[https://github.com/apache/spark/pull/23814.]

Clearly we need a strategy for this.

A couple of options for the current state of things:

a) Support only the last two version, but that leaves out a version that still 
receives patches.

b) Support only the latest, which makes testing easier, but leaves out other 
currently maintained version.

A good strategy will optimize the following:

1) percentage of users satisfied at release time.

2) how long it takes to support the latest K8s version

3) testing requirements eg. minikube versions used

 


> Kubernetes version support strategy on test nodes / backend
> ---
>
> Key: SPARK-26973
> URL: https://issues.apache.org/jira/browse/SPARK-26973
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Kubernetes has a policy of supporting three minor releases, and the currently 
> supported ones are listed here: 
> [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]
> Moving from release 1.x to 1.(x+1) happens roughly every 100 days: 
> [https://gravitational.com/blog/kubernetes-release-cycle]
> This affects dependency upgrades in the Spark on K8s backend and the Minikube 
> version that has to be supported for testing. Another issue is what users 
> actually want at the time of a release. Some popular vendors such as EKS 
> ([https://aws.amazon.com/eks/faqs/]) have their own release roadmap.
> Follow the comments in a recent discussion on the topic: 
> [https://github.com/apache/spark/pull/23814]
> Clearly we need a strategy for this.
> A couple of options for the current state of things:
> a) Support only the last two versions, but that leaves out a version that 
> still receives patches.
> b) Support only the latest, which makes testing easier, but leaves out other 
> currently maintained versions.
> A good strategy will optimize the following:
> 1) the percentage of users satisfied at release time,
> 2) how long it takes to support the latest K8s version, and
> 3) the testing requirements, e.g. the Minikube versions used.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26973) Kubernetes version support strategy on test nodes / backend

2019-02-22 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-26973:

Description: 
Kubernetes has a policy of supporting three minor releases, and the currently 
supported ones are listed here: 
[https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]

Moving from release 1.x to 1.(x+1) happens roughly every 100 days: 
[https://gravitational.com/blog/kubernetes-release-cycle]

This affects dependency upgrades in the Spark on K8s backend and the Minikube 
version that has to be supported for testing. Another issue is what users 
actually want at the time of a release. Some popular vendors such as EKS 
([https://aws.amazon.com/eks/faqs/]) have their own release roadmap and may not 
catch up quickly.

Follow the comments in a recent discussion on the topic: 
[https://github.com/apache/spark/pull/23814]

Clearly we need a strategy for this.

A couple of options for the current state of things:

a) Support only the last two versions, but that leaves out a version that still 
receives patches.

b) Support only the latest, which makes testing easier, but leaves out other 
currently maintained versions.

A good strategy will optimize the following:

1) the percentage of users satisfied at release time,

2) how long it takes to support the latest K8s version, and

3) the testing requirements, e.g. the Minikube versions used.

 

  was:
Kubernetes has a policy for supporting three minor releases and the current 
ones are defined here: 
[https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]

Moving from release 1.x to 1.(x+1) happens roughly every 100 
days:[https://gravitational.com/blog/kubernetes-release-cycle.]

This has an effect on dependencies upgrade at the Spark on K8s backend and the 
version of Minikube required to be supported for testing. One other issue is 
what the users actually want at the given time of a release. Some popular 
vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap for 
releases and may not catch up fast.

Follow the comments for a recent discussion on the topic: 
[https://github.com/apache/spark/pull/23814.]

Clearly we need a strategy for this.

A couple of options for the current state of things:

a) Support only the last two version, but that leaves out a version that still 
receives patches.

b) Support only the latest, which makes testing easier, but leaves out other 
currently maintained version.

A good strategy will optimize the following:

1) percentage of users satisfied at release time.

2) how long it takes to support the latest K8s version

3) testing requirements eg. minikube versions used

 


> Kubernetes version support strategy on test nodes / backend
> ---
>
> Key: SPARK-26973
> URL: https://issues.apache.org/jira/browse/SPARK-26973
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Kubernetes has a policy of supporting three minor releases, and the currently 
> supported ones are listed here: 
> [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]
> Moving from release 1.x to 1.(x+1) happens roughly every 100 days: 
> [https://gravitational.com/blog/kubernetes-release-cycle]
> This affects dependency upgrades in the Spark on K8s backend and the Minikube 
> version that has to be supported for testing. Another issue is what users 
> actually want at the time of a release. Some popular vendors such as EKS 
> ([https://aws.amazon.com/eks/faqs/]) have their own release roadmap and may not 
> catch up quickly.
> Follow the comments in a recent discussion on the topic: 
> [https://github.com/apache/spark/pull/23814]
> Clearly we need a strategy for this.
> A couple of options for the current state of things:
> a) Support only the last two versions, but that leaves out a version that 
> still receives patches.
> b) Support only the latest, which makes testing easier, but leaves out other 
> currently maintained versions.
> A good strategy will optimize the following:
> 1) the percentage of users satisfied at release time,
> 2) how long it takes to support the latest K8s version, and
> 3) the testing requirements, e.g. the Minikube versions used.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-26973) Kubernetes version support strategy on test nodes / backend

2019-02-22 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-26973:

Description: 
Kubernetes has a policy of supporting three minor releases, and the currently 
supported ones are listed here: 
[https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]

Moving from release 1.x to 1.(x+1) happens roughly every 100 days: 
[https://gravitational.com/blog/kubernetes-release-cycle]

This affects dependency upgrades in the Spark on K8s backend and the Minikube 
version that has to be supported for testing. Another issue is what users 
actually want at the time of a release. Some popular vendors such as EKS 
([https://aws.amazon.com/eks/faqs/]) have their own release roadmap and may not 
catch up quickly.

Follow the comments in a recent discussion on the topic: 
[https://github.com/apache/spark/pull/23814]

Clearly we need a strategy for this.

A couple of options for the current state of things:

a) Support only the last two versions, but that leaves out a version that still 
receives patches.

b) Support only the latest, which makes testing easier, but leaves out other 
currently maintained versions.

A good strategy will optimize the following:

1) the percentage of users satisfied at release time,

2) how long it takes to support the latest K8s version, and

3) the testing requirements, e.g. the Minikube versions used.

 

  was:
Kubernetes has a policy for supporting three minor releases and the current 
ones are defined here: 
[https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]

Moving from release 1.x to 1.(x+1) happens roughly every 100 
days:[https://gravitational.com/blog/kubernetes-release-cycle.]

This has an effect on dependencies upgrade at the Spark on K8s backend and the 
version of Minikube required to be supported for testing. One other issue is 
what the users actually want at the given time of a release. Some popular 
vendors like EKS([https://aws.amazon.com/eks/faqs/] have their own roadmap for 
releases.

Follow the comments a recent discussion on the topic: 
[https://github.com/apache/spark/pull/23814.]

Clearly we need a strategy for this.

A couple of options for the current state of things:

a) Support only the last two version, but that leaves out a version that still 
receives patches.

b) Support only the latest, which makes testing easier, but leaves out other 
currently maintained version.

A good strategy will optimize the following:

1) percentage of users satisfied at release time.

2) how long it takes to support the latest K8s version

3) testing requirements eg. minikube versions used

 


> Kubernetes version support strategy on test nodes / backend
> ---
>
> Key: SPARK-26973
> URL: https://issues.apache.org/jira/browse/SPARK-26973
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Kubernetes has a policy of supporting three minor releases, and the currently 
> supported ones are listed here: 
> [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]
> Moving from release 1.x to 1.(x+1) happens roughly every 100 days: 
> [https://gravitational.com/blog/kubernetes-release-cycle]
> This affects dependency upgrades in the Spark on K8s backend and the Minikube 
> version that has to be supported for testing. Another issue is what users 
> actually want at the time of a release. Some popular vendors such as EKS 
> ([https://aws.amazon.com/eks/faqs/]) have their own release roadmap and may not 
> catch up quickly.
> Follow the comments in a recent discussion on the topic: 
> [https://github.com/apache/spark/pull/23814]
> Clearly we need a strategy for this.
> A couple of options for the current state of things:
> a) Support only the last two versions, but that leaves out a version that 
> still receives patches.
> b) Support only the latest, which makes testing easier, but leaves out other 
> currently maintained versions.
> A good strategy will optimize the following:
> 1) the percentage of users satisfied at release time,
> 2) how long it takes to support the latest K8s version, and
> 3) the testing requirements, e.g. the Minikube versions used.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26973) Kubernetes version support strategy on test nodes / backend

2019-02-22 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-26973:

Summary: Kubernetes version support strategy on test nodes / backend  (was: 
Kubernetes version support strategy on test nodes and for the backend)

> Kubernetes version support strategy on test nodes / backend
> ---
>
> Key: SPARK-26973
> URL: https://issues.apache.org/jira/browse/SPARK-26973
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Kubernetes has a policy of supporting three minor releases, and the currently 
> supported ones are listed here: 
> [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]
> Moving from release 1.x to 1.(x+1) happens roughly every 100 days: 
> [https://gravitational.com/blog/kubernetes-release-cycle]
> This affects dependency upgrades in the Spark on K8s backend and the Minikube 
> version that has to be supported for testing. Another issue is what users 
> actually want at the time of a release. Some popular vendors such as EKS 
> ([https://aws.amazon.com/eks/faqs/]) have their own release roadmap.
> Follow the comments in a recent discussion on the topic: 
> [https://github.com/apache/spark/pull/23814]
> Clearly we need a strategy for this.
> A couple of options for the current state of things:
> a) Support only the last two versions, but that leaves out a version that 
> still receives patches.
> b) Support only the latest, which makes testing easier, but leaves out other 
> currently maintained versions.
> A good strategy will optimize the following:
> 1) the percentage of users satisfied at release time,
> 2) how long it takes to support the latest K8s version, and
> 3) the testing requirements, e.g. the Minikube versions used.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26973) Kubernetes version support strategy on test nodes and for the backend

2019-02-22 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-26973:
---

 Summary: Kubernetes version support strategy on test nodes and for 
the backend
 Key: SPARK-26973
 URL: https://issues.apache.org/jira/browse/SPARK-26973
 Project: Spark
  Issue Type: Test
  Components: Kubernetes
Affects Versions: 3.0.0
Reporter: Stavros Kontopoulos


Kubernetes has a policy of supporting three minor releases, and the currently 
supported ones are listed here: 
[https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]

Moving from release 1.x to 1.(x+1) happens roughly every 100 days: 
[https://gravitational.com/blog/kubernetes-release-cycle]

This affects dependency upgrades in the Spark on K8s backend and the Minikube 
version that has to be supported for testing. Another issue is what users 
actually want at the time of a release. Some popular vendors such as EKS 
([https://aws.amazon.com/eks/faqs/]) have their own release roadmap.

Follow the comments in a recent discussion on the topic: 
[https://github.com/apache/spark/pull/23814]

Clearly we need a strategy for this.

A couple of options for the current state of things:

a) Support only the last two versions, but that leaves out a version that still 
receives patches.

b) Support only the latest, which makes testing easier, but leaves out other 
currently maintained versions.

A good strategy will optimize the following:

1) the percentage of users satisfied at release time,

2) how long it takes to support the latest K8s version, and

3) the testing requirements, e.g. the Minikube versions used.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26973) Kubernetes version support strategy on test nodes and for the backend

2019-02-22 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775339#comment-16775339
 ] 

Stavros Kontopoulos commented on SPARK-26973:
-

[~foxish] [~srowen] [~shaneknapp] [~vanzin] fyi.

> Kubernetes version support strategy on test nodes and for the backend
> -
>
> Key: SPARK-26973
> URL: https://issues.apache.org/jira/browse/SPARK-26973
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Kubernetes has a policy of supporting three minor releases, and the currently 
> supported ones are listed here: 
> [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]
> Moving from release 1.x to 1.(x+1) happens roughly every 100 days: 
> [https://gravitational.com/blog/kubernetes-release-cycle]
> This affects dependency upgrades in the Spark on K8s backend and the Minikube 
> version that has to be supported for testing. Another issue is what users 
> actually want at the time of a release. Some popular vendors such as EKS 
> ([https://aws.amazon.com/eks/faqs/]) have their own release roadmap.
> Follow the comments in a recent discussion on the topic: 
> [https://github.com/apache/spark/pull/23814]
> Clearly we need a strategy for this.
> A couple of options for the current state of things:
> a) Support only the last two versions, but that leaves out a version that 
> still receives patches.
> b) Support only the latest, which makes testing easier, but leaves out other 
> currently maintained versions.
> A good strategy will optimize the following:
> 1) the percentage of users satisfied at release time,
> 2) how long it takes to support the latest K8s version, and
> 3) the testing requirements, e.g. the Minikube versions used.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20597) KafkaSourceProvider falls back on path as synonym for topic

2019-02-22 Thread Valeria Vasylieva (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775279#comment-16775279
 ] 

Valeria Vasylieva commented on SPARK-20597:
---

[~jlaskowski] I have added the PR for this issue; could you please take a look at it? 
Thank you.

> KafkaSourceProvider falls back on path as synonym for topic
> ---
>
> Key: SPARK-20597
> URL: https://issues.apache.org/jira/browse/SPARK-20597
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>  Labels: starter
>
> # {{KafkaSourceProvider}} supports a {{topic}} option that sets the Kafka topic 
> to save a DataFrame's rows to
> # {{KafkaSourceProvider}} can use a {{topic}} column to assign rows to Kafka 
> topics for writing
> What seems like a quite interesting option is to support {{start(path: String)}} 
> as the lowest-precedence option, in which {{path}} would designate the default 
> topic when no other options are used.
> {code}
> df.writeStream.format("kafka").start("topic")
> {code}
> See 
> http://apache-spark-developers-list.1001551.n3.nabble.com/KafkaSourceProvider-Why-topic-option-and-column-without-reverting-to-path-as-the-least-priority-td21458.html
>  for discussion
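
As an illustration of the proposed precedence, a minimal standalone sketch (a hypothetical helper, not Spark's actual KafkaSourceProvider code): the explicit {{topic}} option wins, and the {{path}} argument of {{start(path)}} only serves as the default topic when no other option is set.

{code}
import java.util.HashMap;
import java.util.Map;

public class TopicFallbackSketch {

    // Resolve the default topic: an explicit "topic" option takes precedence,
    // and the path passed to start(path) is only the last-resort fallback.
    static String resolveDefaultTopic(Map<String, String> options, String path) {
        String topicOption = options.get("topic");
        return topicOption != null ? topicOption : path;
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        System.out.println(resolveDefaultTopic(options, "events"));  // events (falls back on path)
        options.put("topic", "clicks");
        System.out.println(resolveDefaultTopic(options, "events"));  // clicks (option wins)
    }
}
{code}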



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26972) Issue with CSV import and inferSchema set to true

2019-02-22 Thread Jean Georges Perrin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean Georges Perrin updated SPARK-26972:

Attachment: ComplexCsvToDataframeApp.java

> Issue with CSV import and inferSchema set to true
> -
>
> Key: SPARK-26972
> URL: https://issues.apache.org/jira/browse/SPARK-26972
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.3, 2.3.3, 2.4.0
> Environment: Java 8/Scala 2.11/MacOs
>Reporter: Jean Georges Perrin
>Priority: Major
> Attachments: ComplexCsvToDataframeApp.java, 
> ComplexCsvToDataframeWithSchemaApp.java, issue.txt
>
>
>  
>  
> Issue with CSV import and inferSchema set to true.
> I found a few discrepancies while working with inferSchema set to true in CSV 
> ingestion.
> Given the following CSV:
> {{id;authorId;title;releaseDate;link}}
> {{1;1;Fantastic Beasts and Where to Find Them: The Original 
> Screenplay;11/18/16;http://amzn.to/2kup94P}}
> {{2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry 
> Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP}}
> {{3;1;*The Tales of Beedle the Bard, Standard Edition (Harry 
> Potter)*;12/4/08;http://amzn.to/2kYezqr}}
> {{4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition 
> (Harry Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n}}
> {{5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the 
> Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT}}
> {{6;2;*Development Tools in 2006: any Room for a 4GL-style Language? }}
> {{An independent study by Jean Georges Perrin, IIUG Board 
> Member*;12/28/16;http://amzn.to/2vBxOe1}}
> {{7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav}}
> {{8;3;A Connecticut Yankee in King Arthur's 
> Court;6/17/17;http://amzn.to/2x1NuoD}}
> {{10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA}}
> {{11;4;Diderot Encyclopedia: The Complete Illustrations 
> 1762-1777;;http://amzn.to/2i2zo3I}}
> {{12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ}}
> {{13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW}}
> {{14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk}}
> {{15;7;Soft Skills: The software developer's life 
> manual;12/29/14;http://amzn.to/2zNnSyn}}
> {{16;8;Of Mice and Men;;http://amzn.to/2zJjXoc}}
> {{17;9;*Java 8 in Action: Lambdas; Streams; and functional-style 
> programming*;8/28/14;http://amzn.to/2isdqoL}}
> {{18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY}}
> {{19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG}}
> {{20;14;*Fables choisies; mises en vers par M. de La 
> Fontaine*;9/1/1999;http://amzn.to/2yRH10W}}
> {{21;15;Discourse on Method and Meditations on First 
> Philosophy;6/15/1999;http://amzn.to/2hwB8zc}}
> {{22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo}}
> {{23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo}}
> And this code:
> {{Dataset<Row> df = spark.read().format("csv")}}
> {{ .option("header", "true")}}
> {{ .option("multiline", true)}}
> {{ .option("sep", ";")}}
> {{ .option("quote", "*")}}
> {{ .option("dateFormat", "M/d/y")}}
> {{ .option("inferSchema", true)}}
> {{ .load("data/books.csv");}}
> {{df.show(7);}}
> {{df.printSchema();}}
> h1. In Spark v2.0.1
> {{Excerpt of the dataframe content:}}
> {{+---+++---++}}
> {{| id|authorId| title|releaseDate| link|}}
> {{+---+++---++}}
> {{| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|}}
> {{| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|}}
> {{| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|}}
> {{| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|}}
> {{| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|}}
> {{| 6| 2|Development Tools...| 12/28/16|http://amzn.to/2v...|}}
> {{| 7| 3|Adventures of Huc...| 5/26/94|http://amzn.to/2w...|}}
> {{+---+++---++}}
> {{only showing top 7 rows}}{{Dataframe's schema:}}
> {{root}}
> {{ |-- id: integer (nullable = true)}}
> {{ |-- authorId: integer (nullable = true)}}
> {{ |-- title: string (nullable = true)}}
> {{ |-- releaseDate: string (nullable = true)}}
> {{ |-- link: string (nullable = true)}}
> *This is fine and the expected output*.
> h1. Using Apache Spark v2.1.3
> Excerpt of the dataframe content:
> {{++++---++}}
> {{ | id|authorId| title|releaseDate| link|}}
> {{ 
> ++++---++}}
> {{ | 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|}}
> {{ | 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|}}
> {{ | 3| 1|The Tales of Beed...| 12/4/

[jira] [Created] (SPARK-26972) Issue with CSV import and inferSchema set to true

2019-02-22 Thread Jean Georges Perrin (JIRA)
Jean Georges Perrin created SPARK-26972:
---

 Summary: Issue with CSV import and inferSchema set to true
 Key: SPARK-26972
 URL: https://issues.apache.org/jira/browse/SPARK-26972
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.4.0, 2.3.3, 2.1.3
 Environment: Java 8/Scala 2.11/MacOs
Reporter: Jean Georges Perrin


 

 

Issue with CSV import and inferSchema set to true.

I found a few discrepancies while working with inferSchema set to true in CSV 
ingestion.

Given the following CSV:

{{id;authorId;title;releaseDate;link}}
{{1;1;Fantastic Beasts and Where to Find Them: The Original 
Screenplay;11/18/16;http://amzn.to/2kup94P}}
{{2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry 
Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP}}
{{3;1;*The Tales of Beedle the Bard, Standard Edition (Harry 
Potter)*;12/4/08;http://amzn.to/2kYezqr}}
{{4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry 
Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n}}
{{5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the 
Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT}}
{{6;2;*Development Tools in 2006: any Room for a 4GL-style Language? }}
{{An independent study by Jean Georges Perrin, IIUG Board 
Member*;12/28/16;http://amzn.to/2vBxOe1}}
{{7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav}}
{{8;3;A Connecticut Yankee in King Arthur's 
Court;6/17/17;http://amzn.to/2x1NuoD}}
{{10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA}}
{{11;4;Diderot Encyclopedia: The Complete Illustrations 
1762-1777;;http://amzn.to/2i2zo3I}}
{{12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ}}
{{13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW}}
{{14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk}}
{{15;7;Soft Skills: The software developer's life 
manual;12/29/14;http://amzn.to/2zNnSyn}}
{{16;8;Of Mice and Men;;http://amzn.to/2zJjXoc}}
{{17;9;*Java 8 in Action: Lambdas; Streams; and functional-style 
programming*;8/28/14;http://amzn.to/2isdqoL}}
{{18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY}}
{{19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG}}
{{20;14;*Fables choisies; mises en vers par M. de La 
Fontaine*;9/1/1999;http://amzn.to/2yRH10W}}
{{21;15;Discourse on Method and Meditations on First 
Philosophy;6/15/1999;http://amzn.to/2hwB8zc}}
{{22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo}}
{{23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo}}

And this code:

{{Dataset<Row> df = spark.read().format("csv")}}
{{ .option("header", "true")}}
{{ .option("multiline", true)}}
{{ .option("sep", ";")}}
{{ .option("quote", "*")}}
{{ .option("dateFormat", "M/d/y")}}
{{ .option("inferSchema", true)}}
{{ .load("data/books.csv");}}
{{df.show(7);}}
{{df.printSchema();}}
h1. In Spark v2.0.1

{{Excerpt of the dataframe content:}}
{{+---+++---++}}
{{| id|authorId| title|releaseDate| link|}}
{{+---+++---++}}
{{| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|}}
{{| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|}}
{{| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|}}
{{| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|}}
{{| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|}}
{{| 6| 2|Development Tools...| 12/28/16|http://amzn.to/2v...|}}
{{| 7| 3|Adventures of Huc...| 5/26/94|http://amzn.to/2w...|}}
{{+---+++---++}}
{{only showing top 7 rows}}{{Dataframe's schema:}}
{{root}}
{{ |-- id: integer (nullable = true)}}
{{ |-- authorId: integer (nullable = true)}}
{{ |-- title: string (nullable = true)}}
{{ |-- releaseDate: string (nullable = true)}}
{{ |-- link: string (nullable = true)}}

*This is fine and the expected output*.
h1. Using Apache Spark v2.1.3

Excerpt of the dataframe content:

{{++++---++}}
{{ | id|authorId| title|releaseDate| link|}}
{{ 
++++---++}}
{{ | 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|}}
{{ | 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|}}
{{ | 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|}}
{{ | 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|}}
{{ | 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|}}
{{ | 6| 2|Development Tools...| null| null|}}
{{ |An independent st...|12/28/16|http://amzn.to/2v...| null| null|}}
{{ 
++++---++}}
{{ only showing top 7 rows}}{{Dataframe's schema:}}
{{ root}}
{{ |-- id: string (nullable = true)}}
{{ |-- authorId: string (nullable = true)}}
{{
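
For reference, a minimal self-contained sketch of the usual workaround, supplying an explicit schema instead of relying on inferSchema. It assumes the column layout shown above and is only an illustration, not the content of the attached ComplexCsvToDataframeWithSchemaApp.java:

{code}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class CsvExplicitSchemaSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("CSV ingestion with an explicit schema")
        .master("local[*]")
        .getOrCreate();

    // Declare the expected columns up front instead of letting Spark infer them.
    StructType schema = new StructType()
        .add("id", DataTypes.IntegerType)
        .add("authorId", DataTypes.IntegerType)
        .add("title", DataTypes.StringType)
        .add("releaseDate", DataTypes.StringType)
        .add("link", DataTypes.StringType);

    Dataset<Row> df = spark.read().format("csv")
        .option("header", "true")
        .option("multiline", true)
        .option("sep", ";")
        .option("quote", "*")
        .option("dateFormat", "M/d/y")
        .schema(schema)            // replaces .option("inferSchema", true)
        .load("data/books.csv");

    df.show(7);
    df.printSchema();
    spark.stop();
  }
}
{code}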

[jira] [Commented] (SPARK-26972) Issue with CSV import and inferSchema set to true

2019-02-22 Thread Jean Georges Perrin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775218#comment-16775218
 ] 

Jean Georges Perrin commented on SPARK-26972:
-

I added the code as attachments; Jira is breaking my formatting :(

> Issue with CSV import and inferSchema set to true
> -
>
> Key: SPARK-26972
> URL: https://issues.apache.org/jira/browse/SPARK-26972
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.3, 2.3.3, 2.4.0
> Environment: Java 8/Scala 2.11/MacOs
>Reporter: Jean Georges Perrin
>Priority: Major
> Attachments: ComplexCsvToDataframeApp.java, 
> ComplexCsvToDataframeWithSchemaApp.java, books.csv, issue.txt, pom.xml
>
>
>  
>  
> Issue with CSV import and inferSchema set to true.
> I found a few discrepancies while working with inferSchema set to true in CSV 
> ingestion.
> Given the following CSV:
> {{id;authorId;title;releaseDate;link}}
> {{1;1;Fantastic Beasts and Where to Find Them: The Original 
> Screenplay;11/18/16;http://amzn.to/2kup94P}}
> {{2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry 
> Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP}}
> {{3;1;*The Tales of Beedle the Bard, Standard Edition (Harry 
> Potter)*;12/4/08;http://amzn.to/2kYezqr}}
> {{4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition 
> (Harry Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n}}
> {{5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the 
> Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT}}
> {{6;2;*Development Tools in 2006: any Room for a 4GL-style Language? }}
> {{An independent study by Jean Georges Perrin, IIUG Board 
> Member*;12/28/16;http://amzn.to/2vBxOe1}}
> {{7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav}}
> {{8;3;A Connecticut Yankee in King Arthur's 
> Court;6/17/17;http://amzn.to/2x1NuoD}}
> {{10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA}}
> {{11;4;Diderot Encyclopedia: The Complete Illustrations 
> 1762-1777;;http://amzn.to/2i2zo3I}}
> {{12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ}}
> {{13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW}}
> {{14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk}}
> {{15;7;Soft Skills: The software developer's life 
> manual;12/29/14;http://amzn.to/2zNnSyn}}
> {{16;8;Of Mice and Men;;http://amzn.to/2zJjXoc}}
> {{17;9;*Java 8 in Action: Lambdas; Streams; and functional-style 
> programming*;8/28/14;http://amzn.to/2isdqoL}}
> {{18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY}}
> {{19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG}}
> {{20;14;*Fables choisies; mises en vers par M. de La 
> Fontaine*;9/1/1999;http://amzn.to/2yRH10W}}
> {{21;15;Discourse on Method and Meditations on First 
> Philosophy;6/15/1999;http://amzn.to/2hwB8zc}}
> {{22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo}}
> {{23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo}}
> And this code:
> {{Dataset<Row> df = spark.read().format("csv")}}
> {{ .option("header", "true")}}
> {{ .option("multiline", true)}}
> {{ .option("sep", ";")}}
> {{ .option("quote", "*")}}
> {{ .option("dateFormat", "M/d/y")}}
> {{ .option("inferSchema", true)}}
> {{ .load("data/books.csv");}}
> {{df.show(7);}}
> {{df.printSchema();}}
> h1. In Spark v2.0.1
> {{Excerpt of the dataframe content:}}
> {{+---+++---++}}
> {{| id|authorId| title|releaseDate| link|}}
> {{+---+++---++}}
> {{| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|}}
> {{| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|}}
> {{| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|}}
> {{| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|}}
> {{| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|}}
> {{| 6| 2|Development Tools...| 12/28/16|http://amzn.to/2v...|}}
> {{| 7| 3|Adventures of Huc...| 5/26/94|http://amzn.to/2w...|}}
> {{+---+++---++}}
> {{only showing top 7 rows}}{{Dataframe's schema:}}
> {{root}}
> {{ |-- id: integer (nullable = true)}}
> {{ |-- authorId: integer (nullable = true)}}
> {{ |-- title: string (nullable = true)}}
> {{ |-- releaseDate: string (nullable = true)}}
> {{ |-- link: string (nullable = true)}}
> *This is fine and the expected output*.
> h1. Using Apache Spark v2.1.3
> Excerpt of the dataframe content:
> {{++++---++}}
> {{ | id|authorId| title|releaseDate| link|}}
> {{ 
> ++++---++}}
> {{ | 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|}}
> {{ |

[jira] [Updated] (SPARK-26972) Issue with CSV import and inferSchema set to true

2019-02-22 Thread Jean Georges Perrin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean Georges Perrin updated SPARK-26972:

Attachment: pom.xml

> Issue with CSV import and inferSchema set to true
> -
>
> Key: SPARK-26972
> URL: https://issues.apache.org/jira/browse/SPARK-26972
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.3, 2.3.3, 2.4.0
> Environment: Java 8/Scala 2.11/MacOs
>Reporter: Jean Georges Perrin
>Priority: Major
> Attachments: ComplexCsvToDataframeApp.java, 
> ComplexCsvToDataframeWithSchemaApp.java, books.csv, issue.txt, pom.xml
>
>
>  
>  
> Issue with CSV import and inferSchema set to true.
> I found a few discrepancies while working with inferSchema set to true in CSV 
> ingestion.
> Given the following CSV:
> {{id;authorId;title;releaseDate;link}}
> {{1;1;Fantastic Beasts and Where to Find Them: The Original 
> Screenplay;11/18/16;http://amzn.to/2kup94P}}
> {{2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry 
> Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP}}
> {{3;1;*The Tales of Beedle the Bard, Standard Edition (Harry 
> Potter)*;12/4/08;http://amzn.to/2kYezqr}}
> {{4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition 
> (Harry Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n}}
> {{5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the 
> Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT}}
> {{6;2;*Development Tools in 2006: any Room for a 4GL-style Language? }}
> {{An independent study by Jean Georges Perrin, IIUG Board 
> Member*;12/28/16;http://amzn.to/2vBxOe1}}
> {{7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav}}
> {{8;3;A Connecticut Yankee in King Arthur's 
> Court;6/17/17;http://amzn.to/2x1NuoD}}
> {{10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA}}
> {{11;4;Diderot Encyclopedia: The Complete Illustrations 
> 1762-1777;;http://amzn.to/2i2zo3I}}
> {{12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ}}
> {{13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW}}
> {{14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk}}
> {{15;7;Soft Skills: The software developer's life 
> manual;12/29/14;http://amzn.to/2zNnSyn}}
> {{16;8;Of Mice and Men;;http://amzn.to/2zJjXoc}}
> {{17;9;*Java 8 in Action: Lambdas; Streams; and functional-style 
> programming*;8/28/14;http://amzn.to/2isdqoL}}
> {{18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY}}
> {{19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG}}
> {{20;14;*Fables choisies; mises en vers par M. de La 
> Fontaine*;9/1/1999;http://amzn.to/2yRH10W}}
> {{21;15;Discourse on Method and Meditations on First 
> Philosophy;6/15/1999;http://amzn.to/2hwB8zc}}
> {{22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo}}
> {{23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo}}
> And this code:
> {{Dataset df = spark.read().format("csv")}}
> {{ .option("header", "true")}}
> {{ .option("multiline", true)}}
> {{ .option("sep", ";")}}
> {{ .option("quote", "*")}}
> {{ .option("dateFormat", "M/d/y")}}
> {{ .option("inferSchema", true)}}
> {{ .load("data/books.csv");}}
> {{df.show(7);}}
> {{df.printSchema();}}
> h1. In Spark v2.0.1
> {{Excerpt of the dataframe content:}}
> {{+---+++---++}}
> {{| id|authorId| title|releaseDate| link|}}
> {{+---+++---++}}
> {{| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|}}
> {{| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|}}
> {{| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|}}
> {{| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|}}
> {{| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|}}
> {{| 6| 2|Development Tools...| 12/28/16|http://amzn.to/2v...|}}
> {{| 7| 3|Adventures of Huc...| 5/26/94|http://amzn.to/2w...|}}
> {{+---+++---++}}
> {{only showing top 7 rows}}
> {{Dataframe's schema:}}
> {{root}}
> {{ |-- id: integer (nullable = true)}}
> {{ |-- authorId: integer (nullable = true)}}
> {{ |-- title: string (nullable = true)}}
> {{ |-- releaseDate: string (nullable = true)}}
> {{ |-- link: string (nullable = true)}}
> *This is fine and the expected output*.
> h1. Using Apache Spark v2.1.3
> Excerpt of the dataframe content:
> {{++++---++}}
> {{ | id|authorId| title|releaseDate| link|}}
> {{ 
> ++++---++}}
> {{ | 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|}}
> {{ | 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|}}
> {{ | 3| 1|The Tales of Beed...| 12/4/08

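The usual way to sidestep this kind of inference drift is to skip {{inferSchema}} and declare the schema explicitly, which is presumably what the attached ComplexCsvToDataframeWithSchemaApp.java does. Below is a minimal PySpark sketch of that workaround (the report's code is Java; the column names and types are assumptions read off the sample data, with releaseDate kept as a string):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("csv-explicit-schema").getOrCreate()

# Declare the column types up front so they no longer depend on inference,
# which is what changes between the Spark versions compared above.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("authorId", IntegerType(), True),
    StructField("title", StringType(), True),
    StructField("releaseDate", StringType(), True),  # parse to a date later if needed
    StructField("link", StringType(), True),
])

df = (spark.read.format("csv")
      .option("header", "true")
      .option("multiline", "true")
      .option("sep", ";")
      .option("quote", "*")
      .schema(schema)
      .load("data/books.csv"))

df.show(7)
df.printSchema()
{code}
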
[jira] [Updated] (SPARK-26972) Issue with CSV import and inferSchema set to true

2019-02-22 Thread Jean Georges Perrin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean Georges Perrin updated SPARK-26972:

Attachment: books.csv

> Issue with CSV import and inferSchema set to true
> -
>
> Key: SPARK-26972
> URL: https://issues.apache.org/jira/browse/SPARK-26972
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.3, 2.3.3, 2.4.0
> Environment: Java 8/Scala 2.11/MacOs
>Reporter: Jean Georges Perrin
>Priority: Major
> Attachments: ComplexCsvToDataframeApp.java, 
> ComplexCsvToDataframeWithSchemaApp.java, books.csv, issue.txt, pom.xml
>
>
>  
>  
> Issue with CSV import and inferSchema set to true.
> I found a few discrepancies while working with inferSchema set to true in CSV 
> ingestion.
> Given the following CSV:
> {{id;authorId;title;releaseDate;link}}
> {{1;1;Fantastic Beasts and Where to Find Them: The Original 
> Screenplay;11/18/16;http://amzn.to/2kup94P}}
> {{2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry 
> Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP}}
> {{3;1;*The Tales of Beedle the Bard, Standard Edition (Harry 
> Potter)*;12/4/08;http://amzn.to/2kYezqr}}
> {{4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition 
> (Harry Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n}}
> {{5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the 
> Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT}}
> {{6;2;*Development Tools in 2006: any Room for a 4GL-style Language? }}
> {{An independent study by Jean Georges Perrin, IIUG Board 
> Member*;12/28/16;http://amzn.to/2vBxOe1}}
> {{7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav}}
> {{8;3;A Connecticut Yankee in King Arthur's 
> Court;6/17/17;http://amzn.to/2x1NuoD}}
> {{10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA}}
> {{11;4;Diderot Encyclopedia: The Complete Illustrations 
> 1762-1777;;http://amzn.to/2i2zo3I}}
> {{12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ}}
> {{13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW}}
> {{14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk}}
> {{15;7;Soft Skills: The software developer's life 
> manual;12/29/14;http://amzn.to/2zNnSyn}}
> {{16;8;Of Mice and Men;;http://amzn.to/2zJjXoc}}
> {{17;9;*Java 8 in Action: Lambdas; Streams; and functional-style 
> programming*;8/28/14;http://amzn.to/2isdqoL}}
> {{18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY}}
> {{19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG}}
> {{20;14;*Fables choisies; mises en vers par M. de La 
> Fontaine*;9/1/1999;http://amzn.to/2yRH10W}}
> {{21;15;Discourse on Method and Meditations on First 
> Philosophy;6/15/1999;http://amzn.to/2hwB8zc}}
> {{22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo}}
> {{23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo}}
> And this code:
> {{Dataset df = spark.read().format("csv")}}
> {{ .option("header", "true")}}
> {{ .option("multiline", true)}}
> {{ .option("sep", ";")}}
> {{ .option("quote", "*")}}
> {{ .option("dateFormat", "M/d/y")}}
> {{ .option("inferSchema", true)}}
> {{ .load("data/books.csv");}}
> {{df.show(7);}}
> {{df.printSchema();}}
> h1. In Spark v2.0.1
> {{Excerpt of the dataframe content:}}
> {{+---+++---++}}
> {{| id|authorId| title|releaseDate| link|}}
> {{+---+++---++}}
> {{| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|}}
> {{| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|}}
> {{| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|}}
> {{| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|}}
> {{| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|}}
> {{| 6| 2|Development Tools...| 12/28/16|http://amzn.to/2v...|}}
> {{| 7| 3|Adventures of Huc...| 5/26/94|http://amzn.to/2w...|}}
> {{+---+++---++}}
> {{only showing top 7 rows}}
> {{Dataframe's schema:}}
> {{root}}
> {{ |-- id: integer (nullable = true)}}
> {{ |-- authorId: integer (nullable = true)}}
> {{ |-- title: string (nullable = true)}}
> {{ |-- releaseDate: string (nullable = true)}}
> {{ |-- link: string (nullable = true)}}
> *This is fine and the expected output*.
> h1. Using Apache Spark v2.1.3
> Excerpt of the dataframe content:
> {{++++---++}}
> {{ | id|authorId| title|releaseDate| link|}}
> {{ 
> ++++---++}}
> {{ | 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|}}
> {{ | 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|}}
> {{ | 3| 1|The Tales of Beed...| 12/4/

[jira] [Updated] (SPARK-26972) Issue with CSV import and inferSchema set to true

2019-02-22 Thread Jean Georges Perrin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean Georges Perrin updated SPARK-26972:

Attachment: issue.txt

> Issue with CSV import and inferSchema set to true
> -
>
> Key: SPARK-26972
> URL: https://issues.apache.org/jira/browse/SPARK-26972
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.3, 2.3.3, 2.4.0
> Environment: Java 8/Scala 2.11/MacOs
>Reporter: Jean Georges Perrin
>Priority: Major
> Attachments: issue.txt
>
>
>  
>  
> Issue with CSV import and inferSchema set to true.
> I found a few discrepancies while working with inferSchema set to true in CSV 
> ingestion.
> Given the following CSV:
> {{id;authorId;title;releaseDate;link}}
> {{1;1;Fantastic Beasts and Where to Find Them: The Original 
> Screenplay;11/18/16;http://amzn.to/2kup94P}}
> {{2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry 
> Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP}}
> {{3;1;*The Tales of Beedle the Bard, Standard Edition (Harry 
> Potter)*;12/4/08;http://amzn.to/2kYezqr}}
> {{4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition 
> (Harry Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n}}
> {{5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the 
> Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT}}
> {{6;2;*Development Tools in 2006: any Room for a 4GL-style Language? }}
> {{An independent study by Jean Georges Perrin, IIUG Board 
> Member*;12/28/16;http://amzn.to/2vBxOe1}}
> {{7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav}}
> {{8;3;A Connecticut Yankee in King Arthur's 
> Court;6/17/17;http://amzn.to/2x1NuoD}}
> {{10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA}}
> {{11;4;Diderot Encyclopedia: The Complete Illustrations 
> 1762-1777;;http://amzn.to/2i2zo3I}}
> {{12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ}}
> {{13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW}}
> {{14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk}}
> {{15;7;Soft Skills: The software developer's life 
> manual;12/29/14;http://amzn.to/2zNnSyn}}
> {{16;8;Of Mice and Men;;http://amzn.to/2zJjXoc}}
> {{17;9;*Java 8 in Action: Lambdas; Streams; and functional-style 
> programming*;8/28/14;http://amzn.to/2isdqoL}}
> {{18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY}}
> {{19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG}}
> {{20;14;*Fables choisies; mises en vers par M. de La 
> Fontaine*;9/1/1999;http://amzn.to/2yRH10W}}
> {{21;15;Discourse on Method and Meditations on First 
> Philosophy;6/15/1999;http://amzn.to/2hwB8zc}}
> {{22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo}}
> {{23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo}}
> And this code:
> {{Dataset df = spark.read().format("csv")}}
> {{ .option("header", "true")}}
> {{ .option("multiline", true)}}
> {{ .option("sep", ";")}}
> {{ .option("quote", "*")}}
> {{ .option("dateFormat", "M/d/y")}}
> {{ .option("inferSchema", true)}}
> {{ .load("data/books.csv");}}
> {{df.show(7);}}
> {{df.printSchema();}}
> h1. In Spark v2.0.1
> {{Excerpt of the dataframe content:}}
> {{+---+++---++}}
> {{| id|authorId| title|releaseDate| link|}}
> {{+---+++---++}}
> {{| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|}}
> {{| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|}}
> {{| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|}}
> {{| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|}}
> {{| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|}}
> {{| 6| 2|Development Tools...| 12/28/16|http://amzn.to/2v...|}}
> {{| 7| 3|Adventures of Huc...| 5/26/94|http://amzn.to/2w...|}}
> {{+---+++---++}}
> {{only showing top 7 rows}}
> {{Dataframe's schema:}}
> {{root}}
> {{ |-- id: integer (nullable = true)}}
> {{ |-- authorId: integer (nullable = true)}}
> {{ |-- title: string (nullable = true)}}
> {{ |-- releaseDate: string (nullable = true)}}
> {{ |-- link: string (nullable = true)}}
> *This is fine and the expected output*.
> h1. Using Apache Spark v2.1.3
> Excerpt of the dataframe content:
> {{++++---++}}
> {{ | id|authorId| title|releaseDate| link|}}
> {{ 
> ++++---++}}
> {{ | 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|}}
> {{ | 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|}}
> {{ | 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|}}
> {{ | 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|}}
> 

[jira] [Updated] (SPARK-26972) Issue with CSV import and inferSchema set to true

2019-02-22 Thread Jean Georges Perrin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean Georges Perrin updated SPARK-26972:

Attachment: ComplexCsvToDataframeWithSchemaApp.java

> Issue with CSV import and inferSchema set to true
> -
>
> Key: SPARK-26972
> URL: https://issues.apache.org/jira/browse/SPARK-26972
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.3, 2.3.3, 2.4.0
> Environment: Java 8/Scala 2.11/MacOs
>Reporter: Jean Georges Perrin
>Priority: Major
> Attachments: ComplexCsvToDataframeApp.java, 
> ComplexCsvToDataframeWithSchemaApp.java, issue.txt
>
>
>  
>  
> Issue with CSV import and inferSchema set to true.
> I found a few discrepancies while working with inferSchema set to true in CSV 
> ingestion.
> Given the following CSV:
> {{id;authorId;title;releaseDate;link}}
> {{1;1;Fantastic Beasts and Where to Find Them: The Original 
> Screenplay;11/18/16;http://amzn.to/2kup94P}}
> {{2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry 
> Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP}}
> {{3;1;*The Tales of Beedle the Bard, Standard Edition (Harry 
> Potter)*;12/4/08;http://amzn.to/2kYezqr}}
> {{4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition 
> (Harry Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n}}
> {{5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the 
> Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT}}
> {{6;2;*Development Tools in 2006: any Room for a 4GL-style Language? }}
> {{An independent study by Jean Georges Perrin, IIUG Board 
> Member*;12/28/16;http://amzn.to/2vBxOe1}}
> {{7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav}}
> {{8;3;A Connecticut Yankee in King Arthur's 
> Court;6/17/17;http://amzn.to/2x1NuoD}}
> {{10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA}}
> {{11;4;Diderot Encyclopedia: The Complete Illustrations 
> 1762-1777;;http://amzn.to/2i2zo3I}}
> {{12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ}}
> {{13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW}}
> {{14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk}}
> {{15;7;Soft Skills: The software developer's life 
> manual;12/29/14;http://amzn.to/2zNnSyn}}
> {{16;8;Of Mice and Men;;http://amzn.to/2zJjXoc}}
> {{17;9;*Java 8 in Action: Lambdas; Streams; and functional-style 
> programming*;8/28/14;http://amzn.to/2isdqoL}}
> {{18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY}}
> {{19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG}}
> {{20;14;*Fables choisies; mises en vers par M. de La 
> Fontaine*;9/1/1999;http://amzn.to/2yRH10W}}
> {{21;15;Discourse on Method and Meditations on First 
> Philosophy;6/15/1999;http://amzn.to/2hwB8zc}}
> {{22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo}}
> {{23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo}}
> And this code:
> {{Dataset df = spark.read().format("csv")}}
> {{ .option("header", "true")}}
> {{ .option("multiline", true)}}
> {{ .option("sep", ";")}}
> {{ .option("quote", "*")}}
> {{ .option("dateFormat", "M/d/y")}}
> {{ .option("inferSchema", true)}}
> {{ .load("data/books.csv");}}
> {{df.show(7);}}
> {{df.printSchema();}}
> h1. In Spark v2.0.1
> {{Excerpt of the dataframe content:}}
> {{+---+++---++}}
> {{| id|authorId| title|releaseDate| link|}}
> {{+---+++---++}}
> {{| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|}}
> {{| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|}}
> {{| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|}}
> {{| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|}}
> {{| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|}}
> {{| 6| 2|Development Tools...| 12/28/16|http://amzn.to/2v...|}}
> {{| 7| 3|Adventures of Huc...| 5/26/94|http://amzn.to/2w...|}}
> {{+---+++---++}}
> {{only showing top 7 rows}}
> {{Dataframe's schema:}}
> {{root}}
> {{ |-- id: integer (nullable = true)}}
> {{ |-- authorId: integer (nullable = true)}}
> {{ |-- title: string (nullable = true)}}
> {{ |-- releaseDate: string (nullable = true)}}
> {{ |-- link: string (nullable = true)}}
> *This is fine and the expected output*.
> h1. Using Apache Spark v2.1.3
> Excerpt of the dataframe content:
> {{++++---++}}
> {{ | id|authorId| title|releaseDate| link|}}
> {{ 
> ++++---++}}
> {{ | 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|}}
> {{ | 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|}}
> {{ | 3| 1|The Tales of Beed

[jira] [Commented] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple ti

2019-02-22 Thread Parth Gandhi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775188#comment-16775188
 ] 

Parth Gandhi commented on SPARK-25250:
--

[~Ngone51] I understand that you had a proposal and that we were actively 
discussing various solutions in PR #22806. However, I have been working on that 
PR tirelessly for a few months and the discussion there is still ongoing. Is 
there a specific reason why you created your own PR for the same issue? WDYT 
[~irashid] [~cloud_fan] ?

> Race condition with tasks running when new attempt for same stage is created 
> leads to other task in the next attempt running on the same partition id 
> retry multiple times
> --
>
> Key: SPARK-25250
> URL: https://issues.apache.org/jira/browse/SPARK-25250
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.1
>Reporter: Parth Gandhi
>Priority: Major
>
> We recently had a scenario where a race condition occurred when a task from 
> previous stage attempt just finished before new attempt for the same stage 
> was created due to fetch failure, so the new task created in the second 
> attempt on the same partition id was retrying multiple times due to 
> TaskCommitDenied Exception without realizing that the task in earlier attempt 
> was already successful.  
> For example, consider a task with partition id 9000 and index 9000 running in 
> stage 4.0. We see a fetch failure, so we spawn a new stage attempt 4.1. 
> Just within this timespan, the above task completes successfully, thus, 
> marking the partition id 9000 as complete for 4.0. However, as stage 4.1 has 
> not yet been created, the taskset info for that stage is not available to the 
> TaskScheduler so, naturally, the partition id 9000 has not been marked 
> completed for 4.1. Stage 4.1 now spawns task with index 2000 on the same 
> partition id 9000. This task fails due to CommitDeniedException and, since it 
> does not see the corresponding partition id as having been marked successful, it 
> keeps retrying multiple times until the job finally succeeds. It doesn't 
> cause any job failures because the DAG scheduler is tracking the partitions 
> separate from the task set managers.
>  
> Steps to Reproduce:
>  # Run any large job involving shuffle operation.
>  # When the ShuffleMap stage finishes and the ResultStage begins running, 
> cause this stage to throw a fetch failure exception(Try deleting certain 
> shuffle files on any host).
>  # Observe the task attempt numbers for the next stage attempt. Please note 
> that this issue is an intermittent one, so it might not happen all the time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26971) How to read delimiter (Cedilla) in spark RDD and Dataframes

2019-02-22 Thread Babu (JIRA)
Babu created SPARK-26971:


 Summary: How to read delimiter (Cedilla) in spark RDD and 
Dataframes
 Key: SPARK-26971
 URL: https://issues.apache.org/jira/browse/SPARK-26971
 Project: Spark
  Issue Type: Question
  Components: PySpark
Affects Versions: 1.6.0
Reporter: Babu


 

I am trying to read a cedilla-delimited HDFS text file. I am getting the error 
below; did anyone face a similar issue?

{{hadoop fs -cat test_file.dat }}

{{1ÇCelvelandÇOhio 2ÇDurhamÇNC 3ÇDallasÇTexas }}

{{>>> rdd = sc.textFile("test_file.dat") }}

{{>>> rdd.collect()}}
{{[u'1\xc7Celveland\xc7Ohio', u'2\xc7Durham\xc7NC', u'3Dallas\xc7Texas']}}

{{>>> rdd.map(lambda p: p.split("\xc7")).collect()}}
{{UnicodeDecodeError: 'ascii' codec can't decode byte 0xc7 in position 0: ordinal not in range(128)}}

{{>>> 
sqlContext.read.format("text").option("delimiter","Ç").option("encoding","ISO-8859").load("/user/cloudera/test_file.dat").show()
 }}
|1ÇCelvelandÇOhio|

{{2ÇDurhamÇNC}}

{{ 3DallasÇTexas}}
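One likely cause of the {{UnicodeDecodeError}} above: in Python 2, splitting the unicode lines returned by {{sc.textFile}} with a byte-string separator forces an ASCII decode of that separator. A minimal sketch of the split done with a unicode delimiter instead (the file name is taken from the report; the SparkContext setup is an assumption standing in for the pyspark shell's {{sc}}):

{code:python}
# -*- coding: utf-8 -*-
from pyspark import SparkContext

sc = SparkContext(appName="cedilla-split")

# Split on a unicode cedilla (u"\xc7" == u"Ç") so Python 2 does not try to
# ASCII-decode a byte-string separator against the unicode lines.
rdd = sc.textFile("test_file.dat")
print(rdd.map(lambda line: line.split(u"\xc7")).collect())
{code}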

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26945) Python streaming tests flaky while cleaning temp directories after StreamingQuery.stop

2019-02-22 Thread Alessandro Bellina (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775125#comment-16775125
 ] 

Alessandro Bellina commented on SPARK-26945:


[~hyukjin.kwon] thanks for taking a look. Seems like q.processAllAvailable is 
designed for this use case.
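For reference, a rough sketch of that pattern in a test: drain the query with {{processAllAvailable()}} before stopping it, so no micro-batch is still writing into the temp directory when {{shutil.rmtree}} runs. The source path and sink format below are illustrative assumptions, not the actual test code.

{code:python}
import shutil
import tempfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("drain-before-stop").getOrCreate()

out_path = tempfile.mkdtemp()
chk_path = tempfile.mkdtemp()

# Illustrative source: a static directory of text files (path is an assumption).
lines = spark.readStream.format("text").load("python/test_support/sql/streaming")

q = (lines.writeStream
          .format("parquet")
          .option("checkpointLocation", chk_path)
          .start(out_path))
try:
    q.processAllAvailable()  # block until all currently available input is processed
finally:
    q.stop()

# Only now is it safe to clean up: no micro-batch is still writing into out_path.
shutil.rmtree(out_path)
shutil.rmtree(chk_path)
{code}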

> Python streaming tests flaky while cleaning temp directories after 
> StreamingQuery.stop
> --
>
> Key: SPARK-26945
> URL: https://issues.apache.org/jira/browse/SPARK-26945
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Alessandro Bellina
>Priority: Minor
>
> From the test code, it seems like the `shutil.rmtree` function is trying to 
> delete a directory, but there's likely another thread adding entries to a 
> directory, so when it gets to `os.rmdir(path)` it blows up. I think the test 
> (and other streaming tests) should call `q.awaitTermination` after `q.stop`, 
> before going on. I'll file a separate jira.
> {noformat}
> ERROR: test_query_manager_await_termination 
> (pyspark.sql.tests.test_streaming.StreamingTests)
> --
> Traceback (most recent call last):
>  File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests/test_streaming.py",
>  line 259, in test_query_manager_await_termination
>  shutil.rmtree(tmpPath)
>  File "/home/anaconda/lib/python2.7/shutil.py", line 256, in rmtree
>  onerror(os.rmdir, path, sys.exc_info())
>  File "/home/anaconda/lib/python2.7/shutil.py", line 254, in rmtree
>  os.rmdir(path)
> OSError: [Errno 39] Directory not empty: 
> '/home/jenkins/workspace/SparkPullRequestBuilder/python/target/072153bd-f981-47be-bda2-e2b657a16f65/tmp4WGp7n'{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins

2019-02-22 Thread Alessandro Bellina (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775117#comment-16775117
 ] 

Alessandro Bellina edited comment on SPARK-26944 at 2/22/19 1:07 PM:
-

Hmm, I have a subsequent build from the same PR, and I don't see a link to the 
python tests either. Maybe I am looking in the wrong place? 

[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102590/artifact/]


was (Author: abellina):
Hmm, I have a subsequent build from the same PR, and I don't see a link to the 
python tests either. Maybe I am looking in the wrong place?

 

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102590/artifact/

> Python unit-tests.log not available in artifacts for a build in Jenkins
> ---
>
> Key: SPARK-26944
> URL: https://issues.apache.org/jira/browse/SPARK-26944
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Alessandro Bellina
>Priority: Minor
>
> I had a pr where the python unit tests failed.  The tests point at the 
> `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, 
> but I can't get to that from jenkins UI it seems (are all prs writing to the 
> same file?).
> {code:java}
> 
> Running PySpark tests
> 
> Running PySpark tests. Output is in 
> /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log{code}
> For reference, please see this build: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console
> This Jira is to make it available under the artifacts for each build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins

2019-02-22 Thread Alessandro Bellina (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775117#comment-16775117
 ] 

Alessandro Bellina commented on SPARK-26944:


Hmm, I have a subsequent build from the same PR, and I don't see a link to the 
python tests either. Maybe I am looking in the wrong place?

 

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102590/artifact/

> Python unit-tests.log not available in artifacts for a build in Jenkins
> ---
>
> Key: SPARK-26944
> URL: https://issues.apache.org/jira/browse/SPARK-26944
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Alessandro Bellina
>Priority: Minor
>
> I had a pr where the python unit tests failed.  The tests point at the 
> `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, 
> but I can't get to that from jenkins UI it seems (are all prs writing to the 
> same file?).
> {code:java}
> 
> Running PySpark tests
> 
> Running PySpark tests. Output is in 
> /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log{code}
> For reference, please see this build: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console
> This Jira is to make it available under the artifacts for each build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer

2019-02-22 Thread Andrew Crosby (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Crosby updated SPARK-26970:
--
Description: 
The Interaction transformer 
[https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala]
 is missing from the set of pyspark feature transformers 
[https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py]
 

This means that it is impossible to create a model that includes an Interaction 
transformer with pyspark. It also means that attempting to load a PipelineModel 
created in Scala that includes an Interaction transformer with pyspark fails 
with the following error:
{code:java}
AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
{code}

  was:
The Interaction transformer 
[https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)]
 is missing from the set of pyspark feature transformers 
[https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)]
 

This means that it is impossible to create a model that includes an Interaction 
transformer with pyspark. It also means that attempting to load a PipelineModel 
created in Scala that includes an Interaction transformer with pyspark fails 
with the following error:
{code:java}
AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
{code}


> Can't load PipelineModel that was created in Scala with Python due to missing 
> Interaction transformer
> -
>
> Key: SPARK-26970
> URL: https://issues.apache.org/jira/browse/SPARK-26970
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Andrew Crosby
>Priority: Major
>
> The Interaction transformer 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala]
>  is missing from the set of pyspark feature transformers 
> [https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py]
>  
> This means that it is impossible to create a model that includes an 
> Interaction transformer with pyspark. It also means that attempting to load a 
> PipelineModel created in Scala that includes an Interaction transformer with 
> pyspark fails with the following error:
> {code:java}
> AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer

2019-02-22 Thread Andrew Crosby (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Crosby updated SPARK-26970:
--
Description: 
The Interaction transformer 
[https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)]
 is missing from the set of pyspark feature transformers 
[https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)]
 

This means that it is impossible to create a model that includes an Interaction 
transformer with pyspark. It also means that attempting to load a PipelineModel 
created in Scala that includes an Interaction transformer with pyspark fails 
with the following error:
{code:java}
AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
{code}

  was:
The Interaction transformer ( 
[https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)]
 ) is missing from the set of pyspark feature transformers ( 
[https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)]
 ).

 

This means that it is impossible to create a model that includes an Interaction 
transformer with pyspark. It also means that attempting to load a PipelineModel 
created in Scala that includes an Interaction transformer with pyspark fails 
with the following error:
{code:java}
AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
{code}


> Can't load PipelineModel that was created in Scala with Python due to missing 
> Interaction transformer
> -
>
> Key: SPARK-26970
> URL: https://issues.apache.org/jira/browse/SPARK-26970
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Andrew Crosby
>Priority: Major
>
> The Interaction transformer 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)]
>  is missing from the set of pyspark feature transformers 
> [https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)]
>  
> This means that it is impossible to create a model that includes an 
> Interaction transformer with pyspark. It also means that attempting to load a 
> PipelineModel created in Scala that includes an Interaction transformer with 
> pyspark fails with the following error:
> {code:java}
> AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer

2019-02-22 Thread Andrew Crosby (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Crosby updated SPARK-26970:
--
Description: 
The Interaction transformer ( 
[https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)]
 ) is missing from the set of pyspark feature transformers ( 
[https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)]
 ).

 

This means that it is impossible to create a model that includes an Interaction 
transformer with pyspark. It also means that attempting to load a PipelineModel 
created in Scala that includes an Interaction transformer with pyspark fails 
with the following error:
{code:java}
AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
{code}

  was:
The Interaction transformer 
([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)]
 is missing from the set of pyspark feature transformers 
([https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)].

 

This means that it is impossible to create a model that includes an Interaction 
transformer with pyspark. It also means that attempting to load a PipelineModel 
created in Scala that includes an Interaction transformer with pyspark fails 
with the following error:
{code:java}
AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
{code}


> Can't load PipelineModel that was created in Scala with Python due to missing 
> Interaction transformer
> -
>
> Key: SPARK-26970
> URL: https://issues.apache.org/jira/browse/SPARK-26970
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Andrew Crosby
>Priority: Major
>
> The Interaction transformer ( 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)]
>  ) is missing from the set of pyspark feature transformers ( 
> [https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)]
>  ).
>  
> This means that it is impossible to create a model that includes an 
> Interaction transformer with pyspark. It also means that attempting to load a 
> PipelineModel created in Scala that includes an Interaction transformer with 
> pyspark fails with the following error:
> {code:java}
> AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer

2019-02-22 Thread Andrew Crosby (JIRA)
Andrew Crosby created SPARK-26970:
-

 Summary: Can't load PipelineModel that was created in Scala with 
Python due to missing Interaction transformer
 Key: SPARK-26970
 URL: https://issues.apache.org/jira/browse/SPARK-26970
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Affects Versions: 2.4.0
Reporter: Andrew Crosby


The Interaction transformer 
([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)]
 is missing from the set of pyspark feature transformers 
([https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)].

 

This means that it is impossible to create a model that includes an Interaction 
transformer with pyspark. It also means that attempting to load a PipelineModel 
created in Scala that includes an Interaction transformer with pyspark fails 
with the following error:
{code:java}
AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
{code}
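A quick way to confirm the missing wrapper on an affected installation (a minimal sketch, assuming Spark 2.4.0 as reported):

{code:python}
import pyspark.ml.feature as features

# On Spark 2.4.0 this prints False: the Python package has no Interaction wrapper,
# which is why loading a Scala-built PipelineModel containing an Interaction stage
# raises the AttributeError shown above.
print(hasattr(features, "Interaction"))
{code}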



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26969) [Spark] Using ODBC not able to see the data in table when datatype is decimal

2019-02-22 Thread Sujith (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774994#comment-16774994
 ] 

Sujith commented on SPARK-26969:


I will further analyze the issue and raise a PR if required. Thanks.

> [Spark] Using ODBC not able to see the data in table when datatype is decimal
> -
>
> Key: SPARK-26969
> URL: https://issues.apache.org/jira/browse/SPARK-26969
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.4.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> # Install ODBC using the odbc rpm file 
>  # Connect to ODBC using isql -v spark2xsingle
>  # SQL> create table t1_t(id decimal(15,2));
>  # SQL> insert into t1_t values(15);
>  # SQL> select * from t1_t;
> +-+
> | id |
> +-+
> +-+
> Actual output is empty.
> Note: When creating table of int data type select is giving result as below
> SQL> create table test_t1(id int);
> SQL> insert into test_t1 values(10);
> SQL> select * from test_t1;
> ++
> | id |
> ++
> | 10 |
> ++
> The decimal case needs to be handled as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24417) Build and Run Spark on JDK11

2019-02-22 Thread M. Le Bihan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774954#comment-16774954
 ] 

M. Le Bihan edited comment on SPARK-24417 at 2/22/19 9:58 AM:
--

It is becoming really troublesome to see Java 12 arriving in a few weeks while 
_Spark_, which is otherwise an impressive piece of technology, is held to a JVM 
from 2014. I have three questions, please:

1) Which version of Spark will become compatible with Java 11? 2.4.1, 2.4.2 or 
3.0.0?

2) If Java 11 compatibility is postponed to Spark 3.0.0, when is Spark 3.0.0 
planned to be released?

3) Will Spark then become fully compatible with standard, classical, normal Java, 
or will it keep some kind of system programming that might keep it in 
jeopardy? In one word: will it suffer the same troubles when attempting to 
run with Java 12, 13, or 14?

 

Since the arrival of Java 9, then Java 11, and now at the door of Java 12, 18 
months have passed. Can we have a date when Java 11 (and Java 12) compatibility 
will be available, please?

 


was (Author: mlebihan):
It becomes really troublesome to see Java 12 coming in few weeks while _Spark_ 
that is somewhat an impressive development in term of technology is hold on a 
JVM of year 2014. I have three questions, please :

1) What version of Spark will become compatible with Java 11 ? 2.4.1, 2.4.2 or 
3.0.0 ?

2) If Java 11 compatibility is postponed to Spark 3.0.0, when Spark 3.0.0 is 
planned to be released ?

3) Will Spark become fully compatible with standard, classical, normal Java 
then, or will it keep some kind of system programming that might keep him in 
jeopardy ? In one word : will he suffer the same troubles when attempting to 
run with Java 12, 13, 14 ?

 

Since the coming of Java 9, now Java 11, and at the door of Java 12, 18 months 
have passed. Can we have a date for Java 11 (and Java 12) compatibility will be 
available please ?

 

> Build and Run Spark on JDK11
> 
>
> Key: SPARK-24417
> URL: https://issues.apache.org/jira/browse/SPARK-24417
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> This is an umbrella JIRA for Apache Spark to support JDK11
> As JDK8 is reaching EOL, and JDK9 and 10 are already end of life, per 
> community discussion, we will skip JDK9 and 10 to support JDK 11 directly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple ti

2019-02-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774957#comment-16774957
 ] 

Apache Spark commented on SPARK-25250:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/23871

> Race condition with tasks running when new attempt for same stage is created 
> leads to other task in the next attempt running on the same partition id 
> retry multiple times
> --
>
> Key: SPARK-25250
> URL: https://issues.apache.org/jira/browse/SPARK-25250
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.1
>Reporter: Parth Gandhi
>Priority: Major
>
> We recently had a scenario where a race condition occurred when a task from 
> previous stage attempt just finished before new attempt for the same stage 
> was created due to fetch failure, so the new task created in the second 
> attempt on the same partition id was retrying multiple times due to 
> TaskCommitDenied Exception without realizing that the task in earlier attempt 
> was already successful.  
> For example, consider a task with partition id 9000 and index 9000 running in 
> stage 4.0. We see a fetch failure, so we spawn a new stage attempt 4.1. 
> Just within this timespan, the above task completes successfully, thus, 
> marking the partition id 9000 as complete for 4.0. However, as stage 4.1 has 
> not yet been created, the taskset info for that stage is not available to the 
> TaskScheduler so, naturally, the partition id 9000 has not been marked 
> completed for 4.1. Stage 4.1 now spawns task with index 2000 on the same 
> partition id 9000. This task fails due to CommitDeniedException and, since it 
> does not see the corresponding partition id as having been marked successful, it 
> keeps retrying multiple times until the job finally succeeds. It doesn't 
> cause any job failures because the DAG scheduler is tracking the partitions 
> separate from the task set managers.
>  
> Steps to Reproduce:
>  # Run any large job involving shuffle operation.
>  # When the ShuffleMap stage finishes and the ResultStage begins running, 
> cause this stage to throw a fetch failure exception(Try deleting certain 
> shuffle files on any host).
>  # Observe the task attempt numbers for the next stage attempt. Please note 
> that this issue is an intermittent one, so it might not happen all the time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24417) Build and Run Spark on JDK11

2019-02-22 Thread M. Le Bihan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774954#comment-16774954
 ] 

M. Le Bihan commented on SPARK-24417:
-

It is becoming really troublesome to see Java 12 arriving in a few weeks while 
_Spark_, which is otherwise an impressive piece of technology, is held to a JVM 
from 2014. I have three questions, please:

1) Which version of Spark will become compatible with Java 11? 2.4.1, 2.4.2 or 
3.0.0?

2) If Java 11 compatibility is postponed to Spark 3.0.0, when is Spark 3.0.0 
planned to be released?

3) Will Spark then become fully compatible with standard, classical, normal Java, 
or will it keep some kind of system programming that might keep it in 
jeopardy? In one word: will it suffer the same troubles when attempting to 
run with Java 12, 13, or 14?

 

Since the arrival of Java 9, then Java 11, and now at the door of Java 12, 18 
months have passed. Can we have a date when Java 11 (and Java 12) compatibility 
will be available, please?

 

> Build and Run Spark on JDK11
> 
>
> Key: SPARK-24417
> URL: https://issues.apache.org/jira/browse/SPARK-24417
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> This is an umbrella JIRA for Apache Spark to support JDK11
> As JDK8 is reaching EOL, and JDK9 and 10 are already end of life, per 
> community discussion, we will skip JDK9 and 10 to support JDK 11 directly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26969) [Spark] Using ODBC not able to see the data in table when datatype is decimal

2019-02-22 Thread ABHISHEK KUMAR GUPTA (JIRA)
ABHISHEK KUMAR GUPTA created SPARK-26969:


 Summary: [Spark] Using ODBC not able to see the data in table when 
datatype is decimal
 Key: SPARK-26969
 URL: https://issues.apache.org/jira/browse/SPARK-26969
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 2.4.0
Reporter: ABHISHEK KUMAR GUPTA


# Install ODBC using the odbc rpm file 
 # Connect to ODBC using isql -v spark2xsingle
 # SQL> create table t1_t(id decimal(15,2));
 # SQL> insert into t1_t values(15);
 # SQL> select * from t1_t;
+-+
| id |
+-+
+-+
Actual output is empty.

Note: When creating table of int data type select is giving result as below
SQL> create table test_t1(id int);
SQL> insert into test_t1 values(10);
SQL> select * from test_t1;
++
| id |
++
| 10 |
++

The decimal case needs to be handled as well.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26968) option("quoteMode", "NON_NUMERIC") have no effect on a CSV generation

2019-02-22 Thread M. Le Bihan (JIRA)
M. Le Bihan created SPARK-26968:
---

 Summary: option("quoteMode", "NON_NUMERIC") have no effect on a 
CSV generation
 Key: SPARK-26968
 URL: https://issues.apache.org/jira/browse/SPARK-26968
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: M. Le Bihan


I have a CSV file to write with this schema:
{code:java}
StructType s = schema.add("codeCommuneCR", StringType, false);
s = s.add("nomCommuneCR", StringType, false);
s = s.add("populationCR", IntegerType, false);
s = s.add("resultatComptable", IntegerType, false);{code}
If I don't provide a "_quoteMode_" option, or even if I set it to 
{{NON_NUMERIC}} like this:
{code:java}
ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") 
.option("quoteMode", "NON_NUMERIC").option("quote", "\"") 
.csv("./target/out_200071470.csv");{code}
the CSV written by {{Spark}} is:
{code:java}
codeCommuneCR,nomCommuneCR,populationCR,resultatComptable
03142,LENAX,267,43{code}
If I set the "_quoteAll_" option instead, like this:
{code:java}
ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") 
.option("quoteAll", true).option("quote", "\"") 
.csv("./target/out_200071470.csv");{code}
it generates :
{code:java}
"codeCommuneCR","nomCommuneCR","populationCR","resultatComptable" 
"03142","LENAX","267","43"{code}
It seems that the {{.option("quoteMode", "NON_NUMERIC")}} is broken. It should 
generate:

 
{code:java}
"codeCommuneCR","nomCommuneCR","populationCR","resultatComptable"
"03142","LENAX",267,43
{code}
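For anyone wanting to reproduce this quickly, here is a PySpark equivalent of the report (a sketch only; the single data row and the output paths are illustrative). Per the report, the built-in CSV writer honours {{quoteAll}} but ignores {{quoteMode}}.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-quotemode-repro").getOrCreate()

df = spark.createDataFrame(
    [("03142", "LENAX", 267, 43)],
    ["codeCommuneCR", "nomCommuneCR", "populationCR", "resultatComptable"])

# quoteMode is reported to have no effect: the output contains no quotes at all.
(df.coalesce(1).write.mode("overwrite")
   .option("header", "true")
   .option("quoteMode", "NON_NUMERIC")
   .option("quote", "\"")
   .csv("./target/out_quoteMode"))

# quoteAll works, but quotes the numeric columns as well.
(df.coalesce(1).write.mode("overwrite")
   .option("header", "true")
   .option("quoteAll", "true")
   .option("quote", "\"")
   .csv("./target/out_quoteAll"))
{code}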
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26945) Python streaming tests flaky while cleaning temp directories after StreamingQuery.stop

2019-02-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26945:


Assignee: Apache Spark

> Python streaming tests flaky while cleaning temp directories after 
> StreamingQuery.stop
> --
>
> Key: SPARK-26945
> URL: https://issues.apache.org/jira/browse/SPARK-26945
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Alessandro Bellina
>Assignee: Apache Spark
>Priority: Minor
>
> From the test code, it seems like the `shutil.rmtree` function is trying to 
> delete a directory, but there's likely another thread adding entries to a 
> directory, so when it gets to `os.rmdir(path)` it blows up. I think the test 
> (and other streaming tests) should call `q.awaitTermination` after `q.stop`, 
> before going on. I'll file a separate jira.
> {noformat}
> ERROR: test_query_manager_await_termination 
> (pyspark.sql.tests.test_streaming.StreamingTests)
> --
> Traceback (most recent call last):
>  File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests/test_streaming.py",
>  line 259, in test_query_manager_await_termination
>  shutil.rmtree(tmpPath)
>  File "/home/anaconda/lib/python2.7/shutil.py", line 256, in rmtree
>  onerror(os.rmdir, path, sys.exc_info())
>  File "/home/anaconda/lib/python2.7/shutil.py", line 254, in rmtree
>  os.rmdir(path)
> OSError: [Errno 39] Directory not empty: 
> '/home/jenkins/workspace/SparkPullRequestBuilder/python/target/072153bd-f981-47be-bda2-e2b657a16f65/tmp4WGp7n'{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26945) Python streaming tests flaky while cleaning temp directories after StreamingQuery.stop

2019-02-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26945:


Assignee: (was: Apache Spark)

> Python streaming tests flaky while cleaning temp directories after 
> StreamingQuery.stop
> --
>
> Key: SPARK-26945
> URL: https://issues.apache.org/jira/browse/SPARK-26945
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Alessandro Bellina
>Priority: Minor
>
> From the test code, it seems like the `shutil.rmtree` function is trying to 
> delete a directory, but there's likely another thread adding entries to a 
> directory, so when it gets to `os.rmdir(path)` it blows up. I think the test 
> (and other streaming tests) should call `q.awaitTermination` after `q.stop`, 
> before going on. I'll file a separate jira.
> {noformat}
> ERROR: test_query_manager_await_termination 
> (pyspark.sql.tests.test_streaming.StreamingTests)
> --
> Traceback (most recent call last):
>  File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests/test_streaming.py",
>  line 259, in test_query_manager_await_termination
>  shutil.rmtree(tmpPath)
>  File "/home/anaconda/lib/python2.7/shutil.py", line 256, in rmtree
>  onerror(os.rmdir, path, sys.exc_info())
>  File "/home/anaconda/lib/python2.7/shutil.py", line 254, in rmtree
>  os.rmdir(path)
> OSError: [Errno 39] Directory not empty: 
> '/home/jenkins/workspace/SparkPullRequestBuilder/python/target/072153bd-f981-47be-bda2-e2b657a16f65/tmp4WGp7n'{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26967) Put MetricsSystem instance names together for clearer management

2019-02-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26967:


Assignee: (was: Apache Spark)

> Put MetricsSystem instance names together for clearer management
> 
>
> Key: SPARK-26967
> URL: https://issues.apache.org/jira/browse/SPARK-26967
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: SongYadong
>Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> MetricsSystem instance creations are scattered across the project code, and so 
> are their names. This makes them inconvenient to browse and manage. 
> If we put them together, we get a single location for adding or removing them, 
> and an overall view of the MetricsSystem instances in the current project.
> It also helps keep the user documentation complete, since nothing gets 
> missed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26967) Put MetricsSystem instance names together for clearer management

2019-02-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26967:


Assignee: Apache Spark

> Put MetricsSystem instance names together for clearer management
> 
>
> Key: SPARK-26967
> URL: https://issues.apache.org/jira/browse/SPARK-26967
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: SongYadong
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> MetricsSystem instance creations are scattered across the project code, and so 
> are their names. This makes them inconvenient to browse and manage. 
> If we put them together, we get a single location for adding or removing them, 
> and an overall view of the MetricsSystem instances in the current project.
> It also helps keep the user documentation complete, since nothing gets 
> missed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26967) Put MetricsSystem instance names together for clearer management

2019-02-22 Thread SongYadong (JIRA)
SongYadong created SPARK-26967:
--

 Summary: Put MetricsSystem instance names together for clearer 
management
 Key: SPARK-26967
 URL: https://issues.apache.org/jira/browse/SPARK-26967
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: SongYadong


MetricsSystem instance creations are scattered across the project code, and so are 
their names. This makes them inconvenient to browse and manage. 
If we put them together, we get a single location for adding or removing them, and 
an overall view of the MetricsSystem instances in the current project.
It also helps keep the user documentation complete, since nothing gets missed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins

2019-02-22 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774892#comment-16774892
 ] 

Hyukjin Kwon commented on SPARK-26944:
--

Actually, you're usually able to see it; it's usually included in the build's 
artifacts, IIRC.

> Python unit-tests.log not available in artifacts for a build in Jenkins
> ---
>
> Key: SPARK-26944
> URL: https://issues.apache.org/jira/browse/SPARK-26944
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Alessandro Bellina
>Priority: Minor
>
> I had a pr where the python unit tests failed.  The tests point at the 
> `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, 
> but I can't get to that from jenkins UI it seems (are all prs writing to the 
> same file?).
> {code:java}
> 
> Running PySpark tests
> 
> Running PySpark tests. Output is in 
> /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log{code}
> For reference, please see this build: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console
> This Jira is to make it available under the artifacts for each build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org