[jira] [Commented] (SPARK-26945) Python streaming tests flaky while cleaning temp directories after StreamingQuery.stop
[ https://issues.apache.org/jira/browse/SPARK-26945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775815#comment-16775815 ] Hyukjin Kwon commented on SPARK-26945:
--
Thanks for reporting this, [~abellina]

> Python streaming tests flaky while cleaning temp directories after StreamingQuery.stop
> --
>
> Key: SPARK-26945
> URL: https://issues.apache.org/jira/browse/SPARK-26945
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.0
> Reporter: Alessandro Bellina
> Assignee: Hyukjin Kwon
> Priority: Minor
> Fix For: 3.0.0
>
> From the test code, it seems like the `shutil.rmtree` function is trying to
> delete a directory, but there's likely another thread adding entries to the
> directory, so when it gets to `os.rmdir(path)` it blows up. I think the test
> (and other streaming tests) should call `q.awaitTermination` after `q.stop`,
> before going on. I'll file a separate jira.
> {noformat}
> ERROR: test_query_manager_await_termination (pyspark.sql.tests.test_streaming.StreamingTests)
> --
> Traceback (most recent call last):
>   File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests/test_streaming.py", line 259, in test_query_manager_await_termination
>     shutil.rmtree(tmpPath)
>   File "/home/anaconda/lib/python2.7/shutil.py", line 256, in rmtree
>     onerror(os.rmdir, path, sys.exc_info())
>   File "/home/anaconda/lib/python2.7/shutil.py", line 254, in rmtree
>     os.rmdir(path)
> OSError: [Errno 39] Directory not empty: '/home/jenkins/workspace/SparkPullRequestBuilder/python/target/072153bd-f981-47be-bda2-e2b657a16f65/tmp4WGp7n'
> {noformat}
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26945) Python streaming tests flaky while cleaning temp directories after StreamingQuery.stop
[ https://issues.apache.org/jira/browse/SPARK-26945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26945.
--
Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23870
[https://github.com/apache/spark/pull/23870]

> Python streaming tests flaky while cleaning temp directories after StreamingQuery.stop
[jira] [Assigned] (SPARK-26945) Python streaming tests flaky while cleaning temp directories after StreamingQuery.stop
[ https://issues.apache.org/jira/browse/SPARK-26945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-26945:
--
Assignee: Hyukjin Kwon

> Python streaming tests flaky while cleaning temp directories after StreamingQuery.stop
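The race described in SPARK-26945 (a query thread still writing while the test deletes its temp directory) can be sketched outside Spark. `safe_rmtree` below is a hypothetical helper, not part of PySpark, showing a retrying cleanup; the comment at the end restates the fix the reporter actually suggests, draining the query before cleanup.

```python
import os
import shutil
import time

def safe_rmtree(path, attempts=5, delay=0.5):
    """Remove a directory tree, retrying when a concurrent writer
    briefly repopulates it (OSError 39, 'Directory not empty')."""
    for attempt in range(attempts):
        try:
            shutil.rmtree(path)
            return
        except OSError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)

# In the streaming tests themselves, the fix suggested in the issue is to
# drain the query before cleanup:
#   q.stop()
#   q.awaitTermination()  # let the query thread finish touching tmpPath
#   shutil.rmtree(tmpPath)
```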
[jira] [Commented] (SPARK-16183) Large Spark SQL commands cause StackOverflowError in parser when using sqlContext.sql
[ https://issues.apache.org/jira/browse/SPARK-16183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775788#comment-16775788 ] Matt Saunders commented on SPARK-16183:
---
It appears that this problem is still occurring as of Feb 2019 and Spark 2.4.0. As a workaround, you can use Dataset.checkpoint (or .localCheckpoint) to truncate the logical plan of the Dataset between transformations and avoid the stack overflow error.

> Large Spark SQL commands cause StackOverflowError in parser when using sqlContext.sql
> -
>
> Key: SPARK-16183
> URL: https://issues.apache.org/jira/browse/SPARK-16183
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 1.6.1, 2.0.0
> Environment: Running on AWS EMR
> Reporter: Matthew Porter
> Priority: Major
>
> Hi,
> I have created a PySpark SQL-based tool which auto-generates a complex SQL
> command to be run via sqlContext.sql(cmd) based on a large number of
> parameters. As the number of input files to be filtered and joined in this
> query grows, so does the length of the SQL query. The tool runs fine up until
> about 200+ files are included in the join, at which point the SQL command
> becomes very long (~100K characters). It is only on these longer queries that
> Spark fails, throwing an exception due to what seems to be too much recursion
> occurring within the SparkSQL parser:
> {code}
> Traceback (most recent call last):
> ...
>     merged_df = sqlsc.sql(cmd)
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 580, in sql
>   File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
>   File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o173.sql.
> : java.lang.StackOverflowError
> at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> ... (the same parser-combinator frames repeat until the stack overflows; trace truncated)
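The failure mode above is driven purely by query size, and the comment's workaround is to truncate the lineage between transformations. The sketch below is illustrative only: `build_union_query` is a hypothetical generator showing how per-file SQL branches make the query text grow linearly with the number of inputs, and the trailing comment restates the `localCheckpoint` workaround in PySpark terms.

```python
# Hypothetical generator: each input table adds another UNION ALL branch,
# so the SQL text grows linearly with the file count, eventually driving
# the parser into deep recursion.
def build_union_query(table_names):
    branches = ["SELECT * FROM {} WHERE flag = 1".format(t) for t in table_names]
    return "\nUNION ALL\n".join(branches)

query = build_union_query(["input_{:03d}".format(i) for i in range(250)])

# The workaround from the comment above, sketched in PySpark (not run here):
#   df = spark.sql(first_chunk_of_query)
#   df = df.localCheckpoint()  # truncates the accumulated logical plan
#   # ...keep joining/unioning against the checkpointed df in chunks...
```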
[jira] [Commented] (SPARK-26977) Warn against subclassing scala.App doesn't work
[ https://issues.apache.org/jira/browse/SPARK-26977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775783#comment-16775783 ] Manu Zhang commented on SPARK-26977:
--
I'd love to

> Warn against subclassing scala.App doesn't work
> ---
>
> Key: SPARK-26977
> URL: https://issues.apache.org/jira/browse/SPARK-26977
> Project: Spark
> Issue Type: Bug
> Components: Spark Submit
> Affects Versions: 2.4.0
> Reporter: Manu Zhang
> Priority: Minor
>
> As per the discussion in [PR#3497|https://github.com/apache/spark/pull/3497#discussion_r258412735],
> the warning against subclassing scala.App doesn't work. For example,
> {code:scala}
> object Test extends scala.App {
>   // spark code
> }
> {code}
> Scala compiles {{object Test}} into two Java classes: {{Test}}, which the user
> passes in, and {{Test$}}, which subclasses {{scala.App}}. The current code checks
> against {{Test}}, so no warning is emitted when a user's application subclasses
> {{scala.App}}.
[jira] [Resolved] (SPARK-25574) Add an option `keepQuotes` for parsing csv file
[ https://issues.apache.org/jira/browse/SPARK-25574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian resolved SPARK-25574.
-
Resolution: Invalid

> Add an option `keepQuotes` for parsing csv file
> 
>
> Key: SPARK-25574
> URL: https://issues.apache.org/jira/browse/SPARK-25574
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: liuxian
> Priority: Minor
>
> In our project, when we read a CSV file, we want to keep the quotes.
> For example, given this record in the CSV file:
> *ab,cc,,"c,ddd"*
> we want it displayed like this:
> |_c0|_c1|_c2|_c3|
> |ab|cc|null|*"c,ddd"*|
>
> not like this:
> |_c0|_c1|_c2|_c3|
> |ab|cc|null|c,ddd|
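The behavior requested above (split on commas while respecting quoting, but keep the quote characters on fields that needed them) can be illustrated in plain Python; `parse_keeping_quotes` is a hypothetical helper, not a Spark option.

```python
import csv
import io

def parse_keeping_quotes(line, delimiter=","):
    # Parse the line normally, then re-wrap any field that contained the
    # delimiter so its original quotes are preserved. Empty fields become
    # None, mirroring the null in the desired output table.
    row = next(csv.reader(io.StringIO(line), delimiter=delimiter))
    return ['"{}"'.format(f) if delimiter in f else (f or None) for f in row]

print(parse_keeping_quotes('ab,cc,,"c,ddd"'))
# -> ['ab', 'cc', None, '"c,ddd"']
```

This simplification only restores quotes on fields that contain the delimiter; a quoted field without a comma would still lose its quotes.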
[jira] [Commented] (SPARK-26977) Warn against subclassing scala.App doesn't work
[ https://issues.apache.org/jira/browse/SPARK-26977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775776#comment-16775776 ] Sean Owen commented on SPARK-26977:
---
Sure, would you like to make a pull request?

> Warn against subclassing scala.App doesn't work
[jira] [Comment Edited] (SPARK-26809) insert overwrite directory + concat function => error
[ https://issues.apache.org/jira/browse/SPARK-26809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775772#comment-16775772 ] Alessandro Bellina edited comment on SPARK-26809 at 2/23/19 3:25 AM: - This does it. Didn't need the limit to reproduce: {noformat} insert overwrite directory '/tmp/SPARK-26809' select concat(col1, col2) from ((select "foo" as col1, "bar" as col2)); {noformat} This also triggers it: {noformat} insert overwrite directory '/tmp/SPARK-26809' select concat("foo", "bar") {noformat} {noformat} Caused by: org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements! at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145) at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125) at org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:121) at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:104) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:124) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:109) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$14(FileFormatWriter.scala:177) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:426) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1319) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:429) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} was (Author: abellina): This does it. Didn't need the limit to reproduce: {noformat} insert overwrite directory '/tmp/SPARK-26809' select concat(col1, col2) {noformat} This also triggers it: {noformat} insert overwrite directory '/tmp/SPARK-26809' select concat(col1, col2) from ((select "foo" as col1, "bar" as col2)); {noformat} {noformat} Caused by: org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements! at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145) at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125) at org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:121) at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:104) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:124) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:109) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$14(FileFormatWriter.scala:177) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:426) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1319) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:429) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} > insert overwrite directory + concat function => error > - > > Key: SPARK-26809 > URL: https://issues.apache.org/jira/browse/SPARK-26809 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: ant_nebula >Priority: Critical > > insert
[jira] [Created] (SPARK-26977) Warn against subclassing scala.App doesn't work
Manu Zhang created SPARK-26977:
--
Summary: Warn against subclassing scala.App doesn't work
Key: SPARK-26977
URL: https://issues.apache.org/jira/browse/SPARK-26977
Project: Spark
Issue Type: Bug
Components: Spark Submit
Affects Versions: 2.4.0
Reporter: Manu Zhang

As per the discussion in [PR#3497|https://github.com/apache/spark/pull/3497#discussion_r258412735], the warning against subclassing scala.App doesn't work. For example,
{code:scala}
object Test extends scala.App {
  // spark code
}
{code}
Scala compiles {{object Test}} into two Java classes: {{Test}}, which the user passes in, and {{Test$}}, which subclasses {{scala.App}}. The current code checks against {{Test}}, so no warning is emitted when a user's application subclasses {{scala.App}}.
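The compilation detail above is the whole bug, and can be modeled outside the JVM. The sketch below is a hypothetical Python model of the check, not Spark's actual SparkSubmit code: since only `Test$` has scala.App as a parent, a check that inspects the user-supplied name alone never fires, while one that also inspects `name + "$"` does.

```python
# Hypothetical model: `parents` maps a class name to its relevant parent,
# standing in for what JVM classloader reflection would report.
def extends_scala_app(parents, main_class):
    # Check both the user-supplied class and its compiled companion,
    # since Scala puts the scala.App inheritance on the `$` class.
    return any(parents.get(name) == "scala.App"
               for name in (main_class, main_class + "$"))

# `object Test extends scala.App` compiles to Test plus Test$.
compiled = {"Test": "java.lang.Object", "Test$": "scala.App"}
print(extends_scala_app(compiled, "Test"))
# -> True (the companion class is the one that matters)
```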
[jira] [Commented] (SPARK-26809) insert overwrite directory + concat function => error
[ https://issues.apache.org/jira/browse/SPARK-26809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775772#comment-16775772 ] Alessandro Bellina commented on SPARK-26809: This does it. Didn't need the limit to reproduce: {noformat} insert overwrite directory '/tmp/SPARK-26809' select concat(col1, col2) from ((select "foo" as col1, "bar" as col2)); {noformat} {noformat} Caused by: org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements! at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145) at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125) at org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:121) at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:104) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:124) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:109) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$14(FileFormatWriter.scala:177) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:426) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1319) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:429) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} > insert overwrite directory + concat function => error > - > > Key: SPARK-26809 > URL: https://issues.apache.org/jira/browse/SPARK-26809 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: ant_nebula >Priority: Critical > > insert overwrite directory '/tmp/xx' > select concat(col1, col2) > from tableXX > limit 3 > > Caused by: org.apache.hadoop.hive.serde2.SerDeException: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 3 elements > while columns.types has 2 elements! > at > org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145) > at > org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85) > at > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125) > at > org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:119) > at > org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103) > at > org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120) > at > org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:108) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:233) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
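The SerDe failure above comes from LazySimpleSerDe's column check. A plausible reading of the error, hedged and sketched in Python rather than Hive's Java: the auto-generated column name `concat(col1, col2)` itself contains a comma, so when the serde splits the comma-separated `columns` table property it sees one more name than there are entries in `columns.types`.

```python
# Illustrative Python sketch of the check in LazySimpleSerDe's
# extractColumnInfo (NOT the actual Hive code): column names are split on
# ',' and column types on ':', and the two lists must have equal length.
def extract_column_info(properties):
    names = properties.get("columns", "").split(",")
    types = properties.get("columns.types", "").split(":")
    if len(names) != len(types):
        raise ValueError(
            f"columns has {len(names)} elements while "
            f"columns.types has {len(types)} elements!")
    return list(zip(names, types))

# An unaliased concat yields the column name "concat(col1, col2)", whose
# embedded comma splits into two names against a single type entry:
bad = {"columns": "concat(col1, col2)", "columns.types": "string"}
try:
    extract_column_info(bad)
except ValueError as e:
    print(e)  # columns has 2 elements while columns.types has 1 elements!

# An explicit alias keeps the name comma-free:
good = {"columns": "c", "columns.types": "string"}
print(extract_column_info(good))  # [('c', 'string')]
```

If that reading is right, `select concat(col1, col2) as c` would sidestep the error; that is an inference from the error message, not something confirmed on the ticket.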
[jira] [Assigned] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-24615: - Assignee: Xingbo Jiang > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Xingbo Jiang >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator cards (GPU, FPGA, TPU) are > predominant compared to CPUs. To make the current Spark architecture work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule tasks onto the executors that are > equipped with accelerators. > Spark's current scheduler schedules tasks based on the locality of the data > plus the availability of CPUs. This introduces some problems when scheduling > tasks that require accelerators. > # CPU cores usually outnumber accelerators on one node, so using CPU cores > to schedule accelerator-required tasks introduces a mismatch. > # In one cluster, we always assume that every node has CPUs, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires the scheduler to schedule tasks in a smarter way. > So we propose to improve the current scheduler to support heterogeneous > tasks (accelerator required or not). This can be part of the work of Project > Hydrogen. > Details are attached in a Google doc. It doesn't cover all the implementation > details, just highlights the parts that should be changed. > > CC [~yanboliang] [~merlintang] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
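One way the combined CPU-plus-accelerator matching described in the proposal could look, as a toy sketch. All names here are hypothetical; the actual design lives in the Google doc the ticket references, not in this snippet.

```python
# Toy model of resource-aware task matching: a task is only offered to an
# executor that can satisfy BOTH its CPU and its accelerator demand.
# Names are illustrative, not Spark's actual scheduler API.
from dataclasses import dataclass

@dataclass
class ExecutorOffer:
    host: str
    free_cpus: int
    free_gpus: int

@dataclass
class TaskRequest:
    cpus: int = 1
    gpus: int = 0

def can_schedule(task: TaskRequest, offer: ExecutorOffer) -> bool:
    # CPU-only matching (the status quo the ticket describes) would ignore
    # free_gpus and could place GPU tasks on GPU-less nodes.
    return offer.free_cpus >= task.cpus and offer.free_gpus >= task.gpus

offers = [ExecutorOffer("node-a", 8, 0), ExecutorOffer("node-b", 4, 2)]
gpu_task = TaskRequest(cpus=1, gpus=1)
eligible = [o.host for o in offers if can_schedule(gpu_task, o)]
print(eligible)  # ['node-b']
```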
[jira] [Updated] (SPARK-26939) Fix some outdated comments about task schedulers
[ https://issues.apache.org/jira/browse/SPARK-26939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenxiao Mao updated SPARK-26939: - Description: Some comments about task schedulers are outdated. They should be fixed. * YarnClusterScheduler comments: reference to ClusterScheduler which is not used anymore. * TaskSetManager comments: method statusUpdate does not exist as of now. was: Some comments about task schedulers are outdated. They should be fixed. * TaskScheduler comments: currently implemented exclusively by org.apache.spark.scheduler.TaskSchedulerImpl. This is not true as of now. * YarnClusterScheduler comments: reference to ClusterScheduler which is not used anymore. * TaskSetManager comments: method statusUpdate does not exist as of now. > Fix some outdated comments about task schedulers > > > Key: SPARK-26939 > URL: https://issues.apache.org/jira/browse/SPARK-26939 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Priority: Minor > > Some comments about task schedulers are outdated. They should be fixed. > * YarnClusterScheduler comments: reference to ClusterScheduler which is not > used anymore. > * TaskSetManager comments: method statusUpdate does not exist as of now. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26975) Support nested-column pruning over limit/sample/repartition
[ https://issues.apache.org/jira/browse/SPARK-26975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-26975: - Assignee: Dongjoon Hyun > Support nested-column pruning over limit/sample/repartition > --- > > Key: SPARK-26975 > URL: https://issues.apache.org/jira/browse/SPARK-26975 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > As SPARK-26958 shows the benchmark, nested-column pruning has limitations. > This issue aims to remove the limitations on `limit/repartition/sample`. In > this issue, repartition means `Repartition`, not `RepartitionByExpression`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26215) define reserved keywords after SQL standard
[ https://issues.apache.org/jira/browse/SPARK-26215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-26215. -- Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 3.0.0 Resolved by [https://github.com/apache/spark/pull/23259] > define reserved keywords after SQL standard > --- > > Key: SPARK-26215 > URL: https://issues.apache.org/jira/browse/SPARK-26215 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.0 > > > There are 2 kinds of SQL keywords: reserved and non-reserved. Reserved > keywords can't be used as identifiers. > In Spark SQL, we are too tolerant about non-reserved keywords. A lot of > keywords are non-reserved and sometimes this causes ambiguity (IIRC we hit a > problem when improving the INTERVAL syntax). > I think it will be better to just follow other databases or the SQL standard to > define reserved keywords, so that we don't need to think very hard about how > to avoid ambiguity. > For reference: https://www.postgresql.org/docs/8.1/sql-keywords-appendix.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26976) Forbid reserved keywords as identifiers when ANSI mode is on
Takeshi Yamamuro created SPARK-26976: Summary: Forbid reserved keywords as identifiers when ANSI mode is on Key: SPARK-26976 URL: https://issues.apache.org/jira/browse/SPARK-26976 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0 Reporter: Takeshi Yamamuro Assignee: Takeshi Yamamuro We need to throw an exception to forbid reserved keywords as identifiers when ANSI mode is on. This is a follow-up of SPARK-26215. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
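The behavior SPARK-26976 asks for can be sketched in a few lines. This is a toy illustration of the rule (reject reserved words as identifiers only when ANSI mode is on), not Spark's parser internals, and the keyword set here is a tiny made-up sample.

```python
# Illustrative reserved-keyword check: under ANSI mode a reserved word may
# not be used as a plain identifier; legacy mode tolerates it.
# RESERVED is a small sample set, not Spark's actual keyword list.
RESERVED = {"select", "from", "where", "table", "order"}

def validate_identifier(name: str, ansi_mode: bool) -> str:
    if ansi_mode and name.lower() in RESERVED:
        raise ValueError(f"Cannot use reserved keyword '{name}' as an identifier")
    return name

validate_identifier("order", ansi_mode=False)  # tolerated, as in legacy Spark
try:
    validate_identifier("order", ansi_mode=True)
except ValueError as e:
    print(e)  # Cannot use reserved keyword 'order' as an identifier
```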
[jira] [Assigned] (SPARK-26918) All .md should have ASF license header
[ https://issues.apache.org/jira/browse/SPARK-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26918: Assignee: (was: Apache Spark) > All .md should have ASF license header > -- > > Key: SPARK-26918 > URL: https://issues.apache.org/jira/browse/SPARK-26918 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.4.0, 3.0.0 >Reporter: Felix Cheung >Priority: Major > > per policy, all md files should have the header, like eg. > [https://raw.githubusercontent.com/apache/arrow/master/docs/README.md] > or > [https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md] > > currently it does not > [https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26918) All .md should have ASF license header
[ https://issues.apache.org/jira/browse/SPARK-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26918: Assignee: Apache Spark > All .md should have ASF license header > -- > > Key: SPARK-26918 > URL: https://issues.apache.org/jira/browse/SPARK-26918 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.4.0, 3.0.0 >Reporter: Felix Cheung >Assignee: Apache Spark >Priority: Major > > per policy, all md files should have the header, like eg. > [https://raw.githubusercontent.com/apache/arrow/master/docs/README.md] > or > [https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md] > > currently it does not > [https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26918) All .md should have ASF license header
[ https://issues.apache.org/jira/browse/SPARK-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775673#comment-16775673 ] Mani M edited comment on SPARK-26918 at 2/22/19 11:02 PM: -- I can take it up this change was (Author: rmsm...@gmail.com): I can take it up this project > All .md should have ASF license header > -- > > Key: SPARK-26918 > URL: https://issues.apache.org/jira/browse/SPARK-26918 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.4.0, 3.0.0 >Reporter: Felix Cheung >Priority: Major > > per policy, all md files should have the header, like eg. > [https://raw.githubusercontent.com/apache/arrow/master/docs/README.md] > or > [https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md] > > currently it does not > [https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26918) All .md should have ASF license header
[ https://issues.apache.org/jira/browse/SPARK-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775673#comment-16775673 ] Mani M commented on SPARK-26918: I can take it up this project > All .md should have ASF license header > -- > > Key: SPARK-26918 > URL: https://issues.apache.org/jira/browse/SPARK-26918 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.4.0, 3.0.0 >Reporter: Felix Cheung >Priority: Major > > per policy, all md files should have the header, like eg. > [https://raw.githubusercontent.com/apache/arrow/master/docs/README.md] > or > [https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md] > > currently it does not > [https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk resolved SPARK-26651. Resolution: Done > Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: ReleaseNote > > Spark 2.4 and previous versions use a hybrid calendar - Julian + Gregorian in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark to the Proleptic Gregorian calendar, and to use the java.time classes introduced > in Java 8 for timestamp/date manipulations. One purpose of switching > to the Proleptic Gregorian calendar is to conform to the SQL standard, which assumes > such a calendar. > *Release note:* > Spark 3.0 has switched to the Proleptic Gregorian calendar in parsing, > formatting, and converting dates and timestamps, as well as in extracting > sub-components like years and days. It uses Java 8 API classes from the > java.time packages that are based on [ISO chronology > |https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. > Previous versions of Spark performed those operations using [the hybrid > calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] > (Julian + Gregorian). The changes might affect the results for dates and > timestamps before October 15, 1582 (Gregorian). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
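The practical difference the release note warns about centers on the Gregorian cutover: the hybrid Julian + Gregorian calendar skips October 5-14, 1582, while the proleptic Gregorian calendar treats those as ordinary dates. Python's datetime module happens to use the proleptic Gregorian calendar, so it can illustrate the new behavior directly.

```python
from datetime import date

# Python's datetime uses the proleptic Gregorian calendar, the same model
# Spark 3.0 adopts, so October 5-14, 1582 are valid dates here.
d1 = date(1582, 10, 4)
d2 = date(1582, 10, 15)
print((d2 - d1).days)  # 11 days apart in proleptic Gregorian

# In the hybrid calendar (java.util.GregorianCalendar), October 4, 1582 is
# immediately followed by October 15, 1582, so the same span is one day.
# That gap is exactly why results before 1582-10-15 can differ.
missing = date(1582, 10, 5)  # valid here; nonexistent in the hybrid calendar
print(missing.isoformat())   # 1582-10-05
```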
[jira] [Assigned] (SPARK-26774) Document threading concerns in TaskSchedulerImpl
[ https://issues.apache.org/jira/browse/SPARK-26774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26774: Assignee: Apache Spark > Document threading concerns in TaskSchedulerImpl > > > Key: SPARK-26774 > URL: https://issues.apache.org/jira/browse/SPARK-26774 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Assignee: Apache Spark >Priority: Major > > TaskSchedulerImpl has a couple of places where threading concerns aren't clearly > documented, which could be improved a little. There is also a race in > {{killTaskAttempt}} on {{taskIdToExecutorId}} (though I think nobody actually > uses this api). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26774) Document threading concerns in TaskSchedulerImpl
[ https://issues.apache.org/jira/browse/SPARK-26774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26774: Assignee: (was: Apache Spark) > Document threading concerns in TaskSchedulerImpl > > > Key: SPARK-26774 > URL: https://issues.apache.org/jira/browse/SPARK-26774 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > TaskSchedulerImpl has a couple of places where threading concerns aren't clearly > documented, which could be improved a little. There is also a race in > {{killTaskAttempt}} on {{taskIdToExecutorId}} (though I think nobody actually > uses this api). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26774) Document threading concerns in TaskSchedulerImpl
[ https://issues.apache.org/jira/browse/SPARK-26774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-26774: - Description: TaskSchedulerImpl has a couple of places where threading concerns aren't clearly documented, which could be improved a little. There is also a race in {{killTaskAttempt}} on {{taskIdToExecutorId}} (though I think nobody actually uses this api). (was: TaskSchedulerImpl has a bunch of threading concerns, which are not well documented -- in fact the docs it has are somewhat misleading. In particular, some of the methods should only be called within the DAGScheduler event loop. This suggests some potential refactoring to avoid so many mixed concerns inside TaskSchedulerImpl, but that's a lot harder to do safely, I just want to add some comments.) > Document threading concerns in TaskSchedulerImpl > > > Key: SPARK-26774 > URL: https://issues.apache.org/jira/browse/SPARK-26774 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > TaskSchedulerImpl has a couple of places where threading concerns aren't clearly > documented, which could be improved a little. There is also a race in > {{killTaskAttempt}} on {{taskIdToExecutorId}} (though I think nobody actually > uses this api). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
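The race the ticket mentions is the classic pattern of one thread reading a shared task-id map while another removes entries from it. A minimal sketch of the usual remedy, with illustrative names only (this is not Spark's code; Spark's fix would live in TaskSchedulerImpl's own locking discipline):

```python
import threading

# Minimal sketch: guard a shared task-id -> executor-id map with one lock,
# and keep the lookup and the use of its result inside the same critical
# section so the entry cannot vanish in between.
task_id_to_executor_id = {}
lock = threading.Lock()

def register(task_id, executor_id):
    with lock:
        task_id_to_executor_id[task_id] = executor_id

def kill_task_attempt(task_id):
    with lock:
        executor_id = task_id_to_executor_id.get(task_id)
        # A real implementation would send a kill message to executor_id
        # here, still under the lock or with a copied value.
        return executor_id

register(7, "exec-1")
print(kill_task_attempt(7))   # exec-1
print(kill_task_attempt(99))  # None (unknown task handled gracefully)
```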
[jira] [Updated] (SPARK-26950) Make RandomDataGenerator use Float.NaN or Double.NaN for all NaN values
[ https://issues.apache.org/jira/browse/SPARK-26950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26950: -- Fix Version/s: 2.3.4 > Make RandomDataGenerator use Float.NaN or Double.NaN for all NaN values > --- > > Key: SPARK-26950 > URL: https://issues.apache.org/jira/browse/SPARK-26950 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.3.4, 2.4.2, 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.4, 2.4.1, 3.0.0 > > > Apache Spark uses the predefined `Float.NaN` and `Double.NaN` for NaN values, > but there exist more NaN values with different binary representations. > {code} > scala> java.nio.ByteBuffer.allocate(4).putFloat(Float.NaN).array > res1: Array[Byte] = Array(127, -64, 0, 0) > scala> val x = java.lang.Float.intBitsToFloat(-6966608) > x: Float = NaN > scala> java.nio.ByteBuffer.allocate(4).putFloat(x).array > res2: Array[Byte] = Array(-1, -107, -78, -80) > {code} > `RandomDataGenerator` generates these NaN values. It's good, but it causes > `checkEvaluationWithUnsafeProjection` failures due to differences in the > `UnsafeRow` binary representation. The following is the UT failure instance. > This issue aims to fix this flakiness. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102528/testReport/ > {code} > Failed > org.apache.spark.sql.avro.AvroCatalystDataConversionSuite.flat schema > struct > with seed -81044812370056695 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
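The Scala snippet in the description has a direct analogue in plain Python, which makes the flakiness mechanism easy to see: two values that are both NaN semantically can still serialize to different bytes, and a byte-level `UnsafeRow` comparison treats them as unequal.

```python
import math
import struct

# Many 32-bit patterns decode to NaN, but they serialize to different bytes.
canonical = struct.pack(">f", float("nan"))   # the usual quiet-NaN pattern
other = struct.pack(">i", -6966608)           # the same int as in the report
other_float = struct.unpack(">f", other)[0]

print(math.isnan(other_float))  # True
print(other.hex())              # ff95b2b0, i.e. Array(-1, -107, -78, -80)
print(canonical.hex())          # typically 7fc00000, Float.NaN's bit pattern
print(canonical == other)       # False -- same NaN semantics, different bytes
```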
[jira] [Created] (SPARK-26975) Support nested-column pruning over limit/sample/repartition
Dongjoon Hyun created SPARK-26975: - Summary: Support nested-column pruning over limit/sample/repartition Key: SPARK-26975 URL: https://issues.apache.org/jira/browse/SPARK-26975 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Dongjoon Hyun As the benchmark in SPARK-26958 shows, nested-column pruning has limitations. This issue aims to remove the limitations on `limit/repartition/sample`. In this issue, repartition means `Repartition`, not `RepartitionByExpression`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22860) Spark workers log ssl passwords passed to the executors
[ https://issues.apache.org/jira/browse/SPARK-22860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775630#comment-16775630 ] tooptoop4 commented on SPARK-22860: --- [~kabhwan] spark.ssl.keyStorePassword and spark.ssl.keyPassword don't need to be passed to CoarseGrainedExecutorBackend. Only spark.ssl.trustStorePassword is used > Spark workers log ssl passwords passed to the executors > --- > > Key: SPARK-22860 > URL: https://issues.apache.org/jira/browse/SPARK-22860 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Felix K. >Priority: Major > > The workers log the spark.ssl.keyStorePassword and > spark.ssl.trustStorePassword passed by cli to the executor processes. The > ExecutorRunner should escape passwords to not appear in the worker's log > files in INFO level. In this example, you can see my 'SuperSecretPassword' in > a worker log: > {code} > 17/12/08 08:04:12 INFO ExecutorRunner: Launch command: > "/global/myapp/oem/jdk/bin/java" "-cp" > "/global/myapp/application/myapp_software/thing_loader_lib/core-repository-model-zzz-1.2.3-SNAPSHOT.jar > [...] 
> :/global/myapp/application/spark-2.1.1-bin-hadoop2.7/jars/*" "-Xmx16384M" > "-Dspark.authenticate.enableSaslEncryption=true" > "-Dspark.ssl.keyStorePassword=SuperSecretPassword" > "-Dspark.ssl.keyStore=/global/myapp/application/config/ssl/keystore.jks" > "-Dspark.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks" > "-Dspark.ssl.enabled=true" "-Dspark.driver.port=39927" > "-Dspark.ssl.protocol=TLS" > "-Dspark.ssl.trustStorePassword=SuperSecretPassword" > "-Dspark.authenticate=true" "-Dmyapp_IMPORT_DATE=2017-10-30" > "-Dmyapp.config.directory=/global/myapp/application/config" > "-Dsolr.httpclient.builder.factory=com.company.myapp.loader.auth.LoaderConfigSparkSolrBasicAuthConfigurer" > > "-Djavax.net.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks" > "-XX:+UseG1GC" "-XX:+UseStringDeduplication" > "-Dthings.loader.export.zzz_files=false" > "-Dlog4j.configuration=file:/global/myapp/application/config/spark-executor-log4j.properties" > "-XX:+HeapDumpOnOutOfMemoryError" "-XX:+UseStringDeduplication" > "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" > "spark://CoarseGrainedScheduler@192.168.0.1:39927" "--executor-id" "2" > "--hostname" "192.168.0.1" "--cores" "4" "--app-id" "app-20171208080412-" > "--worker-url" "spark://Worker@192.168.0.1:59530" > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
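A hedged sketch of the kind of redaction ExecutorRunner could apply before logging the launch command. The regex and helper below are hypothetical illustrations, not Spark's code; newer Spark releases added a configurable `spark.redaction.regex` setting for exactly this kind of masking.

```python
import re

# Mask the value of any -D option whose key looks secret before it reaches
# the worker log. The key heuristic (password|secret|token) mirrors common
# redaction defaults; it is an assumption, not Spark's exact rule.
SECRET_KEY = re.compile(r'(-D[\w.]*(?:password|secret|token)[\w.]*=)[^"\s]+',
                        re.IGNORECASE)

def redact(command_line: str) -> str:
    return SECRET_KEY.sub(r"\1*********", command_line)

cmd = ('"-Dspark.ssl.keyStorePassword=SuperSecretPassword" '
       '"-Dspark.ssl.trustStorePassword=SuperSecretPassword" '
       '"-Dspark.ssl.protocol=TLS"')
print(redact(cmd))  # passwords replaced by *********; protocol left intact
```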
[jira] [Assigned] (SPARK-26975) Support nested-column pruning over limit/sample/repartition
[ https://issues.apache.org/jira/browse/SPARK-26975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26975: Assignee: Apache Spark > Support nested-column pruning over limit/sample/repartition > --- > > Key: SPARK-26975 > URL: https://issues.apache.org/jira/browse/SPARK-26975 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > As SPARK-26958 shows the benchmark, nested-column pruning has limitations. > This issue aims to remove the limitations on `limit/repartition/sample`. In > this issue, repartition means `Repartition`, not `RepartitionByExpression`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26975) Support nested-column pruning over limit/sample/repartition
[ https://issues.apache.org/jira/browse/SPARK-26975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26975: Assignee: (was: Apache Spark) > Support nested-column pruning over limit/sample/repartition > --- > > Key: SPARK-26975 > URL: https://issues.apache.org/jira/browse/SPARK-26975 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > As SPARK-26958 shows the benchmark, nested-column pruning has limitations. > This issue aims to remove the limitations on `limit/repartition/sample`. In > this issue, repartition means `Repartition`, not `RepartitionByExpression`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26895) When running spark 2.3 as a proxy user (--proxy-user), SparkSubmit fails to resolve globs owned by target user
[ https://issues.apache.org/jira/browse/SPARK-26895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-26895: -- Assignee: Alessandro Bellina > When running spark 2.3 as a proxy user (--proxy-user), SparkSubmit fails to > resolve globs owned by target user > -- > > Key: SPARK-26895 > URL: https://issues.apache.org/jira/browse/SPARK-26895 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.4.0 >Reporter: Alessandro Bellina >Assignee: Alessandro Bellina >Priority: Critical > > We are resolving globs in SparkSubmit here (by way of > prepareSubmitEnvironment) without first going into a doAs: > https://github.com/apache/spark/blob/6c18d8d8079ac4d2d6dc7539601ab83fc5b51760/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L143 > Without first entering a doAs, as done here: > [https://github.com/apache/spark/blob/6c18d8d8079ac4d2d6dc7539601ab83fc5b51760/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L151] > So when running spark-submit with --proxy-user, and for example --archives, > it will fail to launch unless the location of the archive is open to the user > that executed spark-submit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26895) When running spark 2.3 as a proxy user (--proxy-user), SparkSubmit fails to resolve globs owned by target user
[ https://issues.apache.org/jira/browse/SPARK-26895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-26895. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23806 [https://github.com/apache/spark/pull/23806] > When running spark 2.3 as a proxy user (--proxy-user), SparkSubmit fails to > resolve globs owned by target user > -- > > Key: SPARK-26895 > URL: https://issues.apache.org/jira/browse/SPARK-26895 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.4.0 >Reporter: Alessandro Bellina >Assignee: Alessandro Bellina >Priority: Critical > Fix For: 3.0.0 > > > We are resolving globs in SparkSubmit here (by way of > prepareSubmitEnvironment) without first going into a doAs: > https://github.com/apache/spark/blob/6c18d8d8079ac4d2d6dc7539601ab83fc5b51760/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L143 > Without first entering a doAs, as done here: > [https://github.com/apache/spark/blob/6c18d8d8079ac4d2d6dc7539601ab83fc5b51760/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L151] > So when running spark-submit with --proxy-user, and for example --archives, > it will fail to launch unless the location of the archive is open to the user > that executed spark-submit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
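The bug and its fix are an ordering problem: glob resolution must happen inside the impersonation context (Hadoop's `ugi.doAs`), not before it. A toy Python model of that ordering follows; the `impersonate` context manager is a made-up stand-in for `UserGroupInformation.doAs`, used purely to show which call must nest inside which.

```python
from contextlib import contextmanager

# Record the call order so the buggy and fixed sequences are visible.
calls = []

@contextmanager
def impersonate(user):
    # Stand-in for Hadoop's ugi.doAs: everything inside runs as `user`.
    calls.append(f"doAs({user}) enter")
    yield
    calls.append(f"doAs({user}) exit")

def resolve_globs(pattern):
    calls.append(f"resolve({pattern})")
    return [pattern]  # real code would expand the pattern on the filesystem

# Buggy order (what the ticket describes): globs resolved as the submitter.
resolve_globs("/data/archives/*.zip")
with impersonate("alice"):
    pass

# Fixed order: resolution happens with the proxy user's permissions.
with impersonate("alice"):
    resolve_globs("/data/archives/*.zip")

print(calls)
```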
[jira] [Updated] (SPARK-26972) Issue with CSV import and inferSchema set to true
[ https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean Georges Perrin updated SPARK-26972: Description: I found a few discrepencies while working with inferSchema set to true in CSV ingestion. Given the following CSV in the attached books.csv: {noformat} id;authorId;title;releaseDate;link 1;1;Fantastic Beasts and Where to Find Them: The Original Screenplay;11/18/16;http://amzn.to/2kup94P 2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP 3;1;*The Tales of Beedle the Bard, Standard Edition (Harry Potter)*;12/4/08;http://amzn.to/2kYezqr 4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n 5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT 6;2;*Development Tools in 2006: any Room for a 4GL-style Language? An independent study by Jean Georges Perrin, IIUG Board Member*;12/28/16;http://amzn.to/2vBxOe1 7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav 8;3;A Connecticut Yankee in King Arthur's Court;6/17/17;http://amzn.to/2x1NuoD 10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA 11;4;Diderot Encyclopedia: The Complete Illustrations 1762-1777;;http://amzn.to/2i2zo3I 12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ 13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW 14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk 15;7;Soft Skills: The software developer's life manual;12/29/14;http://amzn.to/2zNnSyn 16;8;Of Mice and Men;;http://amzn.to/2zJjXoc 17;9;*Java 8 in Action: Lambdas; Streams; and functional-style programming*;8/28/14;http://amzn.to/2isdqoL 18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY 19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG 20;14;*Fables choisies; mises en vers par M. 
de La Fontaine*;9/1/1999;http://amzn.to/2yRH10W 21;15;Discourse on Method and Meditations on First Philosophy;6/15/1999;http://amzn.to/2hwB8zc 22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo 23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo{noformat} And this Java code: {code:java} Dataset df = spark.read().format("csv") .option("header", "true") .option("multiline", true) .option("sep", ";") .option("quote", "*") .option("dateFormat", "M/d/y") .option("inferSchema", true) .load("data/books.csv"); df.show(7); df.printSchema(); {code} h1. In Spark v2.0.1 Output: {noformat} +---+++---++ | id|authorId| title|releaseDate|link| +---+++---++ | 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...| | 2| 1|Harry Potter and ...|10/6/15|http://amzn.to/2l...| | 3| 1|The Tales of Beed...|12/4/08|http://amzn.to/2k...| | 4| 1|Harry Potter and ...|10/4/16|http://amzn.to/2k...| | 5| 2|Informix 12.10 on...|4/23/17|http://amzn.to/2i...| | 6| 2|Development Tools...| 12/28/16|http://amzn.to/2v...| | 7| 3|Adventures of Huc...|. 5/26/94|http://amzn.to/2w...| +---+++---++ only showing top 7 rows Dataframe's schema: root |-- id: integer (nullable = true) |-- authorId: integer (nullable = true) |-- title: string (nullable = true) |-- releaseDate: string (nullable = true) |-- link: string (nullable = true) {noformat} *This is fine and the expected output*. h1. 
Using Apache Spark v2.1.3 Excerpt of the dataframe content: {noformat} ++++---++ | id|authorId| title|releaseDate| link| ++++---++ | 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...| | 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...| | 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...| | 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...| | 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...| | 6| 2|Development Tools...| null| null| |An independent st...|12/28/16|http://amzn.to/2v...| null| null| ++++---++ only showing top 7 rows Dataframe's schema: root |-- id: string (nullable = true) |-- authorId: string (nullable = true) |-- title: string (nullable = true) |-- releaseDate: string (nullable = true) |-- link: string (nullable = true){noformat} The *multiline* option is *not recognized*. And, of course, the schema is wrong. h1. Using Apache Spark v2.2.3 Excerpt of the dataframe content: {noformat} +---+++---++ | id|authorId| title|releaseDate| link | +---+++---
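The read options in the report map directly onto Python's csv dialect parameters (`sep=';'` to `delimiter`, `quote='*'` to `quotechar`), which gives a quick reference parse for what a multiline-aware reader should produce. This is only an illustration of the expected behavior, not Spark's CSV parser.

```python
import csv
import io

# Same dialect as the reported job: ';' separator, '*' quote character, and
# a quoted field that spans two physical lines.
raw = (
    "id;authorId;title\n"
    "6;2;*Development Tools in 2006: any Room for a 4GL-style Language?\n"
    "An independent study by Jean Georges Perrin, IIUG Board Member*\n"
)
rows = list(csv.reader(io.StringIO(raw), delimiter=";", quotechar="*"))

# A correct parse keeps the quoted title (embedded newline and all) in one
# record -- exactly what the Spark 2.1.3 run above fails to do.
print(len(rows))           # 2
print(rows[1][0])          # 6
print("\n" in rows[1][2])  # True
```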
[jira] [Updated] (SPARK-26972) Issue with CSV import and inferSchema set to true
[ https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean Georges Perrin updated SPARK-26972:
Description:
I found a few discrepancies while working with inferSchema set to true in CSV ingestion. Given the following CSV in the attached books.csv:
{noformat}
id;authorId;title;releaseDate;link
1;1;Fantastic Beasts and Where to Find Them: The Original Screenplay;11/18/16;http://amzn.to/2kup94P
2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP
3;1;*The Tales of Beedle the Bard, Standard Edition (Harry Potter)*;12/4/08;http://amzn.to/2kYezqr
4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n
5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT
6;2;*Development Tools in 2006: any Room for a 4GL-style Language? An independent study by Jean Georges Perrin, IIUG Board Member*;12/28/16;http://amzn.to/2vBxOe1
7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav
8;3;A Connecticut Yankee in King Arthur's Court;6/17/17;http://amzn.to/2x1NuoD
10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA
11;4;Diderot Encyclopedia: The Complete Illustrations 1762-1777;;http://amzn.to/2i2zo3I
12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ
13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW
14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk
15;7;Soft Skills: The software developer's life manual;12/29/14;http://amzn.to/2zNnSyn
16;8;Of Mice and Men;;http://amzn.to/2zJjXoc
17;9;*Java 8 in Action: Lambdas; Streams; and functional-style programming*;8/28/14;http://amzn.to/2isdqoL
18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY
19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG
20;14;*Fables choisies; mises en vers par M. de La Fontaine*;9/1/1999;http://amzn.to/2yRH10W
21;15;Discourse on Method and Meditations on First Philosophy;6/15/1999;http://amzn.to/2hwB8zc
22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo
23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo
{noformat}
And this Java code:
{code:java}
Dataset<Row> df = spark.read().format("csv")
    .option("header", "true")
    .option("multiline", true)
    .option("sep", ";")
    .option("quote", "*")
    .option("dateFormat", "M/d/y")
    .option("inferSchema", true)
    .load("data/books.csv");
df.show(7);
df.printSchema();
{code}
h1. In Spark v2.0.1
Output:
{noformat}
+---+--------+--------------------+-----------+--------------------+
| id|authorId|               title|releaseDate|                link|
+---+--------+--------------------+-----------+--------------------+
|  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
|  2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|
|  3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|
|  4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|
|  5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|
|  6|       2|Development Tools...|   12/28/16|http://amzn.to/2v...|
|  7|       3|Adventures of Huc...|    5/26/94|http://amzn.to/2w...|
+---+--------+--------------------+-----------+--------------------+
only showing top 7 rows

Dataframe's schema:
root
 |-- id: integer (nullable = true)
 |-- authorId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- releaseDate: string (nullable = true)
 |-- link: string (nullable = true)
{noformat}
*This is fine and the expected output*.
h1.
Using Apache Spark v2.1.3
Excerpt of the dataframe content:
{noformat}
+--------------------+--------+--------------------+-----------+--------------------+
|                  id|authorId|               title|releaseDate|                link|
+--------------------+--------+--------------------+-----------+--------------------+
|                   1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
|                   2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|
|                   3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|
|                   4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|
|                   5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|
|                   6|       2|Development Tools...|       null|                null|
|An independent st...|12/28/16|http://amzn.to/2v...|       null|                null|
+--------------------+--------+--------------------+-----------+--------------------+
only showing top 7 rows

Dataframe's schema:
root
 |-- id: string (nullable = true)
 |-- authorId: string (nullable = true)
 |-- title: string (nullable = true)
 |-- releaseDate: string (nullable = true)
 |-- link: string (nullable = true)
{noformat}
The *multiline* option is *not recognized*. And, of course, the schema is wrong.
h1. Using Apache Spark v2.2.3
Excerpt of the dataframe content:
{noformat}
+---+--------+--------------------+-----------+--------------------+
| id|authorId|               title|releaseDate|                link|
+---+--------+--------------------+-----------+--------------------+
|  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
|  2|       1|Harry P
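The crux of the SPARK-26972 report is the combination of the `sep`, `quote`, and `multiline` options. As a frame of reference (Python's stdlib csv module here, not Spark's parser), the intended semantics look like this: with ';' as the separator and '*' as the quote character, a quoted field may contain separators and even a line break and still form a single record. The data below is a hypothetical miniature of books.csv.

```python
import csv
import io

# Miniature stand-in for books.csv (three of the rows above), parsed with
# Python's stdlib csv module rather than Spark, to illustrate the semantics
# the sep/quote/multiline options are expected to provide.
data = (
    "id;authorId;title\n"
    "6;2;*Development Tools in 2006: any Room for a 4GL-style Language?\n"
    "An independent study*\n"
    "7;3;Adventures of Huckleberry Finn\n"
)

rows = list(csv.reader(io.StringIO(data), delimiter=";", quotechar="*"))

# The '*'-quoted title absorbs the line break: record 6 stays one row.
assert len(rows) == 3                 # header + 2 records
assert "\n" in rows[1][2]             # embedded newline preserved in the title
assert rows[2] == ["7", "3", "Adventures of Huckleberry Finn"]
```

This is exactly the behavior the v2.0.1 output shows and the v2.1.3 output loses: there, the continuation line "An independent st..." surfaces as a spurious extra row.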
[jira] [Commented] (SPARK-26973) Kubernetes version support strategy on test nodes / backend
[ https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775435#comment-16775435 ] Erik Erlandson commented on SPARK-26973:
A couple of other points:
* Currently, k8s is evolving in a manner where breakage of existing functionality is low probability, so in a scenario where we are choosing one version to test against, testing against the earliest version we wish to support is probably optimal. (This heuristic might change in the future, for example if k8s goes to a 2.x series where backward compatibility may be broken.)
* The integration testing was designed to support running against external clusters (GCP, etc.) - this might provide an approach to supporting testing against multiple k8s versions. However, it would come with additional op-ex costs and decreased control over the environment. I mention it mostly because it's a plausible path to outsourcing some of the combinatorics that [~shaneknapp] discussed above.
> Kubernetes version support strategy on test nodes / backend
> ---
>
> Key: SPARK-26973
> URL: https://issues.apache.org/jira/browse/SPARK-26973
> Project: Spark
> Issue Type: Test
> Components: Kubernetes
> Affects Versions: 3.0.0
> Reporter: Stavros Kontopoulos
> Priority: Major
>
> Kubernetes has a policy for supporting three minor releases and the current
> ones are defined here:
> [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md]
> Moving from release 1.x to 1.(x+1) happens roughly every 100 days:
> [https://gravitational.com/blog/kubernetes-release-cycle]
> This has an effect on dependencies upgrade at the Spark on K8s backend and
> the version of Minikube required to be supported for testing. One other issue
> is what the users actually want at the given time of a release.
> Some popular vendors like EKS ([https://aws.amazon.com/eks/faqs/]) have their
> own roadmap for releases and may not catch up fast (what is our view on this?).
> Follow the comments for a recent discussion on the topic:
> [https://github.com/apache/spark/pull/23814]
> Clearly we need a strategy for this.
> A couple of options for the current state of things:
> a) Support only the last two versions, but that leaves out a version that
> still receives patches.
> b) Support only the latest, which makes testing easier, but leaves out other
> currently maintained versions.
> A good strategy will optimize at least the following:
> 1) percentage of users satisfied at release time
> 2) how long it takes to support the latest K8s version
> 3) testing requirements, e.g. minikube versions used
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
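The three-minor-release policy and the ~100-day cadence cited in the issue fix how many k8s versions are "in support" at any time, which is what options (a) and (b) trade off against. A small sketch of that arithmetic (the version numbers are illustrative, not a statement about what Spark actually tests):

```python
# Sketch of the k8s support-window arithmetic described in the issue.
# Assumption: upstream patches the three most recent minor releases.
def supported_minors(latest_minor: int, window: int = 3) -> list:
    """Minor versions still receiving patches when 1.<latest_minor> is current."""
    return [f"1.{latest_minor - i}" for i in range(window)]

# With 1.13 current, option (a) covers two of these and option (b) only one:
print(supported_minors(13))  # ['1.13', '1.12', '1.11']
```

With a new minor every ~100 days, each entry in that list rolls off roughly three times a year, which is the churn the test nodes would have to absorb.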
[jira] [Updated] (SPARK-26974) Invalid data in grouped cached dataset, formed by joining a large cached dataset with a small dataset
[ https://issues.apache.org/jira/browse/SPARK-26974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Utkarsh Sharma updated SPARK-26974:
---
Description:
The initial datasets are derived from hive tables using the spark.table() functions.
Dataset descriptions:
*+Sales+* dataset (close to 10 billion rows) with the following columns (and sample rows):
||ItemId (bigint)||CustomerId (bigint)||qty_sold (bigint)||
|1|1|20|
|1|2|30|
|2|1|40|
+*Customer*+ dataset (close to 5 rows) with the following columns (and sample rows):
||CustomerId (bigint)||CustomerGrpNbr (smallint)||
|1|1|
|2|2|
|3|1|
I am doing the following steps:
# Caching the sales dataset (close to 10 billion rows).
# Doing an inner join of 'sales' with the 'customer' dataset.
# Doing a group by on the resultant dataset, based on the CustomerGrpNbr column, to get sum(qty_sold) and stddev(qty_sold) values within the customer groups.
# Caching the resultant grouped dataset.
# Doing a .count() on the grouped dataset.
The step 5 count is supposed to return only 20, because customer.select("CustomerGrpNbr").distinct().count() returns 20 values. However, step 5 returns around 65,000.
Following are the commands I am running in spark-shell:
{code:java}
var sales = spark.table("sales_table")
var customer = spark.table("customer_table")
var finalDf = sales.join(customer, "CustomerId").groupBy("CustomerGrpNbr").agg(sum("qty_sold"), stddev("qty_sold"))
sales.cache()
finalDf.cache()
finalDf.count() // returns around 65k rows and the count keeps on varying each run
customer.select("CustomerGrpNbr").distinct().count() // returns 20
{code}
I have been able to replicate the same behavior using the Java API as well. This anomalous behavior disappears, however, when I remove the caching statements, i.e. if I run the following in spark-shell, it works as expected:
{code:java}
var sales = spark.table("sales_table")
var customer = spark.table("customer_table")
var finalDf = sales.join(customer, "CustomerId").groupBy("CustomerGrpNbr").agg(sum("qty_sold"), stddev("qty_sold"))
finalDf.count() // returns 20
customer.select("CustomerGrpNbr").distinct().count() // returns 20
{code}
The tables in hive from which the datasets are built do not change during this entire process. So why does the caching cause this problem?
> Invalid data in grouped cached dataset, formed by joining a large cached
> dataset with a small dataset
> -
>
> Key: SPARK-26974
> URL: https://issues.apache.org/jira/browse/SPARK-26974
> Project: Spark
> Issue Type: Bug
> Components: Java API
[jira] [Created] (SPARK-26974) Invalid data in grouped cached dataset, formed by joining a large cached dataset with a small dataset
Utkarsh Sharma created SPARK-26974: -- Summary: Invalid data in grouped cached dataset, formed by joining a large cached dataset with a small dataset Key: SPARK-26974 URL: https://issues.apache.org/jira/browse/SPARK-26974 Project: Spark Issue Type: Bug Components: Java API, Spark Core, SQL Affects Versions: 2.2.0 Reporter: Utkarsh Sharma The initial datasets are derived from hive tables using the spark.table() functions. Dataset descriptions: *+Sales+* dataset (close to 10 billion rows) with the following columns (and sample rows) : ||ItemId (bigint)||CustomerId (bigint)||qty_sold (bigint)|| |1|1|20| |1|2|30| |2|1|40| +*Customer*+ Dataset (close to 5 rows) with the following columns (and sample rows): ||CustomerId (bigint)||CustomerGrpNbr (smallint)|| |1|1| |2|2| |3|1| I am doing the following steps: # Caching sales dataset with close to 10 billion rows. # Doing an inner join of 'sales' with 'customer' dataset # Doing group by on the resultant dataset, based on CustomerGrpNbr column to get sum(qty_sold) and stddev(qty_sold) vales in the customer groups. # Caching the resultant grouped dataset. # Doing a .count() on the grouped dataset. The step 5 count is supposed to return only 20, because when you do a customer.select("CustomerGroupNbr").distinct().count you get 20 values. However, you get a value of around 65,000 in step 5. Following are the commands I am running in spark-shell: {code:java} var sales = spark.table("sales_table") var customer = spark.table(“customer_table”) var finalDf = sales.join(customer, "CustomerId").groupBy("CustomerGrpNbr").agg(sum("qty_sold"), stddev("qty_sold")) sales.cache() finalDf.cache() finalDf.count() // returns around 65k rows and the count keeps on varying each // run customer.select("CustomerGrpNbr").distinct().count() //returns 20{code} I have been able to replicate the same behavior using the java api as well. This anamolous behavior disappears however, when I remove the caching statements. I.e. 
if I run the following in spark-shell, it works as expected: {code:java} var sales = spark.table("sales_table") var customer = spark.table("customer_table") var finalDf = sales.join(customer, "CustomerId").groupBy("CustomerGrpNbr").agg(sum("qty_sold"), stddev("qty_sold")) finalDf.count() // returns 20 customer.select("CustomerGrpNbr").distinct().count() // returns 20 {code} The tables in Hive from which the datasets are built do not change during this entire process. So why does the caching cause this problem? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
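Independent of Spark, the invariant the reporter expects can be sketched in plain Python on a hypothetical miniature of the sample tables above: after an inner join on CustomerId, the grouped row count must equal the number of distinct CustomerGrpNbr values that survive the join, whether or not anything is cached.

```python
from collections import defaultdict

# Miniature, hypothetical versions of the sample tables in the report.
sales = [(1, 1, 20), (1, 2, 30), (2, 1, 40)]   # (ItemId, CustomerId, qty_sold)
customer = {1: 1, 2: 2, 3: 1}                  # CustomerId -> CustomerGrpNbr

# Inner join on CustomerId, then group by CustomerGrpNbr and sum qty_sold.
totals = defaultdict(int)
for _, cust_id, qty in sales:
    if cust_id in customer:
        totals[customer[cust_id]] += qty

# The grouped count equals the number of distinct groups in the join
# result -- here 2 (groups 1 and 2) -- independent of any caching.
print(len(totals))   # 2
print(dict(totals))  # {1: 60, 2: 30}
```

With the reporter's full data the same reasoning gives 20 groups, which is why a count of ~65,000 on the cached dataset points at the cache rather than the aggregation.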
[jira] [Commented] (SPARK-26973) Kubernetes version support strategy on test nodes / backend
[ https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775394#comment-16775394 ] shane knapp commented on SPARK-26973: - i was chatting over email w/[~eje] about this yesterday, and the TL;DR is: only one version to test against, please! here are some bullet points, in no particular order, to summarize what [~eje] and i discussed: * we can easily test against any version of k8s via the {{--kubernetes-version}} flag passed to {{minikube start}}, so testing against N versions shouldn't be hard. * there is a moving range of k8s versions that a specific minikube release can support (i.e. minikube v0.23.0 only supports up to k8s 1.13.1). * we are limited to *one* k8s/minikube build per node at any time, so adding tests for more than one k8s version to the suite will definitely increase resource contention. currently spark is the only minikube consumer, but some upcoming lab projects will need their own k8s integration tests. * the operational overhead of managing minikube, k8s and all of the VM-layer drivers is highly non-trivial. > Kubernetes version support strategy on test nodes / backend > --- > > Key: SPARK-26973 > URL: https://issues.apache.org/jira/browse/SPARK-26973 > Project: Spark > Issue Type: Test > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Kubernetes has a policy for supporting three minor releases and the current > ones are defined here: > [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] > Moving from release 1.x to 1.(x+1) happens roughly every 100 > days:[https://gravitational.com/blog/kubernetes-release-cycle.] > This has an effect on dependencies upgrade at the Spark on K8s backend and > the version of Minikube required to be supported for testing.
One other issue > is what the users actually want at the given time of a release. Some popular > vendors like EKS ([https://aws.amazon.com/eks/faqs/]) have their own roadmap > for releases and may not catch up fast (what is our view on this). > Follow the comments for a recent discussion on the topic: > [https://github.com/apache/spark/pull/23814.] > Clearly we need a strategy for this. > A couple of options for the current state of things: > a) Support only the last two versions, but that leaves out a version that > still receives patches. > b) Support only the latest, which makes testing easier, but leaves out other > currently maintained versions. > A good strategy will optimize at least the following: > 1) percentage of users satisfied at release time. > 2) how long it takes to support the latest K8s version > 3) testing requirements, e.g. minikube versions used >
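The version pinning discussed in the comment above can be sketched as a tiny helper that builds the {{minikube start}} invocation for a chosen Kubernetes release. This is a hypothetical helper, not part of the Spark test harness; the {{--kubernetes-version}} flag is the one cited in the comment, and v1.13.1 is the example version mentioned there.

```python
def minikube_start_cmd(k8s_version: str) -> list[str]:
    # argv for starting minikube pinned to a specific Kubernetes release;
    # a CI job could loop this over N versions if the test matrix grew.
    return ["minikube", "start", f"--kubernetes-version={k8s_version}"]

print(" ".join(minikube_start_cmd("v1.13.1")))
# minikube start --kubernetes-version=v1.13.1
```

Pinning one version per node keeps the single-build-per-node constraint shane knapp describes, while leaving the version itself a one-line change.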
[jira] [Updated] (SPARK-26974) Invalid data in grouped cached dataset, formed by joining a large cached dataset with a small dataset
[ https://issues.apache.org/jira/browse/SPARK-26974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Utkarsh Sharma updated SPARK-26974: --- Description: The initial datasets are derived from Hive tables using the spark.table() function. Dataset descriptions: *+Sales+* dataset (close to 10 billion rows) with the following columns (and sample rows): ||ItemId (bigint)||CustomerId (bigint)||qty_sold (bigint)|| |1|1|20| |1|2|30| |2|1|40| +*Customer*+ dataset (close to 5 rows) with the following columns (and sample rows): ||CustomerId (bigint)||CustomerGrpNbr (smallint)|| |1|1| |2|2| |3|1| I am doing the following steps: # Caching the sales dataset with close to 10 billion rows. # Doing an inner join of 'sales' with the 'customer' dataset. # Doing a group by on the resultant dataset, based on the CustomerGrpNbr column, to get sum(qty_sold) and stddev(qty_sold) values in the customer groups. # Caching the resultant grouped dataset. # Doing a .count() on the grouped dataset. The step 5 count is supposed to return only 20, because customer.select("CustomerGrpNbr").distinct().count returns 20 values. However, you get a value of around 65,000 in step 5. Following are the commands I am running in spark-shell: {code:java} var sales = spark.table("sales_table") var customer = spark.table("customer_table") var finalDf = sales.join(customer, "CustomerId").groupBy("CustomerGrpNbr").agg(sum("qty_sold"), stddev("qty_sold")) sales.cache() finalDf.cache() finalDf.count() // returns around 65k rows and the count keeps on varying each run customer.select("CustomerGrpNbr").distinct().count() // returns 20{code} I have been able to replicate the same behavior using the Java API as well. This anomalous behavior disappears, however, when I remove the caching statements, i.e.
if i run the following in spark-shell, it works as expected: {code:java} var sales = spark.table("sales_table") var customer = spark.table("customer_table") var finalDf = sales.join(customer, "CustomerId").groupBy("CustomerGrpNbr").agg(sum("qty_sold"), stddev("qty_sold")) finalDf.count() // returns 20 customer.select("CustomerGrpNbr").distinct().count() //returns 20 {code} The tables in hive from which the datasets are built do not change during this entire process. So why does the caching cause this problem? was: The initial datasets are derived from hive tables using the spark.table() functions. Dataset descriptions: *+Sales+* dataset (close to 10 billion rows) with the following columns (and sample rows) : ||ItemId (bigint)||CustomerId (bigint)||qty_sold (bigint)|| |1|1|20| |1|2|30| |2|1|40| +*Customer*+ Dataset (close to 5 rows) with the following columns (and sample rows): ||CustomerId (bigint)||CustomerGrpNbr (smallint)|| |1|1| |2|2| |3|1| I am doing the following steps: # Caching sales dataset with close to 10 billion rows. # Doing an inner join of 'sales' with 'customer' dataset # Doing group by on the resultant dataset, based on CustomerGrpNbr column to get sum(qty_sold) and stddev(qty_sold) vales in the customer groups. # Caching the resultant grouped dataset. # Doing a .count() on the grouped dataset. The step 5 count is supposed to return only 20, because when you do a customer.select("CustomerGroupNbr").distinct().count you get 20 values. However, you get a value of around 65,000 in step 5. 
Following are the commands I am running in spark-shell: {code:java} var sales = spark.table("sales_table") var customer = spark.table(“customer_table”) var finalDf = sales.join(customer, "CustomerId").groupBy("CustomerGrpNbr").agg(sum("qty_sold"), stddev("qty_sold")) sales.cache() finalDf.cache() finalDf.count() // returns around 65k rows and the count keeps on varying each // run customer.select("CustomerGrpNbr").distinct().count() //returns 20{code} I have been able to replicate the same behavior using the java api as well. This anamolous behavior disappears however, when I remove the caching statements. I.e. if i run the following in spark-shell, it works as expected: {code:java} var sales = spark.table("sales_table") var customer = spark.table(“customer_table”) var finalDf = sales.join(customer, "CustomerId").groupBy("CustomerGrpNbr").agg(sum("qty_sold"), stddev("qty_sold")) finalDf.count() // returns 20 customer.select("CustomerGrpNbr").distinct().count() //returns 20 {code} The tables in hive from which the datasets are built do not change during this entire process. So why does the caching cause this problem? > Invalid data in grouped cached dataset, formed by joining a large cached > dataset with a small dataset > - > > Key: SPARK-26974 > URL: https://issues.apache.org/jira/browse/SPARK-26974 > Project: Spark > Issue Type: Bug > Components: Java
[jira] [Commented] (SPARK-26973) Kubernetes version support strategy on test nodes / backend
[ https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775355#comment-16775355 ] Sean Owen commented on SPARK-26973: --- I think I'd suggest testing against one version or else this could get complicated fast. The latest version we support is a good place to start. How about that until something tells us we miss too many big problems without more tests? > Kubernetes version support strategy on test nodes / backend > --- > > Key: SPARK-26973 > URL: https://issues.apache.org/jira/browse/SPARK-26973 > Project: Spark > Issue Type: Test > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Kubernetes has a policy for supporting three minor releases and the current > ones are defined here: > [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] > Moving from release 1.x to 1.(x+1) happens roughly every 100 > days:[https://gravitational.com/blog/kubernetes-release-cycle.] > This has an effect on dependencies upgrade at the Spark on K8s backend and > the version of Minikube required to be supported for testing. One other issue > is what the users actually want at the given time of a release. Some popular > vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap > for releases and may not catch up fast (what is our view on this). > Follow the comments for a recent discussion on the topic: > [https://github.com/apache/spark/pull/23814.] > Clearly we need a strategy for this. > A couple of options for the current state of things: > a) Support only the last two versions, but that leaves out a version that > still receives patches. > b) Support only the latest, which makes testing easier, but leaves out other > currently maintained version. 
> A good strategy will optimize at least the following: > 1) percentage of users satisfied at release time. > 2) how long it takes to support the latest K8s version > 3) testing requirements eg. minikube versions used >
[jira] [Updated] (SPARK-26973) Kubernetes version support strategy on test nodes / backend
[ https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-26973: Description: Kubernetes has a policy for supporting three minor releases and the current ones are defined here: [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] Moving from release 1.x to 1.(x+1) happens roughly every 100 days:[https://gravitational.com/blog/kubernetes-release-cycle.] This has an effect on dependencies upgrade at the Spark on K8s backend and the version of Minikube required to be supported for testing. One other issue is what the users actually want at the given time of a release. Some popular vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap for releases and may not catch up fast (what is our view on this). Follow the comments for a recent discussion on the topic: [https://github.com/apache/spark/pull/23814.] Clearly we need a strategy for this. A couple of options for the current state of things: a) Support only the last two versions, but that leaves out a version that still receives patches. b) Support only the latest, which makes testing easier, but leaves out other currently maintained version. A good strategy will optimize at least the following: 1) percentage of users satisfied at release time. 2) how long it takes to support the latest K8s version 3) testing requirements eg. minikube versions used was: Kubernetes has a policy for supporting three minor releases and the current ones are defined here: [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] Moving from release 1.x to 1.(x+1) happens roughly every 100 days:[https://gravitational.com/blog/kubernetes-release-cycle.] 
This has an effect on dependencies upgrade at the Spark on K8s backend and the version of Minikube required to be supported for testing. One other issue is what the users actually want at the given time of a release. Some popular vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap for releases and may not catch up fast (what is our view for this). Follow the comments for a recent discussion on the topic: [https://github.com/apache/spark/pull/23814.] Clearly we need a strategy for this. A couple of options for the current state of things: a) Support only the last two versions, but that leaves out a version that still receives patches. b) Support only the latest, which makes testing easier, but leaves out other currently maintained version. A good strategy will optimize at least the following: 1) percentage of users satisfied at release time. 2) how long it takes to support the latest K8s version 3) testing requirements eg. minikube versions used > Kubernetes version support strategy on test nodes / backend > --- > > Key: SPARK-26973 > URL: https://issues.apache.org/jira/browse/SPARK-26973 > Project: Spark > Issue Type: Test > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Kubernetes has a policy for supporting three minor releases and the current > ones are defined here: > [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] > Moving from release 1.x to 1.(x+1) happens roughly every 100 > days:[https://gravitational.com/blog/kubernetes-release-cycle.] > This has an effect on dependencies upgrade at the Spark on K8s backend and > the version of Minikube required to be supported for testing. One other issue > is what the users actually want at the given time of a release. 
Some popular > vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap > for releases and may not catch up fast (what is our view on this). > Follow the comments for a recent discussion on the topic: > [https://github.com/apache/spark/pull/23814.] > Clearly we need a strategy for this. > A couple of options for the current state of things: > a) Support only the last two versions, but that leaves out a version that > still receives patches. > b) Support only the latest, which makes testing easier, but leaves out other > currently maintained version. > A good strategy will optimize at least the following: > 1) percentage of users satisfied at release time. > 2) how long it takes to support the latest K8s version > 3) testing requirements eg. minikube versions used >
[jira] [Updated] (SPARK-26973) Kubernetes version support strategy on test nodes / backend
[ https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-26973: Description: Kubernetes has a policy for supporting three minor releases and the current ones are defined here: [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] Moving from release 1.x to 1.(x+1) happens roughly every 100 days:[https://gravitational.com/blog/kubernetes-release-cycle.] This has an effect on dependencies upgrade at the Spark on K8s backend and the version of Minikube required to be supported for testing. One other issue is what the users actually want at the given time of a release. Some popular vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap for releases and may not catch up fast (what is our view for this). Follow the comments for a recent discussion on the topic: [https://github.com/apache/spark/pull/23814.] Clearly we need a strategy for this. A couple of options for the current state of things: a) Support only the last two versions, but that leaves out a version that still receives patches. b) Support only the latest, which makes testing easier, but leaves out other currently maintained version. A good strategy will optimize at least the following: 1) percentage of users satisfied at release time. 2) how long it takes to support the latest K8s version 3) testing requirements eg. minikube versions used was: Kubernetes has a policy for supporting three minor releases and the current ones are defined here: [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] Moving from release 1.x to 1.(x+1) happens roughly every 100 days:[https://gravitational.com/blog/kubernetes-release-cycle.] 
This has an effect on dependencies upgrade at the Spark on K8s backend and the version of Minikube required to be supported for testing. One other issue is what the users actually want at the given time of a release. Some popular vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap for releases and may not catch up fast. Follow the comments for a recent discussion on the topic: [https://github.com/apache/spark/pull/23814.] Clearly we need a strategy for this. A couple of options for the current state of things: a) Support only the last two versions, but that leaves out a version that still receives patches. b) Support only the latest, which makes testing easier, but leaves out other currently maintained version. A good strategy will optimize at least the following: 1) percentage of users satisfied at release time. 2) how long it takes to support the latest K8s version 3) testing requirements eg. minikube versions used > Kubernetes version support strategy on test nodes / backend > --- > > Key: SPARK-26973 > URL: https://issues.apache.org/jira/browse/SPARK-26973 > Project: Spark > Issue Type: Test > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Kubernetes has a policy for supporting three minor releases and the current > ones are defined here: > [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] > Moving from release 1.x to 1.(x+1) happens roughly every 100 > days:[https://gravitational.com/blog/kubernetes-release-cycle.] > This has an effect on dependencies upgrade at the Spark on K8s backend and > the version of Minikube required to be supported for testing. One other issue > is what the users actually want at the given time of a release. 
Some popular > vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap > for releases and may not catch up fast (what is our view for this). > Follow the comments for a recent discussion on the topic: > [https://github.com/apache/spark/pull/23814.] > Clearly we need a strategy for this. > A couple of options for the current state of things: > a) Support only the last two versions, but that leaves out a version that > still receives patches. > b) Support only the latest, which makes testing easier, but leaves out other > currently maintained version. > A good strategy will optimize at least the following: > 1) percentage of users satisfied at release time. > 2) how long it takes to support the latest K8s version > 3) testing requirements eg. minikube versions used >
[jira] [Updated] (SPARK-26973) Kubernetes version support strategy on test nodes / backend
[ https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-26973: Description: Kubernetes has a policy for supporting three minor releases and the current ones are defined here: [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] Moving from release 1.x to 1.(x+1) happens roughly every 100 days:[https://gravitational.com/blog/kubernetes-release-cycle.] This has an effect on dependencies upgrade at the Spark on K8s backend and the version of Minikube required to be supported for testing. One other issue is what the users actually want at the given time of a release. Some popular vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap for releases and may not catch up fast. Follow the comments for a recent discussion on the topic: [https://github.com/apache/spark/pull/23814.] Clearly we need a strategy for this. A couple of options for the current state of things: a) Support only the last two version, but that leaves out a version that still receives patches. b) Support only the latest, which makes testing easier, but leaves out other currently maintained version. A good strategy will optimize the following: 1) percentage of users satisfied at release time. 2) how long it takes to support the latest K8s version 3) testing requirements eg. minikube versions used was: Kubernetes has a policy for supporting three minor releases and the current ones are defined here: [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] Moving from release 1.x to 1.(x+1) happens roughly every 100 days:[https://gravitational.com/blog/kubernetes-release-cycle.] 
This has an effect on dependencies upgrade at the Spark on K8s backend and the version of Minikube required to be supported for testing. One other issue is what the users actually want at the given time of a release. Some popular vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap for releases and may not catch up fast. Follow the comments a recent discussion on the topic: [https://github.com/apache/spark/pull/23814.] Clearly we need a strategy for this. A couple of options for the current state of things: a) Support only the last two version, but that leaves out a version that still receives patches. b) Support only the latest, which makes testing easier, but leaves out other currently maintained version. A good strategy will optimize the following: 1) percentage of users satisfied at release time. 2) how long it takes to support the latest K8s version 3) testing requirements eg. minikube versions used > Kubernetes version support strategy on test nodes / backend > --- > > Key: SPARK-26973 > URL: https://issues.apache.org/jira/browse/SPARK-26973 > Project: Spark > Issue Type: Test > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Kubernetes has a policy for supporting three minor releases and the current > ones are defined here: > [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] > Moving from release 1.x to 1.(x+1) happens roughly every 100 > days:[https://gravitational.com/blog/kubernetes-release-cycle.] > This has an effect on dependencies upgrade at the Spark on K8s backend and > the version of Minikube required to be supported for testing. One other issue > is what the users actually want at the given time of a release. Some popular > vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap > for releases and may not catch up fast. 
> Follow the comments for a recent discussion on the topic: > [https://github.com/apache/spark/pull/23814.] > Clearly we need a strategy for this. > A couple of options for the current state of things: > a) Support only the last two version, but that leaves out a version that > still receives patches. > b) Support only the latest, which makes testing easier, but leaves out other > currently maintained version. > A good strategy will optimize the following: > 1) percentage of users satisfied at release time. > 2) how long it takes to support the latest K8s version > 3) testing requirements eg. minikube versions used >
[jira] [Updated] (SPARK-26973) Kubernetes version support strategy on test nodes / backend
[ https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-26973: Description: Kubernetes has a policy for supporting three minor releases and the current ones are defined here: [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] Moving from release 1.x to 1.(x+1) happens roughly every 100 days:[https://gravitational.com/blog/kubernetes-release-cycle.] This has an effect on dependencies upgrade at the Spark on K8s backend and the version of Minikube required to be supported for testing. One other issue is what the users actually want at the given time of a release. Some popular vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap for releases and may not catch up fast. Follow the comments for a recent discussion on the topic: [https://github.com/apache/spark/pull/23814.] Clearly we need a strategy for this. A couple of options for the current state of things: a) Support only the last two versions, but that leaves out a version that still receives patches. b) Support only the latest, which makes testing easier, but leaves out other currently maintained version. A good strategy will optimize at least the following: 1) percentage of users satisfied at release time. 2) how long it takes to support the latest K8s version 3) testing requirements eg. minikube versions used was: Kubernetes has a policy for supporting three minor releases and the current ones are defined here: [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] Moving from release 1.x to 1.(x+1) happens roughly every 100 days:[https://gravitational.com/blog/kubernetes-release-cycle.] 
This has an effect on dependencies upgrade at the Spark on K8s backend and the version of Minikube required to be supported for testing. One other issue is what the users actually want at the given time of a release. Some popular vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap for releases and may not catch up fast. Follow the comments for a recent discussion on the topic: [https://github.com/apache/spark/pull/23814.] Clearly we need a strategy for this. A couple of options for the current state of things: a) Support only the last two versions, but that leaves out a version that still receives patches. b) Support only the latest, which makes testing easier, but leaves out other currently maintained version. A good strategy will optimize the following: 1) percentage of users satisfied at release time. 2) how long it takes to support the latest K8s version 3) testing requirements eg. minikube versions used > Kubernetes version support strategy on test nodes / backend > --- > > Key: SPARK-26973 > URL: https://issues.apache.org/jira/browse/SPARK-26973 > Project: Spark > Issue Type: Test > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Kubernetes has a policy for supporting three minor releases and the current > ones are defined here: > [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] > Moving from release 1.x to 1.(x+1) happens roughly every 100 > days:[https://gravitational.com/blog/kubernetes-release-cycle.] > This has an effect on dependencies upgrade at the Spark on K8s backend and > the version of Minikube required to be supported for testing. One other issue > is what the users actually want at the given time of a release. Some popular > vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap > for releases and may not catch up fast. 
> Follow the comments for a recent discussion on the topic: > [https://github.com/apache/spark/pull/23814.] > Clearly we need a strategy for this. > A couple of options for the current state of things: > a) Support only the last two versions, but that leaves out a version that > still receives patches. > b) Support only the latest, which makes testing easier, but leaves out other > currently maintained version. > A good strategy will optimize at least the following: > 1) percentage of users satisfied at release time. > 2) how long it takes to support the latest K8s version > 3) testing requirements eg. minikube versions used >
[jira] [Updated] (SPARK-26973) Kubernetes version support strategy on test nodes / backend
[ https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-26973: Description: Kubernetes has a policy for supporting three minor releases and the current ones are defined here: [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] Moving from release 1.x to 1.(x+1) happens roughly every 100 days:[https://gravitational.com/blog/kubernetes-release-cycle.] This has an effect on dependencies upgrade at the Spark on K8s backend and the version of Minikube required to be supported for testing. One other issue is what the users actually want at the given time of a release. Some popular vendors like EKS([https://aws.amazon.com/eks/faqs/] have their own roadmap for releases. Follow the comments a recent discussion on the topic: [https://github.com/apache/spark/pull/23814.] Clearly we need a strategy for this. A couple of options for the current state of things: a) Support only the last two version, but that leaves out a version that still receives patches. b) Support only the latest, which makes testing easier, but leaves out other currently maintained version. A good strategy will optimize the following: 1) percentage of users satisfied at release time. 2) how long it takes to support the latest K8s version 3) testing requirements eg. minikube versions used was: Kubernetes has a policy for supporting three minor releases and the current are defined here: [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md].] Moving from release 1.x to 1.(x+1) happens roughly every 100 days:[https://gravitational.com/blog/kubernetes-release-cycle.] 
This has an effect on dependencies upgrade at the Spark on K8s backend and the version of Minikube required to be supported for testing. One other issue is what the users actually want at the given time of a release. Some popular vendors like EKS([https://aws.amazon.com/eks/faqs/] have their own roadmap for releases. Follow the comments a recent discussion on the topic: [https://github.com/apache/spark/pull/23814.] Clearly we need a strategy for this. A couple of options for the current state of things: a) Support only the last two version, but that leaves out a version that still receives patches. b) Support only the latest, which makes testing easier, but leaves out other currently maintained version. A good strategy will optimize the following: 1) percentage of users satisfied at release time. 2) how long it takes to support the latest K8s version 3) testing requirements eg. minikube versions used > Kubernetes version support strategy on test nodes / backend > --- > > Key: SPARK-26973 > URL: https://issues.apache.org/jira/browse/SPARK-26973 > Project: Spark > Issue Type: Test > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Kubernetes has a policy for supporting three minor releases and the current > ones are defined here: > [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] > Moving from release 1.x to 1.(x+1) happens roughly every 100 > days:[https://gravitational.com/blog/kubernetes-release-cycle.] > This has an effect on dependencies upgrade at the Spark on K8s backend and > the version of Minikube required to be supported for testing. One other issue > is what the users actually want at the given time of a release. Some popular > vendors like EKS([https://aws.amazon.com/eks/faqs/] have their own roadmap > for releases. 
> Follow the comments for a recent discussion on the topic: > [https://github.com/apache/spark/pull/23814] > Clearly we need a strategy for this. > A couple of options for the current state of things: > a) Support only the last two versions, but that leaves out a version that > still receives patches. > b) Support only the latest, which makes testing easier, but leaves out other > currently maintained versions. > A good strategy will optimize the following: > 1) percentage of users satisfied at release time. > 2) how long it takes to support the latest K8s version > 3) testing requirements, e.g. minikube versions used > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26973) Kubernetes version support strategy on test nodes / backend
[ https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-26973: Description: Kubernetes has a policy for supporting three minor releases and the current ones are defined here: [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] Moving from release 1.x to 1.(x+1) happens roughly every 100 days:[https://gravitational.com/blog/kubernetes-release-cycle.] This has an effect on dependencies upgrade at the Spark on K8s backend and the version of Minikube required to be supported for testing. One other issue is what the users actually want at the given time of a release. Some popular vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap for releases and may not catch up fast. Follow the comments for a recent discussion on the topic: [https://github.com/apache/spark/pull/23814.] Clearly we need a strategy for this. A couple of options for the current state of things: a) Support only the last two versions, but that leaves out a version that still receives patches. b) Support only the latest, which makes testing easier, but leaves out other currently maintained version. A good strategy will optimize the following: 1) percentage of users satisfied at release time. 2) how long it takes to support the latest K8s version 3) testing requirements eg. minikube versions used was: Kubernetes has a policy for supporting three minor releases and the current ones are defined here: [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] Moving from release 1.x to 1.(x+1) happens roughly every 100 days:[https://gravitational.com/blog/kubernetes-release-cycle.] 
This has an effect on dependencies upgrade at the Spark on K8s backend and the version of Minikube required to be supported for testing. One other issue is what the users actually want at the given time of a release. Some popular vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap for releases and may not catch up fast. Follow the comments for a recent discussion on the topic: [https://github.com/apache/spark/pull/23814.] Clearly we need a strategy for this. A couple of options for the current state of things: a) Support only the last two version, but that leaves out a version that still receives patches. b) Support only the latest, which makes testing easier, but leaves out other currently maintained version. A good strategy will optimize the following: 1) percentage of users satisfied at release time. 2) how long it takes to support the latest K8s version 3) testing requirements eg. minikube versions used > Kubernetes version support strategy on test nodes / backend > --- > > Key: SPARK-26973 > URL: https://issues.apache.org/jira/browse/SPARK-26973 > Project: Spark > Issue Type: Test > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Kubernetes has a policy for supporting three minor releases and the current > ones are defined here: > [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] > Moving from release 1.x to 1.(x+1) happens roughly every 100 > days:[https://gravitational.com/blog/kubernetes-release-cycle.] > This has an effect on dependencies upgrade at the Spark on K8s backend and > the version of Minikube required to be supported for testing. One other issue > is what the users actually want at the given time of a release. Some popular > vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap > for releases and may not catch up fast. 
> Follow the comments for a recent discussion on the topic: > [https://github.com/apache/spark/pull/23814] > Clearly we need a strategy for this. > A couple of options for the current state of things: > a) Support only the last two versions, but that leaves out a version that > still receives patches. > b) Support only the latest, which makes testing easier, but leaves out other > currently maintained versions. > A good strategy will optimize the following: > 1) percentage of users satisfied at release time. > 2) how long it takes to support the latest K8s version > 3) testing requirements, e.g. minikube versions used > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26973) Kubernetes version support strategy on test nodes / backend
[ https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-26973: Description: Kubernetes has a policy for supporting three minor releases and the current ones are defined here: [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] Moving from release 1.x to 1.(x+1) happens roughly every 100 days:[https://gravitational.com/blog/kubernetes-release-cycle.] This has an effect on dependencies upgrade at the Spark on K8s backend and the version of Minikube required to be supported for testing. One other issue is what the users actually want at the given time of a release. Some popular vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap for releases and may not catch up fast. Follow the comments a recent discussion on the topic: [https://github.com/apache/spark/pull/23814.] Clearly we need a strategy for this. A couple of options for the current state of things: a) Support only the last two version, but that leaves out a version that still receives patches. b) Support only the latest, which makes testing easier, but leaves out other currently maintained version. A good strategy will optimize the following: 1) percentage of users satisfied at release time. 2) how long it takes to support the latest K8s version 3) testing requirements eg. minikube versions used was: Kubernetes has a policy for supporting three minor releases and the current ones are defined here: [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] Moving from release 1.x to 1.(x+1) happens roughly every 100 days:[https://gravitational.com/blog/kubernetes-release-cycle.] 
This has an effect on dependencies upgrade at the Spark on K8s backend and the version of Minikube required to be supported for testing. One other issue is what the users actually want at the given time of a release. Some popular vendors like EKS([https://aws.amazon.com/eks/faqs/] have their own roadmap for releases. Follow the comments a recent discussion on the topic: [https://github.com/apache/spark/pull/23814.] Clearly we need a strategy for this. A couple of options for the current state of things: a) Support only the last two version, but that leaves out a version that still receives patches. b) Support only the latest, which makes testing easier, but leaves out other currently maintained version. A good strategy will optimize the following: 1) percentage of users satisfied at release time. 2) how long it takes to support the latest K8s version 3) testing requirements eg. minikube versions used > Kubernetes version support strategy on test nodes / backend > --- > > Key: SPARK-26973 > URL: https://issues.apache.org/jira/browse/SPARK-26973 > Project: Spark > Issue Type: Test > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Kubernetes has a policy for supporting three minor releases and the current > ones are defined here: > [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md.|https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] > Moving from release 1.x to 1.(x+1) happens roughly every 100 > days:[https://gravitational.com/blog/kubernetes-release-cycle.] > This has an effect on dependencies upgrade at the Spark on K8s backend and > the version of Minikube required to be supported for testing. One other issue > is what the users actually want at the given time of a release. Some popular > vendors like EKS([https://aws.amazon.com/eks/faqs/]) have their own roadmap > for releases and may not catch up fast. 
> Follow the comments for a recent discussion on the topic: > [https://github.com/apache/spark/pull/23814] > Clearly we need a strategy for this. > A couple of options for the current state of things: > a) Support only the last two versions, but that leaves out a version that > still receives patches. > b) Support only the latest, which makes testing easier, but leaves out other > currently maintained versions. > A good strategy will optimize the following: > 1) percentage of users satisfied at release time. > 2) how long it takes to support the latest K8s version > 3) testing requirements, e.g. minikube versions used > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26973) Kubernetes version support strategy on test nodes / backend
[ https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-26973: Summary: Kubernetes version support strategy on test nodes / backend (was: Kubernetes version support strategy on test nodes and for the backend) > Kubernetes version support strategy on test nodes / backend > --- > > Key: SPARK-26973 > URL: https://issues.apache.org/jira/browse/SPARK-26973 > Project: Spark > Issue Type: Test > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Kubernetes has a policy for supporting three minor releases and the current > ones are defined here: > [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] > Moving from release 1.x to 1.(x+1) happens roughly every 100 > days: [https://gravitational.com/blog/kubernetes-release-cycle] > This has an effect on dependency upgrades at the Spark on K8s backend and > the version of Minikube required to be supported for testing. One other issue > is what the users actually want at the given time of a release. Some popular > vendors like EKS ([https://aws.amazon.com/eks/faqs/]) have their own roadmap > for releases. > Follow the comments for a recent discussion on the topic: > [https://github.com/apache/spark/pull/23814] > Clearly we need a strategy for this. > A couple of options for the current state of things: > a) Support only the last two versions, but that leaves out a version that > still receives patches. > b) Support only the latest, which makes testing easier, but leaves out other > currently maintained versions. > A good strategy will optimize the following: > 1) percentage of users satisfied at release time. > 2) how long it takes to support the latest K8s version > 3) testing requirements, e.g. minikube versions used > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26973) Kubernetes version support strategy on test nodes and for the backend
Stavros Kontopoulos created SPARK-26973: --- Summary: Kubernetes version support strategy on test nodes and for the backend Key: SPARK-26973 URL: https://issues.apache.org/jira/browse/SPARK-26973 Project: Spark Issue Type: Test Components: Kubernetes Affects Versions: 3.0.0 Reporter: Stavros Kontopoulos Kubernetes has a policy for supporting three minor releases and the current ones are defined here: [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] Moving from release 1.x to 1.(x+1) happens roughly every 100 days: [https://gravitational.com/blog/kubernetes-release-cycle] This has an effect on dependency upgrades at the Spark on K8s backend and the version of Minikube required to be supported for testing. One other issue is what the users actually want at the given time of a release. Some popular vendors like EKS ([https://aws.amazon.com/eks/faqs/]) have their own roadmap for releases. Follow the comments for a recent discussion on the topic: [https://github.com/apache/spark/pull/23814] Clearly we need a strategy for this. A couple of options for the current state of things: a) Support only the last two versions, but that leaves out a version that still receives patches. b) Support only the latest, which makes testing easier, but leaves out other currently maintained versions. A good strategy will optimize the following: 1) percentage of users satisfied at release time. 2) how long it takes to support the latest K8s version 3) testing requirements, e.g. minikube versions used -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26973) Kubernetes version support strategy on test nodes and for the backend
[ https://issues.apache.org/jira/browse/SPARK-26973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775339#comment-16775339 ] Stavros Kontopoulos commented on SPARK-26973: - [~foxish] [~srowen] [~shaneknapp] [~vanzin] fyi. > Kubernetes version support strategy on test nodes and for the backend > - > > Key: SPARK-26973 > URL: https://issues.apache.org/jira/browse/SPARK-26973 > Project: Spark > Issue Type: Test > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Kubernetes has a policy for supporting three minor releases and the current > ones are defined here: > [https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md] > Moving from release 1.x to 1.(x+1) happens roughly every 100 > days: [https://gravitational.com/blog/kubernetes-release-cycle] > This has an effect on dependency upgrades at the Spark on K8s backend and > the version of Minikube required to be supported for testing. One other issue > is what the users actually want at the given time of a release. Some popular > vendors like EKS ([https://aws.amazon.com/eks/faqs/]) have their own roadmap > for releases. > Follow the comments for a recent discussion on the topic: > [https://github.com/apache/spark/pull/23814] > Clearly we need a strategy for this. > A couple of options for the current state of things: > a) Support only the last two versions, but that leaves out a version that > still receives patches. > b) Support only the latest, which makes testing easier, but leaves out other > currently maintained versions. > A good strategy will optimize the following: > 1) percentage of users satisfied at release time. > 2) how long it takes to support the latest K8s version > 3) testing requirements, e.g. minikube versions used > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20597) KafkaSourceProvider falls back on path as synonym for topic
[ https://issues.apache.org/jira/browse/SPARK-20597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775279#comment-16775279 ] Valeria Vasylieva commented on SPARK-20597: --- [~jlaskowski] I have added the PR for this issue, could you please look at it? Thank you. > KafkaSourceProvider falls back on path as synonym for topic > --- > > Key: SPARK-20597 > URL: https://issues.apache.org/jira/browse/SPARK-20597 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Trivial > Labels: starter > > # {{KafkaSourceProvider}} supports {{topic}} option that sets the Kafka topic > to save a DataFrame's rows to > # {{KafkaSourceProvider}} can use {{topic}} column to assign rows to Kafka > topics for writing > What seems a quite interesting option is to support {{start(path: String)}} > as the least precedence option in which {{path}} would designate the default > topic when no other options are used. > {code} > df.writeStream.format("kafka").start("topic") > {code} > See > http://apache-spark-developers-list.1001551.n3.nabble.com/KafkaSourceProvider-Why-topic-option-and-column-without-reverting-to-path-as-the-least-priority-td21458.html > for discussion -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
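The option precedence proposed for SPARK-20597 can be sketched as a small lookup function. This is a minimal, hypothetical sketch in Python (not Spark code, and `resolve_topic` is an invented name): it models only the explicit `topic` option winning over the `start(path)` fallback, and omits the per-row `topic` column, which would take highest precedence in the actual KafkaSourceProvider proposal.

```python
def resolve_topic(options, path=None):
    """Hypothetical sketch of the proposed precedence:
    an explicit 'topic' option wins; the start(path)
    argument is only used as a last-resort default."""
    if "topic" in options:
        return options["topic"]
    if path is not None:
        return path
    raise ValueError(
        "topic must be given via option('topic', ...) or start(path)")

# The explicit option overrides the path; the path is the fallback.
print(resolve_topic({"topic": "events"}, path="ignored"))  # events
print(resolve_topic({}, path="events"))                    # events
```

The point of making `path` the *least*-precedence source is that `df.writeStream.format("kafka").start("topic")` stays a pure convenience and never silently overrides an explicitly configured topic.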
[jira] [Updated] (SPARK-26972) Issue with CSV import and inferSchema set to true
[ https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean Georges Perrin updated SPARK-26972: Attachment: ComplexCsvToDataframeApp.java > Issue with CSV import and inferSchema set to true > - > > Key: SPARK-26972 > URL: https://issues.apache.org/jira/browse/SPARK-26972 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.3, 2.3.3, 2.4.0 > Environment: Java 8/Scala 2.11/MacOs >Reporter: Jean Georges Perrin >Priority: Major > Attachments: ComplexCsvToDataframeApp.java, > ComplexCsvToDataframeWithSchemaApp.java, issue.txt > > > > > Issue with CSV import and inferSchema set to true. > I found a few discrepencies while working with inferSchema set to true in CSV > ingestion. > Given the following CSV: > {{id;authorId;title;releaseDate;link}} > {{1;1;Fantastic Beasts and Where to Find Them: The Original > Screenplay;11/18/16;http://amzn.to/2kup94P}} > {{2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry > Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP}} > {{3;1;*The Tales of Beedle the Bard, Standard Edition (Harry > Potter)*;12/4/08;http://amzn.to/2kYezqr}} > {{4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition > (Harry Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n}} > {{5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the > Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT}} > {{6;2;*Development Tools in 2006: any Room for a 4GL-style Language? 
}} > {{An independent study by Jean Georges Perrin, IIUG Board > Member*;12/28/16;http://amzn.to/2vBxOe1}} > {{7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav}} > {{8;3;A Connecticut Yankee in King Arthur's > Court;6/17/17;http://amzn.to/2x1NuoD}} > {{10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA}} > {{11;4;Diderot Encyclopedia: The Complete Illustrations > 1762-1777;;http://amzn.to/2i2zo3I}} > {{12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ}} > {{13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW}} > {{14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk}} > {{15;7;Soft Skills: The software developer's life > manual;12/29/14;http://amzn.to/2zNnSyn}} > {{16;8;Of Mice and Men;;http://amzn.to/2zJjXoc}} > {{17;9;*Java 8 in Action: Lambdas; Streams; and functional-style > programming*;8/28/14;http://amzn.to/2isdqoL}} > {{18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY}} > {{19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG}} > {{20;14;*Fables choisies; mises en vers par M. de La > Fontaine*;9/1/1999;http://amzn.to/2yRH10W}} > {{21;15;Discourse on Method and Meditations on First > Philosophy;6/15/1999;http://amzn.to/2hwB8zc}} > {{22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo}} > {{23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo}} > And this code: > {{Dataset df = spark.read().format("csv")}} > {{ .option("header", "true")}} > {{ .option("multiline", true)}} > {{ .option("sep", ";")}} > {{ .option("quote", "*")}} > {{ .option("dateFormat", "M/d/y")}} > {{ .option("inferSchema", true)}} > {{ .load("data/books.csv");}} > {{df.show(7);}} > {{df.printSchema();}} > h1. 
In Spark v2.0.1 > {{Excerpt of the dataframe content:}} > {{+---+++---++}} > {{| id|authorId| title|releaseDate| link|}} > {{+---+++---++}} > {{| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|}} > {{| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|}} > {{| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|}} > {{| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|}} > {{| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|}} > {{| 6| 2|Development Tools...| 12/28/16|http://amzn.to/2v...|}} > {{| 7| 3|Adventures of Huc...| 5/26/94|http://amzn.to/2w...|}} > {{+---+++---++}} > {{only showing top 7 rows}}{{Dataframe's schema:}} > {{root}} > {{ |-- id: integer (nullable = true)}} > {{ |-- authorId: integer (nullable = true)}} > {{ |-- title: string (nullable = true)}} > {{ |-- releaseDate: string (nullable = true)}} > {{ |-- link: string (nullable = true)}} > *This is fine and the expected output*. > h1. Using Apache Spark v2.1.3 > Excerpt of the dataframe content: > {{++++---++}} > {{ | id|authorId| title|releaseDate| link|}} > {{ > ++++---++}} > {{ | 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|}} > {{ | 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|}} > {{ | 3| 1|The Tales of Beed...| 12/4/
[jira] [Created] (SPARK-26972) Issue with CSV import and inferSchema set to true
Jean Georges Perrin created SPARK-26972: --- Summary: Issue with CSV import and inferSchema set to true Key: SPARK-26972 URL: https://issues.apache.org/jira/browse/SPARK-26972 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.4.0, 2.3.3, 2.1.3 Environment: Java 8/Scala 2.11/MacOs Reporter: Jean Georges Perrin Issue with CSV import and inferSchema set to true. I found a few discrepencies while working with inferSchema set to true in CSV ingestion. Given the following CSV: {{id;authorId;title;releaseDate;link}} {{1;1;Fantastic Beasts and Where to Find Them: The Original Screenplay;11/18/16;http://amzn.to/2kup94P}} {{2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP}} {{3;1;*The Tales of Beedle the Bard, Standard Edition (Harry Potter)*;12/4/08;http://amzn.to/2kYezqr}} {{4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n}} {{5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT}} {{6;2;*Development Tools in 2006: any Room for a 4GL-style Language? 
}} {{An independent study by Jean Georges Perrin, IIUG Board Member*;12/28/16;http://amzn.to/2vBxOe1}} {{7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav}} {{8;3;A Connecticut Yankee in King Arthur's Court;6/17/17;http://amzn.to/2x1NuoD}} {{10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA}} {{11;4;Diderot Encyclopedia: The Complete Illustrations 1762-1777;;http://amzn.to/2i2zo3I}} {{12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ}} {{13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW}} {{14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk}} {{15;7;Soft Skills: The software developer's life manual;12/29/14;http://amzn.to/2zNnSyn}} {{16;8;Of Mice and Men;;http://amzn.to/2zJjXoc}} {{17;9;*Java 8 in Action: Lambdas; Streams; and functional-style programming*;8/28/14;http://amzn.to/2isdqoL}} {{18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY}} {{19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG}} {{20;14;*Fables choisies; mises en vers par M. de La Fontaine*;9/1/1999;http://amzn.to/2yRH10W}} {{21;15;Discourse on Method and Meditations on First Philosophy;6/15/1999;http://amzn.to/2hwB8zc}} {{22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo}} {{23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo}} And this code: {{Dataset df = spark.read().format("csv")}} {{ .option("header", "true")}} {{ .option("multiline", true)}} {{ .option("sep", ";")}} {{ .option("quote", "*")}} {{ .option("dateFormat", "M/d/y")}} {{ .option("inferSchema", true)}} {{ .load("data/books.csv");}} {{df.show(7);}} {{df.printSchema();}} h1. 
In Spark v2.0.1 {{Excerpt of the dataframe content:}} {{+---+++---++}} {{| id|authorId| title|releaseDate| link|}} {{+---+++---++}} {{| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|}} {{| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|}} {{| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|}} {{| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|}} {{| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|}} {{| 6| 2|Development Tools...| 12/28/16|http://amzn.to/2v...|}} {{| 7| 3|Adventures of Huc...| 5/26/94|http://amzn.to/2w...|}} {{+---+++---++}} {{only showing top 7 rows}}{{Dataframe's schema:}} {{root}} {{ |-- id: integer (nullable = true)}} {{ |-- authorId: integer (nullable = true)}} {{ |-- title: string (nullable = true)}} {{ |-- releaseDate: string (nullable = true)}} {{ |-- link: string (nullable = true)}} *This is fine and the expected output*. h1. Using Apache Spark v2.1.3 Excerpt of the dataframe content: {{++++---++}} {{ | id|authorId| title|releaseDate| link|}} {{ ++++---++}} {{ | 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|}} {{ | 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|}} {{ | 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|}} {{ | 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|}} {{ | 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|}} {{ | 6| 2|Development Tools...| null| null|}} {{ |An independent st...|12/28/16|http://amzn.to/2v...| null| null|}} {{ ++++---++}} {{ only showing top 7 rows}}{{Dataframe's schema:}} {{ root}} {{ |-- id: string (nullable = true)}} {{ |-- authorId: string (nullable = true)}} {{
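The discrepancy above hinges on record 6, whose `*`-quoted title spans two physical lines. As a point of comparison only (Python's stdlib `csv` module, not Spark's parser), a multiline-aware parser given the same delimiter and quote character keeps that record intact as a single row:

```python
import csv
import io

# Record 6 from the report: the '*'-quoted title spans two physical lines.
data = (
    "id;authorId;title\n"
    "6;2;*Development Tools in 2006: any Room for a 4GL-style Language?\n"
    "An independent study*\n"
)

rows = list(csv.reader(io.StringIO(data), delimiter=";", quotechar="*"))

# A multiline-aware parser yields two rows (the header plus record 6),
# preserving the embedded newline inside the quoted title field.
for row in rows:
    print(row)
```

This single-record behavior matches the Spark 2.0.1 output shown above, whereas the 2.1.3 output splits record 6 across two rows, suggesting the multiline quote handling changed between those versions.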
[jira] [Commented] (SPARK-26972) Issue with CSV import and inferSchema set to true
[ https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775218#comment-16775218 ] Jean Georges Perrin commented on SPARK-26972: - I added the code as attachments, Jira is breaking my formatting :( > Issue with CSV import and inferSchema set to true > - > > Key: SPARK-26972 > URL: https://issues.apache.org/jira/browse/SPARK-26972 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.3, 2.3.3, 2.4.0 > Environment: Java 8/Scala 2.11/MacOs >Reporter: Jean Georges Perrin >Priority: Major > Attachments: ComplexCsvToDataframeApp.java, > ComplexCsvToDataframeWithSchemaApp.java, books.csv, issue.txt, pom.xml > > > > > Issue with CSV import and inferSchema set to true. > I found a few discrepencies while working with inferSchema set to true in CSV > ingestion. > Given the following CSV: > {{id;authorId;title;releaseDate;link}} > {{1;1;Fantastic Beasts and Where to Find Them: The Original > Screenplay;11/18/16;http://amzn.to/2kup94P}} > {{2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry > Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP}} > {{3;1;*The Tales of Beedle the Bard, Standard Edition (Harry > Potter)*;12/4/08;http://amzn.to/2kYezqr}} > {{4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition > (Harry Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n}} > {{5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the > Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT}} > {{6;2;*Development Tools in 2006: any Room for a 4GL-style Language? 
}} > {{An independent study by Jean Georges Perrin, IIUG Board > Member*;12/28/16;http://amzn.to/2vBxOe1}} > {{7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav}} > {{8;3;A Connecticut Yankee in King Arthur's > Court;6/17/17;http://amzn.to/2x1NuoD}} > {{10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA}} > {{11;4;Diderot Encyclopedia: The Complete Illustrations > 1762-1777;;http://amzn.to/2i2zo3I}} > {{12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ}} > {{13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW}} > {{14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk}} > {{15;7;Soft Skills: The software developer's life > manual;12/29/14;http://amzn.to/2zNnSyn}} > {{16;8;Of Mice and Men;;http://amzn.to/2zJjXoc}} > {{17;9;*Java 8 in Action: Lambdas; Streams; and functional-style > programming*;8/28/14;http://amzn.to/2isdqoL}} > {{18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY}} > {{19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG}} > {{20;14;*Fables choisies; mises en vers par M. de La > Fontaine*;9/1/1999;http://amzn.to/2yRH10W}} > {{21;15;Discourse on Method and Meditations on First > Philosophy;6/15/1999;http://amzn.to/2hwB8zc}} > {{22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo}} > {{23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo}} > And this code: > {{Dataset df = spark.read().format("csv")}} > {{ .option("header", "true")}} > {{ .option("multiline", true)}} > {{ .option("sep", ";")}} > {{ .option("quote", "*")}} > {{ .option("dateFormat", "M/d/y")}} > {{ .option("inferSchema", true)}} > {{ .load("data/books.csv");}} > {{df.show(7);}} > {{df.printSchema();}} > h1. 
In Spark v2.0.1 > {{Excerpt of the dataframe content:}} > {{+---+++---++}} > {{| id|authorId| title|releaseDate| link|}} > {{+---+++---++}} > {{| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|}} > {{| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|}} > {{| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|}} > {{| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|}} > {{| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|}} > {{| 6| 2|Development Tools...| 12/28/16|http://amzn.to/2v...|}} > {{| 7| 3|Adventures of Huc...| 5/26/94|http://amzn.to/2w...|}} > {{+---+++---++}} > {{only showing top 7 rows}}{{Dataframe's schema:}} > {{root}} > {{ |-- id: integer (nullable = true)}} > {{ |-- authorId: integer (nullable = true)}} > {{ |-- title: string (nullable = true)}} > {{ |-- releaseDate: string (nullable = true)}} > {{ |-- link: string (nullable = true)}} > *This is fine and the expected output*. > h1. Using Apache Spark v2.1.3 > Excerpt of the dataframe content: > {{++++---++}} > {{ | id|authorId| title|releaseDate| link|}} > {{ > ++++---++}} > {{ | 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|}} > {{ |
[jira] [Updated] (SPARK-26972) Issue with CSV import and inferSchema set to true
[ https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean Georges Perrin updated SPARK-26972: Attachment: pom.xml
[jira] [Updated] (SPARK-26972) Issue with CSV import and inferSchema set to true
[ https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean Georges Perrin updated SPARK-26972: Attachment: books.csv
[jira] [Updated] (SPARK-26972) Issue with CSV import and inferSchema set to true
[ https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean Georges Perrin updated SPARK-26972: Attachment: issue.txt
[jira] [Updated] (SPARK-26972) Issue with CSV import and inferSchema set to true
[ https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean Georges Perrin updated SPARK-26972: Attachment: ComplexCsvToDataframeWithSchemaApp.java
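The separator/quote behavior that the reader options above request (sep=";", quote="*", multiline) can be checked outside Spark: Python's stdlib `csv` module accepts the same delimiter and quote character. A minimal Spark-free sketch, using two rows copied from the books.csv excerpt above:

```python
import csv
import io

# Two sample rows from books.csv: ';' separates fields and '*' quotes
# fields that themselves contain ';'.
sample = (
    "id;authorId;title;releaseDate;link\n"
    "1;1;Fantastic Beasts and Where to Find Them: The Original Screenplay;11/18/16;http://amzn.to/2kup94P\n"
    "5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT\n"
)

# Same delimiter/quote settings the Spark reader was given.
reader = csv.reader(io.StringIO(sample), delimiter=";", quotechar="*")
header = next(reader)
rows = list(reader)

print(header)        # ['id', 'authorId', 'title', 'releaseDate', 'link']
print(len(rows[1]))  # 5 -- the ';' characters inside the quoted title stay put
print(rows[1][2])
```

If the quoting is honored, every row parses into exactly five fields; if a parser ignores the quote character, the Informix row explodes into seven.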
[jira] [Commented] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple ti
[ https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775188#comment-16775188 ] Parth Gandhi commented on SPARK-25250: -- [~Ngone51] I understand that you had a proposal and we were actively discussing various solutions in PR #22806; however, I have been working on that PR tirelessly for a few months and there is still an ongoing discussion there. Is there any specific reason why you created your own PR for the same issue? WDYT [~irashid] [~cloud_fan]? > Race condition with tasks running when new attempt for same stage is created > leads to other task in the next attempt running on the same partition id > retry multiple times > -- > > Key: SPARK-25250 > URL: https://issues.apache.org/jira/browse/SPARK-25250 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.1 >Reporter: Parth Gandhi >Priority: Major > > We recently had a scenario where a race condition occurred when a task from > previous stage attempt just finished before new attempt for the same stage > was created due to fetch failure, so the new task created in the second > attempt on the same partition id was retrying multiple times due to > TaskCommitDenied Exception without realizing that the task in earlier attempt > was already successful. > For example, consider a task with partition id 9000 and index 9000 running in > stage 4.0. We see a fetch failure and thus, we spawn a new stage attempt 4.1. > Just within this timespan, the above task completes successfully, thus, > marking the partition id 9000 as complete for 4.0. However, as stage 4.1 has > not yet been created, the taskset info for that stage is not available to the > TaskScheduler so, naturally, the partition id 9000 has not been marked > completed for 4.1. Stage 4.1 now spawns task with index 2000 on the same > partition id 9000. 
This task fails due to CommitDeniedException and since, it > does not see the corresponding partition id as been marked successful, it > keeps retrying multiple times until the job finally succeeds. It doesn't > cause any job failures because the DAG scheduler is tracking the partitions > separate from the task set managers. > > Steps to Reproduce: > # Run any large job involving shuffle operation. > # When the ShuffleMap stage finishes and the ResultStage begins running, > cause this stage to throw a fetch failure exception(Try deleting certain > shuffle files on any host). > # Observe the task attempt numbers for the next stage attempt. Please note > that this issue is an intermittent one, so it might not happen all the time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
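The bookkeeping split described above (DAG scheduler tracking partitions globally, each task set manager tracking only its own attempt) can be modeled in a few lines of plain Python. The class and variable names below are invented for this sketch; they are not Spark's actual scheduler classes:

```python
# Illustrative model only: per-attempt success flags in each task set
# versus a single DAG-scheduler-level set of finished partitions.

class TaskSetManager:
    """Per-stage-attempt bookkeeping (one instance per attempt)."""
    def __init__(self, stage, attempt, partitions):
        self.stage = stage
        self.attempt = attempt
        self.successful = {p: False for p in partitions}

dag_completed = set()  # DAG-scheduler view of finished partitions for stage 4

# Stage 4, attempt 0 is running a task for partition 9000.
attempt_0 = TaskSetManager(stage=4, attempt=0, partitions=[9000])

# The task finishes just before the fetch failure spawns a new attempt.
attempt_0.successful[9000] = True
dag_completed.add(9000)

# Attempt 4.1 is created *after* that completion, so it starts blank and
# schedules its own task for partition 9000.
attempt_1 = TaskSetManager(stage=4, attempt=1, partitions=[9000])

# The DAG scheduler already knows the partition is done, but the new task
# set manager does not -- so its task keeps retrying on TaskCommitDenied
# until the stage as a whole is declared finished.
print(9000 in dag_completed)       # True
print(attempt_1.successful[9000])  # False
```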
[jira] [Created] (SPARK-26971) How to read delimiter (Cedilla) in spark RDD and Dataframes
Babu created SPARK-26971: Summary: How to read delimiter (Cedilla) in spark RDD and Dataframes Key: SPARK-26971 URL: https://issues.apache.org/jira/browse/SPARK-26971 Project: Spark Issue Type: Question Components: PySpark Affects Versions: 1.6.0 Reporter: Babu I am trying to read a cedilla-delimited HDFS text file. I am getting the below error; did anyone face a similar issue? {{hadoop fs -cat test_file.dat }} {{1ÇCelvelandÇOhio 2ÇDurhamÇNC 3ÇDallasÇTexas }} {{>>> rdd = sc.textFile("test_file.dat") }} {{>>> rdd.collect() [u'1\xc7Celveland\xc7Ohio', u'2\xc7Durham\xc7NC', u'3Dallas\xc7Texas'] }} {{>>> rdd.map(lambda p: p.split("\xc7")).collect() UnicodeDecodeError: 'ascii' codec can't decode byte 0xc7 in position 0: ordinal not in range(128) }} {{>>> sqlContext.read.format("text").option("delimiter","Ç").option("encoding","ISO-8859").load("/user/cloudera/test_file.dat").show() }} |1ÇCelvelandÇOhio| {{2ÇDurhamÇNC}} {{ 3DallasÇTexas}}
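The UnicodeDecodeError above is a Python 2 artifact: the RDD holds unicode strings, and splitting them on the byte string "\xc7" forces an implicit ASCII decode of that byte, which fails. Splitting on the unicode delimiter avoids it. A Spark-free sketch (written for Python 3, where the same idea applies; the file contents are simulated as raw bytes):

```python
# Simulated raw lines of the Ç-delimited file (Ç is 0xC7 in ISO-8859-1).
lines = [
    b"1\xc7Celveland\xc7Ohio",
    b"2\xc7Durham\xc7NC",
]

# Decode with the file's real encoding first -- never split raw bytes on a
# text delimiter. (In PySpark, sc.textFile with use_unicode=True does the
# decoding for UTF-8 input; other encodings need an explicit decode.)
decoded = [ln.decode("iso-8859-1") for ln in lines]

# Split on the *unicode* delimiter, not the byte string "\xc7".
rows = [ln.split(u"\u00c7") for ln in decoded]
print(rows[0])  # ['1', 'Celveland', 'Ohio']
```

In Python 2 the equivalent fix is `p.split(u"\xc7")`; the original `p.split("\xc7")` mixes a unicode string with a non-ASCII byte string, which is exactly what triggers the 'ascii' codec error shown above.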
[jira] [Commented] (SPARK-26945) Python streaming tests flaky while cleaning temp directories after StreamingQuery.stop
[ https://issues.apache.org/jira/browse/SPARK-26945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775125#comment-16775125 ] Alessandro Bellina commented on SPARK-26945: [~hyukjin.kwon] thanks for taking a look. Seems like q.processAllAvailable is designed for this use case. > Python streaming tests flaky while cleaning temp directories after > StreamingQuery.stop > -- > > Key: SPARK-26945 > URL: https://issues.apache.org/jira/browse/SPARK-26945 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Alessandro Bellina >Priority: Minor > > From the test code, it seems like the `shutil.rmtree` function is trying to > delete a directory, but there's likely another thread adding entries to the > directory, so when it gets to `os.rmdir(path)` it blows up. I think the test > (and other streaming tests) should call `q.awaitTermination` after `q.stop`, > before going on. I'll file a separate jira. > {noformat} > ERROR: test_query_manager_await_termination > (pyspark.sql.tests.test_streaming.StreamingTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests/test_streaming.py", > line 259, in test_query_manager_await_termination > shutil.rmtree(tmpPath) > File "/home/anaconda/lib/python2.7/shutil.py", line 256, in rmtree > onerror(os.rmdir, path, sys.exc_info()) > File "/home/anaconda/lib/python2.7/shutil.py", line 254, in rmtree > os.rmdir(path) > OSError: [Errno 39] Directory not empty: > '/home/jenkins/workspace/SparkPullRequestBuilder/python/target/072153bd-f981-47be-bda2-e2b657a16f65/tmp4WGp7n'{noformat}
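The race is easy to reproduce without Spark: a background thread, standing in for the still-running streaming query, keeps creating files while `shutil.rmtree` would run. The pure-stdlib sketch below (the writer thread is a stand-in, not Spark code) shows the shape of the fix discussed above: make the writer fully stop before cleaning up, analogous to waiting for the query to terminate after q.stop:

```python
import os
import shutil
import tempfile
import threading
import time

tmp = tempfile.mkdtemp()
stop = threading.Event()

def writer():
    # Keeps adding entries to tmp, like a streaming query still writing
    # after stop() returns. Calling rmtree concurrently with this loop can
    # fail with "Directory not empty".
    i = 0
    while not stop.is_set():
        try:
            with open(os.path.join(tmp, "part-%d" % i), "w") as f:
                f.write("data")
        except OSError:
            break  # the directory may already be gone mid-race
        i += 1
        time.sleep(0.001)

t = threading.Thread(target=writer)
t.start()
time.sleep(0.01)

# The fix: signal the writer and *join* it (the analogue of waiting for the
# query to fully terminate) before removing the directory.
stop.set()
t.join()
shutil.rmtree(tmp)          # safe now: nothing is adding entries

print(os.path.exists(tmp))  # False
```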
[jira] [Comment Edited] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775117#comment-16775117 ] Alessandro Bellina edited comment on SPARK-26944 at 2/22/19 1:07 PM: - Hmm, I have a subsequent build from the same PR, and I don't see a link to the python tests either. Maybe I am looking in the wrong place? [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102590/artifact/] was (Author: abellina): Hmm, I have a subsequent build from the same PR, and I don't see a link to the python tests either. Maybe I am looking in the wrong place? https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102590/artifact/ > Python unit-tests.log not available in artifacts for a build in Jenkins > --- > > Key: SPARK-26944 > URL: https://issues.apache.org/jira/browse/SPARK-26944 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Alessandro Bellina >Priority: Minor > > I had a pr where the python unit tests failed. The tests point at the > `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, > but I can't get to that from jenkins UI it seems (are all prs writing to the > same file?). > {code:java} > > Running PySpark tests > > Running PySpark tests. Output is in > /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log{code} > For reference, please see this build: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console > This Jira is to make it available under the artifacts for each build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775117#comment-16775117 ] Alessandro Bellina commented on SPARK-26944: Hmm, I have a subsequent build from the same PR, and I don't see a link to the python tests either. Maybe I am looking in the wrong place? https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102590/artifact/
[jira] [Updated] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer
[ https://issues.apache.org/jira/browse/SPARK-26970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Crosby updated SPARK-26970: -- Description: The Interaction transformer [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala] is missing from the set of pyspark feature transformers [https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py] This means that it is impossible to create a model that includes an Interaction transformer with pyspark. It also means that attempting to load a PipelineModel created in Scala that includes an Interaction transformer with pyspark fails with the following error: {code:java} AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction' {code} was: The Interaction transformer [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala)] is missing from the set of pyspark feature transformers [https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py)] This means that it is impossible to create a model that includes an Interaction transformer with pyspark. 
It also means that attempting to load a PipelineModel created in Scala that includes an Interaction transformer with pyspark fails with the following error: {code:java} AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction' {code}
[jira] [Created] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer
Andrew Crosby created SPARK-26970: - Summary: Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer Key: SPARK-26970 URL: https://issues.apache.org/jira/browse/SPARK-26970 Project: Spark Issue Type: Bug Components: ML, PySpark Affects Versions: 2.4.0 Reporter: Andrew Crosby The Interaction transformer (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala) is missing from the set of pyspark feature transformers (https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py). This means that it is impossible to create a model that includes an Interaction transformer with pyspark. It also means that attempting to load a PipelineModel created in Scala that includes an Interaction transformer with pyspark fails with the following error: {code:java} AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction' {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
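For context on what the missing transformer computes: Spark's Interaction transformer emits the products of all combinations of values taken across its input feature columns. A minimal, Spark-free sketch of that behavior (the `interact` helper is illustrative, not part of any Spark API):

```python
from itertools import product
from functools import reduce

def interact(*feature_vectors):
    """Interaction features: the product of every combination of values,
    one taken from each input vector, mirroring what Spark's Interaction
    transformer produces for numeric columns."""
    return [reduce(lambda a, b: a * b, combo) for combo in product(*feature_vectors)]

# Two columns: [2, 3] interacted with [4] yields [8, 12].
print(interact([2, 3], [4]))
```

Exposing this in pyspark would presumably be a thin wrapper over the existing Scala class, like the other `pyspark.ml.feature` transformers.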
[jira] [Commented] (SPARK-26969) [Spark] Using ODBC not able to see the data in table when datatype is decimal
[ https://issues.apache.org/jira/browse/SPARK-26969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774994#comment-16774994 ] Sujith commented on SPARK-26969: I will analyze the issue further and raise a PR if required. Thanks. > [Spark] Using ODBC not able to see the data in table when datatype is decimal > - > > Key: SPARK-26969 > URL: https://issues.apache.org/jira/browse/SPARK-26969 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.4.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > # Install ODBC using the odbc rpm file > # Connect to ODBC using isql -v spark2xsingle > # SQL> create table t1_t(id decimal(15,2)); > # SQL> insert into t1_t values(15); > # > SQL> select * from t1_t; > +-+ > | id | > +-+ > +-+ The actual output is empty. > Note: when creating a table with an int data type, select gives the result as below: > SQL> create table test_t1(id int); > SQL> insert into test_t1 values(10); > SQL> select * from test_t1; > ++ > | id | > ++ > | 10 | > ++ > The decimal case needs to be handled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24417) Build and Run Spark on JDK11
[ https://issues.apache.org/jira/browse/SPARK-24417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774954#comment-16774954 ] M. Le Bihan edited comment on SPARK-24417 at 2/22/19 9:58 AM: -- It is really troublesome to see Java 12 arriving in a few weeks while _Spark_, an otherwise impressive piece of technology, is still held to a JVM from 2014. I have three questions, please: 1) Which version of Spark will become compatible with Java 11: 2.4.1, 2.4.2 or 3.0.0? 2) If Java 11 compatibility is postponed to Spark 3.0.0, when is Spark 3.0.0 planned to be released? 3) Will Spark then become fully compatible with standard Java, or will it keep some kind of system-level programming that might leave it in jeopardy? In a word: will it suffer the same troubles when attempting to run with Java 12, 13, 14? Since the coming of Java 9, and now Java 11, with Java 12 at the door, 18 months have passed. Can we have a date when Java 11 (and Java 12) compatibility will be available, please?
> Build and Run Spark on JDK11 > > > Key: SPARK-24417 > URL: https://issues.apache.org/jira/browse/SPARK-24417 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 2.3.0 >Reporter: DB Tsai >Priority: Major > > This is an umbrella JIRA for Apache Spark to support JDK11 > As JDK8 is reaching EOL, and JDK9 and 10 are already end of life, per > community discussion, we will skip JDK9 and 10 to support JDK 11 directly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple ti
[ https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774957#comment-16774957 ] Apache Spark commented on SPARK-25250: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/23871 > Race condition with tasks running when new attempt for same stage is created > leads to other task in the next attempt running on the same partition id > retry multiple times > -- > > Key: SPARK-25250 > URL: https://issues.apache.org/jira/browse/SPARK-25250 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.1 >Reporter: Parth Gandhi >Priority: Major > > We recently had a scenario where a race condition occurred when a task from a > previous stage attempt finished just before a new attempt for the same stage > was created due to a fetch failure. The new task created in the second > attempt on the same partition id retried multiple times due to a > TaskCommitDenied exception, without realizing that the task in the earlier > attempt had already succeeded. > For example, consider a task with partition id 9000 and index 9000 running in > stage 4.0. We see a fetch failure and thus spawn a new stage attempt, 4.1. > Just within this timespan, the above task completes successfully, thus > marking partition id 9000 as complete for 4.0. However, as stage 4.1 has > not yet been created, the task set info for that stage is not available to the > TaskScheduler, so naturally partition id 9000 has not been marked > completed for 4.1. Stage 4.1 now spawns a task with index 2000 on the same > partition id 9000. This task fails due to CommitDeniedException and, since it > does not see the corresponding partition id marked successful, it > keeps retrying until the job finally succeeds.
It doesn't > cause any job failures because the DAG scheduler is tracking the partitions > separate from the task set managers. > > Steps to Reproduce: > # Run any large job involving shuffle operation. > # When the ShuffleMap stage finishes and the ResultStage begins running, > cause this stage to throw a fetch failure exception(Try deleting certain > shuffle files on any host). > # Observe the task attempt numbers for the next stage attempt. Please note > that this issue is an intermittent one, so it might not happen all the time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
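The fix direction implied above, tracking completed partitions at the stage level so a new attempt skips partitions already committed by an earlier attempt, can be sketched in miniature. All class and method names below are illustrative, not Spark's actual scheduler API:

```python
class Scheduler:
    """Toy sketch: completed partitions are recorded per *stage* and shared
    across attempts, so a later attempt never re-runs a committed partition."""

    def __init__(self):
        self.completed = {}  # stage_id -> set of finished partition ids

    def task_finished(self, stage_id, partition):
        # Called when any attempt's task commits successfully.
        self.completed.setdefault(stage_id, set()).add(partition)

    def new_attempt(self, stage_id, partitions):
        # A new attempt only schedules partitions not already committed.
        done = self.completed.get(stage_id, set())
        return [p for p in partitions if p not in done]

sched = Scheduler()
sched.task_finished(4, 9000)                     # task from attempt 4.0 commits 9000
todo = sched.new_attempt(4, [8999, 9000, 9001])  # attempt 4.1 after the fetch failure
print(todo)  # partition 9000 is skipped
```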
[jira] [Commented] (SPARK-24417) Build and Run Spark on JDK11
[ https://issues.apache.org/jira/browse/SPARK-24417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774954#comment-16774954 ] M. Le Bihan commented on SPARK-24417: - It is really troublesome to see Java 12 arriving in a few weeks while _Spark_, an otherwise impressive piece of technology, is still held to a JVM from 2014. I have three questions, please: 1) Which version of Spark will become compatible with Java 11: 2.4.1, 2.4.2 or 3.0.0? 2) If Java 11 compatibility is postponed to Spark 3.0.0, when is Spark 3.0.0 planned to be released? 3) Will Spark then become fully compatible with standard Java, or will it keep some kind of system-level programming that might leave it in jeopardy? In a word: will it suffer the same troubles when attempting to run with Java 12, 13, 14? Since the coming of Java 9, and now Java 11, with Java 12 at the door, 18 months have passed. Can we have a date when Java 11 (and Java 12) compatibility will be available, please? > Build and Run Spark on JDK11 > > > Key: SPARK-24417 > URL: https://issues.apache.org/jira/browse/SPARK-24417 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 2.3.0 >Reporter: DB Tsai >Priority: Major > > This is an umbrella JIRA for Apache Spark to support JDK11 > As JDK8 is reaching EOL, and JDK9 and 10 are already end of life, per > community discussion, we will skip JDK9 and 10 to support JDK 11 directly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26969) [Spark] Using ODBC not able to see the data in table when datatype is decimal
ABHISHEK KUMAR GUPTA created SPARK-26969: Summary: [Spark] Using ODBC not able to see the data in table when datatype is decimal Key: SPARK-26969 URL: https://issues.apache.org/jira/browse/SPARK-26969 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 2.4.0 Reporter: ABHISHEK KUMAR GUPTA # Install ODBC using the odbc rpm file # Connect to ODBC using isql -v spark2xsingle # SQL> create table t1_t(id decimal(15,2)); # SQL> insert into t1_t values(15); # SQL> select * from t1_t; +-+ | id | +-+ +-+ The actual output is empty. Note: when creating a table with an int data type, select gives the result as below: SQL> create table test_t1(id int); SQL> insert into test_t1 values(10); SQL> select * from test_t1; ++ | id | ++ | 10 | ++ The decimal case needs to be handled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26968) option("quoteMode", "NON_NUMERIC") has no effect on CSV generation
M. Le Bihan created SPARK-26968: --- Summary: option("quoteMode", "NON_NUMERIC") has no effect on CSV generation Key: SPARK-26968 URL: https://issues.apache.org/jira/browse/SPARK-26968 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: M. Le Bihan I have a CSV to write with the following schema: {code:java} StructType s = new StructType().add("codeCommuneCR", StringType, false); s = s.add("nomCommuneCR", StringType, false); s = s.add("populationCR", IntegerType, false); s = s.add("resultatComptable", IntegerType, false);{code} If I don't provide a "_quoteMode_" option, or even if I set it to {{NON_NUMERIC}} like this: {code:java} ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") .option("quoteMode", "NON_NUMERIC").option("quote", "\"") .csv("./target/out_200071470.csv");{code} the CSV written by {{Spark}} is this one: {code:java} codeCommuneCR,nomCommuneCR,populationCR,resultatComptable 03142,LENAX,267,43{code} If I set a "_quoteAll_" option instead, like this: {code:java} ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") .option("quoteAll", true).option("quote", "\"") .csv("./target/out_200071470.csv");{code} it generates: {code:java} "codeCommuneCR","nomCommuneCR","populationCR","resultatComptable" "03142","LENAX","267","43"{code} It seems that {{.option("quoteMode", "NON_NUMERIC")}} is broken. It should generate: {code:java} "codeCommuneCR","nomCommuneCR","populationCR","resultatComptable" "03142","LENAX",267,43 {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
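The behavior the reporter expects from NON_NUMERIC (quote string fields, leave numeric fields bare) is what Python's csv module calls QUOTE_NONNUMERIC; a small sketch of the expected output, outside Spark:

```python
import csv
import io

buf = io.StringIO()
# QUOTE_NONNUMERIC quotes every field that is not an int or float.
writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC)
writer.writerow(["codeCommuneCR", "nomCommuneCR", "populationCR", "resultatComptable"])
writer.writerow(["03142", "LENAX", 267, 43])  # strings quoted, ints left bare

print(buf.getvalue())
# "codeCommuneCR","nomCommuneCR","populationCR","resultatComptable"
# "03142","LENAX",267,43
```

Note the leading-zero commune code survives as a string precisely because it is quoted, which is the point of the NON_NUMERIC mode.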
[jira] [Assigned] (SPARK-26945) Python streaming tests flaky while cleaning temp directories after StreamingQuery.stop
[ https://issues.apache.org/jira/browse/SPARK-26945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26945: Assignee: Apache Spark > Python streaming tests flaky while cleaning temp directories after > StreamingQuery.stop > -- > > Key: SPARK-26945 > URL: https://issues.apache.org/jira/browse/SPARK-26945 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Alessandro Bellina >Assignee: Apache Spark >Priority: Minor > > From the test code, it seems like the `shutil.rmtree` function is trying to > delete a directory, but there's likely another thread adding entries to the > directory, so when it gets to `os.rmdir(path)` it blows up. I think the test > (and other streaming tests) should call `q.awaitTermination` after `q.stop`, > before going on. I'll file a separate jira. > {noformat} > ERROR: test_query_manager_await_termination > (pyspark.sql.tests.test_streaming.StreamingTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests/test_streaming.py", > line 259, in test_query_manager_await_termination > shutil.rmtree(tmpPath) > File "/home/anaconda/lib/python2.7/shutil.py", line 256, in rmtree > onerror(os.rmdir, path, sys.exc_info()) > File "/home/anaconda/lib/python2.7/shutil.py", line 254, in rmtree > os.rmdir(path) > OSError: [Errno 39] Directory not empty: > '/home/jenkins/workspace/SparkPullRequestBuilder/python/target/072153bd-f981-47be-bda2-e2b657a16f65/tmp4WGp7n'{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26945) Python streaming tests flaky while cleaning temp directories after StreamingQuery.stop
[ https://issues.apache.org/jira/browse/SPARK-26945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26945: Assignee: (was: Apache Spark) > Python streaming tests flaky while cleaning temp directories after > StreamingQuery.stop > -- > > Key: SPARK-26945 > URL: https://issues.apache.org/jira/browse/SPARK-26945 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Alessandro Bellina >Priority: Minor > > From the test code, it seems like the `shutil.rmtree` function is trying to > delete a directory, but there's likely another thread adding entries to the > directory, so when it gets to `os.rmdir(path)` it blows up. I think the test > (and other streaming tests) should call `q.awaitTermination` after `q.stop`, > before going on. I'll file a separate jira. > {noformat} > ERROR: test_query_manager_await_termination > (pyspark.sql.tests.test_streaming.StreamingTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests/test_streaming.py", > line 259, in test_query_manager_await_termination > shutil.rmtree(tmpPath) > File "/home/anaconda/lib/python2.7/shutil.py", line 256, in rmtree > onerror(os.rmdir, path, sys.exc_info()) > File "/home/anaconda/lib/python2.7/shutil.py", line 254, in rmtree > os.rmdir(path) > OSError: [Errno 39] Directory not empty: > '/home/jenkins/workspace/SparkPullRequestBuilder/python/target/072153bd-f981-47be-bda2-e2b657a16f65/tmp4WGp7n'{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
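The suggested fix, waiting for the query to fully terminate before removing its temp directory, boils down to never deleting a directory a background thread is still writing to. A Spark-free sketch of the race and its remedy (the writer thread stands in for a StreamingQuery that keeps creating files after stop() returns):

```python
import os
import shutil
import tempfile
import threading
import time

tmp = tempfile.mkdtemp()
stop = threading.Event()

def writer():
    # Keeps adding entries to the directory until told to stop,
    # like a streaming query that has not fully terminated yet.
    i = 0
    while not stop.is_set():
        open(os.path.join(tmp, "f%d" % i), "w").close()
        i += 1
        time.sleep(0.001)

t = threading.Thread(target=writer)
t.start()
time.sleep(0.01)

# The fix: signal the stop *and wait for termination* before cleanup,
# analogous to calling q.awaitTermination() after q.stop().
stop.set()
t.join()
shutil.rmtree(tmp)  # safe now: no concurrent writes can race os.rmdir
```

Skipping the `t.join()` is exactly the race in the traceback: `rmtree` can reach `os.rmdir` while a new file appears, raising "Directory not empty".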
[jira] [Assigned] (SPARK-26967) Put MetricsSystem instance names together for clearer management
[ https://issues.apache.org/jira/browse/SPARK-26967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26967: Assignee: (was: Apache Spark) > Put MetricsSystem instance names together for clearer management > > > Key: SPARK-26967 > URL: https://issues.apache.org/jira/browse/SPARK-26967 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: SongYadong >Priority: Minor > Original Estimate: 48h > Remaining Estimate: 48h > > MetricsSystem instance creations are scattered throughout the project code, > and so are their names. This can make browsing and management inconvenient. > If we put them together, we have a single location for adding or removing > them, and an overall view of the MetricsSystem instances in the project. > It also helps keep the user documentation complete. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26967) Put MetricsSystem instance names together for clearer management
[ https://issues.apache.org/jira/browse/SPARK-26967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26967: Assignee: Apache Spark > Put MetricsSystem instance names together for clearer management > > > Key: SPARK-26967 > URL: https://issues.apache.org/jira/browse/SPARK-26967 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: SongYadong >Assignee: Apache Spark >Priority: Minor > Original Estimate: 48h > Remaining Estimate: 48h > > MetricsSystem instance creations are scattered throughout the project code, > and so are their names. This can make browsing and management inconvenient. > If we put them together, we have a single location for adding or removing > them, and an overall view of the MetricsSystem instances in the project. > It also helps keep the user documentation complete. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26967) Put MetricsSystem instance names together for clearer management
SongYadong created SPARK-26967: -- Summary: Put MetricsSystem instance names together for clearer management Key: SPARK-26967 URL: https://issues.apache.org/jira/browse/SPARK-26967 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: SongYadong MetricsSystem instance creations are scattered throughout the project code, and so are their names. This can make browsing and management inconvenient. If we put them together, we have a single location for adding or removing them, and an overall view of the MetricsSystem instances in the project. It also helps keep the user documentation complete. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
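The proposed improvement amounts to a single registry that owns every MetricsSystem instance name, rather than string literals scattered across creation sites. An illustrative Python sketch (the instance names and helper below are examples, not Spark's actual list or API):

```python
class MetricsInstances:
    """Single place that owns every MetricsSystem instance name."""
    DRIVER = "driver"
    EXECUTOR = "executor"
    MASTER = "master"
    WORKER = "worker"

    @classmethod
    def all(cls):
        # All uppercase string class attributes are instance names.
        return [v for k, v in vars(cls).items()
                if k.isupper() and isinstance(v, str)]

def create_metrics_system(instance):
    # Creation sites validate against the registry instead of
    # passing free-form strings, so typos fail fast.
    if instance not in MetricsInstances.all():
        raise ValueError("unknown MetricsSystem instance: %s" % instance)
    return {"instance": instance}

print(create_metrics_system(MetricsInstances.DRIVER))
```

Centralizing the names also gives documentation a single list to enumerate, which is the maintenance benefit the ticket mentions.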
[jira] [Commented] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774892#comment-16774892 ] Hyukjin Kwon commented on SPARK-26944: -- Actually, it's usually possible to see it; it's included in the artifact of the built image, IIRC. > Python unit-tests.log not available in artifacts for a build in Jenkins > --- > > Key: SPARK-26944 > URL: https://issues.apache.org/jira/browse/SPARK-26944 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Alessandro Bellina >Priority: Minor > > I had a PR where the Python unit tests failed. The tests point at the > `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, > but it seems I can't get to that from the Jenkins UI (are all PRs writing to > the same file?). > {code:java} > > Running PySpark tests > > Running PySpark tests. Output is in > /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log{code} > For reference, please see this build: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console > This Jira is to make it available under the artifacts for each build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org