[jira] [Created] (SPARK-27406) UnsafeArrayData serialization breaks when two machines have different Oops size
peng bo created SPARK-27406:
--------------------------------

             Summary: UnsafeArrayData serialization breaks when two machines have different Oops size
                 Key: SPARK-27406
                 URL: https://issues.apache.org/jira/browse/SPARK-27406
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.1
            Reporter: peng bo

java.lang.NullPointerException
 at org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals$$anonfun$endpoints$1.apply(ApproxCountDistinctForIntervals.scala:69)
 at org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals$$anonfun$endpoints$1.apply(ApproxCountDistinctForIntervals.scala:69)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
 at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
 at org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.endpoints$lzycompute(ApproxCountDistinctForIntervals.scala:69)
 at org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.endpoints(ApproxCountDistinctForIntervals.scala:66)
 at org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$hllppArray$lzycompute(ApproxCountDistinctForIntervals.scala:94)
 at org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$hllppArray(ApproxCountDistinctForIntervals.scala:93)
 at org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$numWordsPerHllpp$lzycompute(ApproxCountDistinctForIntervals.scala:104)
 at org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$numWordsPerHllpp(ApproxCountDistinctForIntervals.scala:104)
 at org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.totalNumWords$lzycompute(ApproxCountDistinctForIntervals.scala:106)
 at org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.totalNumWords(ApproxCountDistinctForIntervals.scala:106)
 at org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.createAggregationBuffer(ApproxCountDistinctForIntervals.scala:110)
 at org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.createAggregationBuffer(ApproxCountDistinctForIntervals.scala:44)
 at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.initialize(interfaces.scala:528)
 at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator$$anonfun$initAggregationBuffer$2.apply(ObjectAggregationIterator.scala:120)
 at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator$$anonfun$initAggregationBuffer$2.apply(ObjectAggregationIterator.scala:120)
 at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
 at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.initAggregationBuffer(ObjectAggregationIterator.scala:120)
 at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.org$apache$spark$sql$execution$aggregate$ObjectAggregationIterator$$createNewAggregationBuffer(ObjectAggregationIterator.scala:112)
 at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.getAggregationBufferByKey(ObjectAggregationIterator.scala:128)
 at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:150)
 at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.<init>(ObjectAggregationIterator.scala:78)
 at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:114)
 at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:105)
 at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823)
 at ...
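For context on why differing Oops sizes break a flat binary format: on 64-bit HotSpot, the byte-array base offset is 16 with compressed oops and 24 without, so any absolute offset baked into bytes serialized on one JVM points to the wrong place on the other. Below is a minimal, illustrative Java sketch of that mismatch; the constants and names are ours, not Spark's actual UnsafeArrayData internals.

```java
public class OopsSizeMismatch {
    // Absolute position of element i, as a writer JVM would compute it.
    static long elementOffset(long arrayBaseOffset, int index, int elemSize) {
        return arrayBaseOffset + (long) index * elemSize;
    }

    public static void main(String[] args) {
        long writer = elementOffset(16L, 2, 8); // serialized with compressed oops
        long reader = elementOffset(24L, 2, 8); // read back on the other kind of JVM
        // The reader looks 8 bytes past where the writer put the element:
        System.out.println(reader - writer); // 8
    }
}
```

Offsets stored relative to the start of the payload, rather than to a JVM-specific base, avoid this class of bug.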
[jira] [Commented] (SPARK-27289) spark-submit explicit configuration does not take effect but Spark UI shows it's effective
[ https://issues.apache.org/jira/browse/SPARK-27289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812103#comment-16812103 ]

KaiXu commented on SPARK-27289:
-------------------------------

Did you check where the intermediate shuffle data was written after changing spark.local.dir? BTW, changes in spark-defaults.conf seem to require a restart to take effect.

> spark-submit explicit configuration does not take effect but Spark UI shows it's effective
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27289
>                 URL: https://issues.apache.org/jira/browse/SPARK-27289
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy, Documentation, Spark Submit, Web UI
>    Affects Versions: 2.3.3
>            Reporter: KaiXu
>            Priority: Minor
>         Attachments: Capture.PNG
>
> The [doc|https://spark.apache.org/docs/latest/submitting-applications.html] says that "In general, configuration values explicitly set on a {{SparkConf}} take the highest precedence, then flags passed to {{spark-submit}}, then values in the defaults file", but when setting spark.local.dir through --conf with spark-submit, it still uses the value from ${SPARK_HOME}/conf/spark-defaults.conf. What's more, the Spark runtime UI environment page shows the value from --conf, which is really misleading.
> e.g.
> I submit my application with the command:
> /opt/spark233/bin/spark-submit --properties-file /opt/spark.conf --conf spark.local.dir=/tmp/spark_local -v --class org.apache.spark.examples.mllib.SparseNaiveBayes --master spark://bdw-slave20:7077 /opt/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar hdfs://bdw-slave20:8020/Bayes/Input
>
> The spark.local.dir in ${SPARK_HOME}/conf/spark-defaults.conf is:
> spark.local.dir=/mnt/nvme1/spark_local
> When the application is running, I found the intermediate shuffle data was written to /mnt/nvme1/spark_local, which is set through ${SPARK_HOME}/conf/spark-defaults.conf, but the Web UI shows the environment value spark.local.dir=/tmp/spark_local.
> The spark-submit verbose output also shows spark.local.dir=/tmp/spark_local, which is misleading.
>
> !image-2019-03-27-10-59-38-377.png!
> spark-submit verbose:
> Spark properties used, including those specified through --conf and those from the properties file /opt/spark.conf:
> (spark.local.dir,/tmp/spark_local)
> (spark.default.parallelism,132)
> (spark.driver.memory,10g)
> (spark.executor.memory,352g)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
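The precedence the reporter quotes from the docs amounts to a three-layer merge where later layers override earlier ones. A minimal Java sketch of the documented order follows (hypothetical names, not spark-submit's actual resolution code); the bug report observes the defaults-file value winning for spark.local.dir instead.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfPrecedence {
    // Documented order: defaults file < --conf flags < explicit SparkConf.
    static Map<String, String> resolve(Map<String, String> sparkConf,
                                       Map<String, String> submitFlags,
                                       Map<String, String> defaultsFile) {
        Map<String, String> merged = new HashMap<>(defaultsFile);
        merged.putAll(submitFlags); // --conf overrides spark-defaults.conf
        merged.putAll(sparkConf);   // SparkConf overrides everything
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> resolved = resolve(
            Map.of(),                                            // nothing set on SparkConf
            Map.of("spark.local.dir", "/tmp/spark_local"),       // --conf flag
            Map.of("spark.local.dir", "/mnt/nvme1/spark_local")  // spark-defaults.conf
        );
        System.out.println(resolved.get("spark.local.dir")); // /tmp/spark_local
    }
}
```

Under these rules the --conf value should both be shown in the UI and actually be used; the report is that only the UI reflects it.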
[jira] [Updated] (SPARK-27403) Failed to update the table size automatically even though spark.sql.statistics.size.autoUpdate.enabled is set as true
[ https://issues.apache.org/jira/browse/SPARK-27403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sujith Chacko updated SPARK-27403: -- Affects Version/s: 2.3.0 2.3.1 2.3.2 2.3.3 2.4.0 > Failed to update the table size automatically even though > spark.sql.statistics.size.autoUpdate.enabled is set as rue > > > Key: SPARK-27403 > URL: https://issues.apache.org/jira/browse/SPARK-27403 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1 >Reporter: Sujith Chacko >Priority: Major > > system shall update the table stats automatically if user set > spark.sql.statistics.size.autoUpdate.enabled as true, currently this property > is not having any significance even if it is enabled or disabled. This > feature is similar to Hives auto-gather feature where statistics are > automatically computed by default if this feature is enabled. > Reference: > [https://cwiki.apache.org/confluence/display/Hive/StatsDev] > Reproducing steps: > scala> spark.sql("create table table1 (name string,age int) stored as > parquet") > scala> spark.sql("insert into table1 select 'a',29") > res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("desc extended table1").show(false) > > +---+---++--- > |col_name|data_type|comment| > +---+---++--- > |name|string|null| > |age|int|null| > | | | | > | # Detailed Table Information| | | > |Database|default| | > |Table|table1| | > |Owner|Administrator| | > |Created Time|Sun Apr 07 23:41:56 IST 2019| | > |Last Access|Thu Jan 01 05:30:00 IST 1970| | > |Created By|Spark 2.4.1| | > |Type|MANAGED| | > |Provider|hive| | > |Table Properties|[transient_lastDdlTime=1554660716]| | > |Location|file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1| | > |Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe| | > |InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat| | > 
|OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| > | > |Storage Properties|[serialization.format=1]| | > |Partition Provider|Catalog| | > +---+---++--- -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27403) Failed to update the table size automatically even though spark.sql.statistics.size.autoUpdate.enabled is set as true
[ https://issues.apache.org/jira/browse/SPARK-27403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812095#comment-16812095 ] Sujith Chacko edited comment on SPARK-27403 at 4/8/19 4:28 AM: --- This has impact in the previous version also, will update the JIRA was (Author: s71955): This has impact in the previous version also, will upate the JIRA > Failed to update the table size automatically even though > spark.sql.statistics.size.autoUpdate.enabled is set as rue > > > Key: SPARK-27403 > URL: https://issues.apache.org/jira/browse/SPARK-27403 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.1 >Reporter: Sujith Chacko >Priority: Major > > system shall update the table stats automatically if user set > spark.sql.statistics.size.autoUpdate.enabled as true, currently this property > is not having any significance even if it is enabled or disabled. This > feature is similar to Hives auto-gather feature where statistics are > automatically computed by default if this feature is enabled. 
> Reference: > [https://cwiki.apache.org/confluence/display/Hive/StatsDev] > Reproducing steps: > scala> spark.sql("create table table1 (name string,age int) stored as > parquet") > scala> spark.sql("insert into table1 select 'a',29") > res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("desc extended table1").show(false) > > +---+---++--- > |col_name|data_type|comment| > +---+---++--- > |name|string|null| > |age|int|null| > | | | | > | # Detailed Table Information| | | > |Database|default| | > |Table|table1| | > |Owner|Administrator| | > |Created Time|Sun Apr 07 23:41:56 IST 2019| | > |Last Access|Thu Jan 01 05:30:00 IST 1970| | > |Created By|Spark 2.4.1| | > |Type|MANAGED| | > |Provider|hive| | > |Table Properties|[transient_lastDdlTime=1554660716]| | > |Location|file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1| | > |Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe| | > |InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat| | > |OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| > | > |Storage Properties|[serialization.format=1]| | > |Partition Provider|Catalog| | > +---+---++--- -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27403) Failed to update the table size automatically even though spark.sql.statistics.size.autoUpdate.enabled is set as true
[ https://issues.apache.org/jira/browse/SPARK-27403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812095#comment-16812095 ]

Sujith Chacko commented on SPARK-27403:
---------------------------------------

This has impact in the previous versions also, will update the JIRA

> Failed to update the table size automatically even though spark.sql.statistics.size.autoUpdate.enabled is set as true
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27403
>                 URL: https://issues.apache.org/jira/browse/SPARK-27403
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.1
>            Reporter: Sujith Chacko
>            Priority: Major
>
> The system shall update the table stats automatically if the user sets spark.sql.statistics.size.autoUpdate.enabled to true; currently this property has no effect whether it is enabled or disabled. This feature is similar to Hive's auto-gather feature, where statistics are automatically computed by default if that feature is enabled.
> Reference: [https://cwiki.apache.org/confluence/display/Hive/StatsDev]
> Reproducing steps:
> scala> spark.sql("create table table1 (name string,age int) stored as parquet")
> scala> spark.sql("insert into table1 select 'a',29")
> res2: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("desc extended table1").show(false)
> +---+---+---+
> |col_name|data_type|comment|
> +---+---+---+
> |name|string|null|
> |age|int|null|
> | | | |
> | # Detailed Table Information| | |
> |Database|default| |
> |Table|table1| |
> |Owner|Administrator| |
> |Created Time|Sun Apr 07 23:41:56 IST 2019| |
> |Last Access|Thu Jan 01 05:30:00 IST 1970| |
> |Created By|Spark 2.4.1| |
> |Type|MANAGED| |
> |Provider|hive| |
> |Table Properties|[transient_lastDdlTime=1554660716]| |
> |Location|file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1| |
> |Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe| |
> |InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat| |
> |OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
> |Storage Properties|[serialization.format=1]| |
> |Partition Provider|Catalog| |
> +---+---+---+

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25348) Data source for binary files
[ https://issues.apache.org/jira/browse/SPARK-25348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812090#comment-16812090 ] Weichen Xu commented on SPARK-25348: I am working on this. :) > Data source for binary files > > > Key: SPARK-25348 > URL: https://issues.apache.org/jira/browse/SPARK-25348 > Project: Spark > Issue Type: Story > Components: ML, SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > It would be useful to have a data source implementation for binary files, > which can be used to build features to load images, audio, and videos. > Microsoft has an implementation at > [https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be > great if we can merge it into Spark main repo. > cc: [~mhamilton] and [~imatiach] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23831) Add org.apache.derby to IsolatedClientLoader
[ https://issues.apache.org/jira/browse/SPARK-23831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-23831:
--------------------------------
    Description: 
Add org.apache.derby to IsolatedClientLoader, otherwise it may throw an exception:
{noformat}
[info] Cause: java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@2439ab23, see the next exception for details.
[info] at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
[info] at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
[info] at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
[info] at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown Source)
[info] at org.apache.derby.impl.jdbc.EmbedConnection.<init>(Unknown Source)
[info] at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
{noformat}
How to reproduce:
{noformat}
build/sbt clean package -Phive -Phive-thriftserver
export SPARK_PREPEND_CLASSES=true
bin/spark-sql --conf spark.sql.hive.metastore.version=2.3.4 --conf spark.sql.hive.metastore.jars=maven -e "create table t1 as select 1 as c"
{noformat}

  was:
Add org.apache.derby to IsolatedClientLoader, otherwise it may throw an exception:
{noformat}
[info] Cause: java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@2439ab23, see the next exception for details.
[info] at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
[info] at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
[info] at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
[info] at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown Source)
[info] at org.apache.derby.impl.jdbc.EmbedConnection.<init>(Unknown Source)
[info] at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
{noformat}
How to reproduce:
{noformat}
sed 's/HiveExternalCatalogSuite/HiveExternalCatalog2Suite/g' sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala > sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalog2Suite.scala
build/sbt -Phive "hive/test-only *.HiveExternalCatalogSuite *.HiveExternalCatalog2Suite"
{noformat}

> Add org.apache.derby to IsolatedClientLoader
> --------------------------------------------
>
>                 Key: SPARK-23831
>                 URL: https://issues.apache.org/jira/browse/SPARK-23831
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> Add org.apache.derby to IsolatedClientLoader, otherwise it may throw an exception:
> {noformat}
> [info] Cause: java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@2439ab23, see the next exception for details.
> [info] at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> [info] at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> [info] at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
> [info] at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown Source)
> [info] at org.apache.derby.impl.jdbc.EmbedConnection.<init>(Unknown Source)
> [info] at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
> {noformat}
> How to reproduce:
> {noformat}
> build/sbt clean package -Phive -Phive-thriftserver
> export SPARK_PREPEND_CLASSES=true
> bin/spark-sql --conf spark.sql.hive.metastore.version=2.3.4 --conf spark.sql.hive.metastore.jars=maven -e "create table t1 as select 1 as c"
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
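The gist of adding org.apache.derby to IsolatedClientLoader's shared list: classes under "shared" package prefixes must be resolved by one loader, so the isolated Hive client and the rest of Spark see the same Derby classes instead of booting the embedded database through two loaders. The following is a hedged, conceptual Java sketch of that delegation pattern, not Spark's actual IsolatedClientLoader code.

```java
public class IsolatingLoader extends ClassLoader {
    private final ClassLoader realParent;
    private final String[] sharedPrefixes;

    public IsolatingLoader(ClassLoader realParent, String... sharedPrefixes) {
        super(null); // bootstrap as the formal parent: isolated by default
        this.realParent = realParent;
        this.sharedPrefixes = sharedPrefixes;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        for (String prefix : sharedPrefixes) {
            if (name.startsWith(prefix)) {
                return realParent.loadClass(name); // shared: one copy for everyone
            }
        }
        return super.loadClass(name, resolve); // otherwise: isolated lookup only
    }

    // True if `name` resolves to the same Class object as the real parent's copy.
    static boolean shares(String name, String prefix) {
        try {
            IsolatingLoader l = new IsolatingLoader(IsolatingLoader.class.getClassLoader(), prefix);
            return l.loadClass(name) == Class.forName(name);
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    // True if the loader cannot see a class outside its shared prefixes.
    static boolean isolates(String name) {
        try {
            new IsolatingLoader(IsolatingLoader.class.getClassLoader()).loadClass(name);
            return false;
        } catch (ClassNotFoundException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(shares("java.util.ArrayList", "java.")); // true
        System.out.println(isolates("IsolatingLoader"));            // true
    }
}
```

The fix described by the ticket corresponds to putting "org.apache.derby" on the shared-prefix list so Derby is never loaded twice.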
[jira] [Created] (SPARK-27405) Restrict the range of generated random timestamps
Maxim Gekk created SPARK-27405:
--------------------------------

             Summary: Restrict the range of generated random timestamps
                 Key: SPARK-27405
                 URL: https://issues.apache.org/jira/browse/SPARK-27405
             Project: Spark
          Issue Type: Improvement
          Components: Tests
    Affects Versions: 2.4.0
            Reporter: Maxim Gekk

The timestampLiteralGen of LiteralGenerator can produce instances of java.sql.Timestamp that cause Long arithmetic overflow when converting milliseconds to microseconds. The conversion is performed because Catalyst's Timestamp type internally stores microseconds since the epoch. This ticket aims to restrict the range of generated random timestamps to [Long.MinValue / 1000, Long.MaxValue / 1000].

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
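The arithmetic behind the proposed restriction: microseconds = milliseconds × 1000, so any millisecond value whose magnitude exceeds Long.MaxValue / 1000 overflows a long. A small Java sketch of the clamping idea follows (illustrative names; not the LiteralGenerator code).

```java
public class TimestampGenRange {
    static final long MIN_MILLIS = Long.MIN_VALUE / 1000;
    static final long MAX_MILLIS = Long.MAX_VALUE / 1000;

    // Math.multiplyExact throws ArithmeticException instead of silently wrapping.
    static long millisToMicrosExact(long millis) {
        return Math.multiplyExact(millis, 1000L);
    }

    // Restrict a generated value to the overflow-safe range.
    static long clamp(long millis) {
        return Math.max(MIN_MILLIS, Math.min(MAX_MILLIS, millis));
    }

    public static void main(String[] args) {
        // The clamped extreme converts safely...
        System.out.println(millisToMicrosExact(clamp(Long.MAX_VALUE)));
        // ...while the unclamped extreme overflows:
        try {
            millisToMicrosExact(Long.MAX_VALUE);
        } catch (ArithmeticException e) {
            System.out.println("overflow without clamping");
        }
    }
}
```

Generating within [MIN_MILLIS, MAX_MILLIS] rather than clamping afterwards achieves the same safety while keeping the distribution uniform.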
[jira] [Commented] (SPARK-27403) Failed to update the table size automatically even though spark.sql.statistics.size.autoUpdate.enabled is set as true
[ https://issues.apache.org/jira/browse/SPARK-27403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811959#comment-16811959 ] Dongjoon Hyun commented on SPARK-27403: --- Hi, [~S71955]. Could you check Spark 2.3.x behavior and update the affected versions? The existing code lands at 2.3.0 by SPARK-21237 . > Failed to update the table size automatically even though > spark.sql.statistics.size.autoUpdate.enabled is set as rue > > > Key: SPARK-27403 > URL: https://issues.apache.org/jira/browse/SPARK-27403 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.1 >Reporter: Sujith Chacko >Priority: Major > > system shall update the table stats automatically if user set > spark.sql.statistics.size.autoUpdate.enabled as true, currently this property > is not having any significance even if it is enabled or disabled. This > feature is similar to Hives auto-gather feature where statistics are > automatically computed by default if this feature is enabled. 
> Reference: > [https://cwiki.apache.org/confluence/display/Hive/StatsDev] > Reproducing steps: > scala> spark.sql("create table table1 (name string,age int) stored as > parquet") > scala> spark.sql("insert into table1 select 'a',29") > res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("desc extended table1").show(false) > > +---+---++--- > |col_name|data_type|comment| > +---+---++--- > |name|string|null| > |age|int|null| > | | | | > | # Detailed Table Information| | | > |Database|default| | > |Table|table1| | > |Owner|Administrator| | > |Created Time|Sun Apr 07 23:41:56 IST 2019| | > |Last Access|Thu Jan 01 05:30:00 IST 1970| | > |Created By|Spark 2.4.1| | > |Type|MANAGED| | > |Provider|hive| | > |Table Properties|[transient_lastDdlTime=1554660716]| | > |Location|file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1| | > |Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe| | > |InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat| | > |OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| > | > |Storage Properties|[serialization.format=1]| | > |Partition Provider|Catalog| | > +---+---++--- -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27404) Fix build warnings for 3.0: postfixOps edition
Sean Owen created SPARK-27404:
-------------------------------

             Summary: Fix build warnings for 3.0: postfixOps edition
                 Key: SPARK-27404
                 URL: https://issues.apache.org/jira/browse/SPARK-27404
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core, SQL, Structured Streaming, YARN
    Affects Versions: 3.0.0
            Reporter: Sean Owen
            Assignee: Sean Owen

I'd like to fix various build warnings showing in the build right now -- see the upcoming PR for details, as they are varied and small.

However, while fixing warnings about use of postfix notation (i.e. "foo bar" instead of "foo.bar"), I'd like to just remove use of postfix entirely to standardize. Postfix operators aren't deprecated exactly, but seem to be frowned upon as usually adding more confusion than clarity (https://contributors.scala-lang.org/t/lets-drop-postfix-operators/1457), and they have to be enabled by importing scala.language.postfixOps to avoid warnings.

I find that use of scalatest postfix syntax doesn't cause warnings, and that's normal usage for scalatest, so will leave that. "0 until n" syntax also doesn't trigger the warnings, it seems. But things like "10 seconds" do, and can be "10.seconds". Part of the reason I went ahead in changing that is that we have many instances of things like "120 seconds" in the code, which are simpler as "2.minutes" anyway.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27403) Failed to update the table size automatically even though spark.sql.statistics.size.autoUpdate.enabled is set as true
[ https://issues.apache.org/jira/browse/SPARK-27403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811940#comment-16811940 ] Sujith Chacko edited comment on SPARK-27403 at 4/7/19 6:33 PM: --- if user set spark.sql.statistics.size.autoUpdate.enabled as true, then system shall calculate the table size and record the same in metastore, On describe command the statistics shall be displayed I will analyze further and raise a PR for handling the issue. please let me know for any suggestions. thanks was (Author: s71955): I will analyze further and raise a PR for handling the issue. please let me know for any suggestions. thanks > Failed to update the table size automatically even though > spark.sql.statistics.size.autoUpdate.enabled is set as rue > > > Key: SPARK-27403 > URL: https://issues.apache.org/jira/browse/SPARK-27403 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.1 >Reporter: Sujith Chacko >Priority: Major > > system shall update the table stats automatically if user set > spark.sql.statistics.size.autoUpdate.enabled as true, currently this property > is not having any significance even if it is enabled or disabled. This > feature is similar to Hives auto-gather feature where statistics are > automatically computed by default if this feature is enabled. 
> Reference: > [https://cwiki.apache.org/confluence/display/Hive/StatsDev] > Reproducing steps: > scala> spark.sql("create table table1 (name string,age int) stored as > parquet") > scala> spark.sql("insert into table1 select 'a',29") > res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("desc extended table1").show(false) > > +---+---++--- > |col_name|data_type|comment| > +---+---++--- > |name|string|null| > |age|int|null| > | | | | > | # Detailed Table Information| | | > |Database|default| | > |Table|table1| | > |Owner|Administrator| | > |Created Time|Sun Apr 07 23:41:56 IST 2019| | > |Last Access|Thu Jan 01 05:30:00 IST 1970| | > |Created By|Spark 2.4.1| | > |Type|MANAGED| | > |Provider|hive| | > |Table Properties|[transient_lastDdlTime=1554660716]| | > |Location|file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1| | > |Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe| | > |InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat| | > |OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| > | > |Storage Properties|[serialization.format=1]| | > |Partition Provider|Catalog| | > +---+---++--- -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
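The intended behavior Sujith describes above can be modeled as a conditional stats refresh after each write; the report is that the real flag currently changes nothing. Below is a plain-Java sketch of the intent only, with hypothetical class and method names, not Spark's catalog code.

```java
import java.util.HashMap;
import java.util.Map;

public class AutoUpdateStats {
    private final boolean autoUpdate; // models spark.sql.statistics.size.autoUpdate.enabled
    private final Map<String, Long> sizeInBytes = new HashMap<>();

    public AutoUpdateStats(boolean autoUpdate) {
        this.autoUpdate = autoUpdate;
    }

    // Called after a successful INSERT/LOAD into `table`.
    public void afterWrite(String table, long newSize) {
        if (autoUpdate) {
            sizeInBytes.put(table, newSize); // expected: size stats refreshed
        }
        // flag off: stats intentionally left stale
    }

    public Long statsOf(String table) {
        return sizeInBytes.get(table);
    }

    public static void main(String[] args) {
        AutoUpdateStats on = new AutoUpdateStats(true);
        on.afterWrite("table1", 1024L);
        System.out.println(on.statsOf("table1")); // 1024

        AutoUpdateStats off = new AutoUpdateStats(false);
        off.afterWrite("table1", 1024L);
        System.out.println(off.statsOf("table1")); // null
    }
}
```

With behavior like this, `desc extended table1` would show a Statistics row after the insert when the flag is on; the bug is that no such row appears either way.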
[jira] [Updated] (SPARK-27403) Failed to update the table size automatically even though spark.sql.statistics.size.autoUpdate.enabled is set as true
[ https://issues.apache.org/jira/browse/SPARK-27403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sujith Chacko updated SPARK-27403: -- Description: system shall update the table stats automatically if user set spark.sql.statistics.size.autoUpdate.enabled as true, currently this property is not having any significance even if it is enabled or disabled. This feature is similar to Hives auto-gather feature where statistics are automatically computed by default if this feature is enabled. Reference: [https://cwiki.apache.org/confluence/display/Hive/StatsDev] Reproducing steps: scala> spark.sql("create table table1 (name string,age int) stored as parquet")scala> spark.sql("insert into table1 select 'a',29") res2: org.apache.spark.sql.DataFrame = [] scala> spark.sql("desc extended table1").show(false) +--+++--- |col_name|data_type|comment| +--+++--- |name|string|null| |age|int|null| | | | | | # Detailed Table Information| | | |Database|default| | |Table|table1| | |Owner|Administrator| | |Created Time|Sun Apr 07 23:41:56 IST 2019| | |Last Access|Thu Jan 01 05:30:00 IST 1970| | |Created By|Spark 2.4.1| | |Type|MANAGED| | |Provider|hive| | |Table Properties|[transient_lastDdlTime=1554660716]| | |Location|file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1| | |Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe| | |InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat| | |OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| | |Storage Properties|[serialization.format=1]| | |Partition Provider|Catalog| | +--+++--- was: system shall update the table stats automatiaclly if user set spark.sql.statistics.size.autoUpdate.enabled as true, currently this property is not having any significance even if it is anabled or disabled. This feature is similar to Hives auto-gather feature where statistics are automatically computed by default if this feature is enabled. 
Reference: [https://cwiki.apache.org/confluence/display/Hive/StatsDev] Reproducing steps: scala> spark.sql("create table table1 (name string,age int) stored as parquet")scala> spark.sql("insert into table1 select 'a',29") res2: org.apache.spark.sql.DataFrame = [] scala> spark.sql("desc extended table1").show(false) +-+-++--- |col_name|data_type|comment| +-+-++--- |name|string|null| |age|int|null| | | | | | # Detailed Table Information| | | |Database|default| | |Table|table1| | |Owner|Administrator| | |Created Time|Sun Apr 07 23:41:56 IST 2019| | |Last Access|Thu Jan 01 05:30:00 IST 1970| | |Created By|Spark 2.4.1| | |Type|MANAGED| | |Provider|hive| | |Table Properties|[transient_lastDdlTime=1554660716]| | |Location|file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1| | |Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe| | |InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat| | |OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| | |Storage Properties|[serialization.format=1]| | |Partition Provider|Catalog| | +-+-++--- > Failed to update the table size automatically even though > spark.sql.statistics.size.autoUpdate.enabled is set as rue > > > Key: SPARK-27403 > URL: https://issues.apache.org/jira/browse/SPARK-27403 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.1 >Reporter: Sujith Chacko >Priority: Major > > system shall update the table stats automatically if user set > spark.sql.statistics.size.autoUpdate.enabled as true, currently this property > is not having any significance even if it is enabled or disabled. This > feature is similar to Hives auto-gather feature where statistics are > automatically computed by default if this feature is enabled. 
> Reference: > [https://cwiki.apache.org/confluence/display/Hive/StatsDev] > Reproducing steps: > scala> spark.sql("create table table1 (name string,age int) stored as > parquet")scala> spark.sql("insert into table1 select 'a',29") > res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("desc extended table1").show(false) > >
[jira] [Updated] (SPARK-27403) Failed to update the table size automatically even though spark.sql.statistics.size.autoUpdate.enabled is set as true
[ https://issues.apache.org/jira/browse/SPARK-27403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sujith Chacko updated SPARK-27403:
----------------------------------
Description:
The system shall update the table stats automatically if the user sets spark.sql.statistics.size.autoUpdate.enabled to true; currently this property has no effect whether it is enabled or disabled. This feature is similar to Hive's auto-gather feature, where statistics are automatically computed by default when the feature is enabled.
Reference: [https://cwiki.apache.org/confluence/display/Hive/StatsDev]
Reproducing steps:
scala> spark.sql("create table table1 (name string,age int) stored as parquet")
scala> spark.sql("insert into table1 select 'a',29")
res2: org.apache.spark.sql.DataFrame = []
scala> spark.sql("desc extended table1").show(false)
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|name|string|null|
|age|int|null|
| | | |
|# Detailed Table Information| | |
|Database|default| |
|Table|table1| |
|Owner|Administrator| |
|Created Time|Sun Apr 07 23:41:56 IST 2019| |
|Last Access|Thu Jan 01 05:30:00 IST 1970| |
|Created By|Spark 2.4.1| |
|Type|MANAGED| |
|Provider|hive| |
|Table Properties|[transient_lastDdlTime=1554660716]| |
|Location|file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1| |
|Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe| |
|InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat| |
|OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
|Storage Properties|[serialization.format=1]| |
|Partition Provider|Catalog| |
+--------+---------+-------+

was:
The system shall update the table stats automatically if the user sets spark.sql.statistics.size.autoUpdate.enabled to true; currently this property has no effect whether it is enabled or disabled. This feature is similar to Hive's auto-gather feature, where statistics are automatically computed by default when the feature is enabled.
Reference: [https://cwiki.apache.org/confluence/display/Hive/StatsDev]
Reproducing steps:
scala> spark.sql("create table table1 (name string,age int) stored as parquet")
scala> spark.sql("insert into table1 select 'a',29")
res2: org.apache.spark.sql.DataFrame = []
scala> spark.sql("desc extended table1").show(false)
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|name|string|null|
|age|int|null|
| | | |
|# Detailed Table Information| | |
|Database|default| |
|Table|table1| |
|Owner|Administrator| |
|Created Time|Sun Apr 07 23:41:56 IST 2019| |
|Last Access|Thu Jan 01 05:30:00 IST 1970| |
|Created By|Spark 2.4.1| |
|Type|MANAGED| |
|Provider|hive| |
|Table Properties|[transient_lastDdlTime=1554660716]| |
|Location|file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1| |
|Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe| |
|InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat| |
|OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
|Storage Properties|[serialization.format=1]| |
|Partition Provider|Catalog| |
+--------+---------+-------+

> Failed to update the table size automatically even though
> spark.sql.statistics.size.autoUpdate.enabled is set as true
>
> Key: SPARK-27403
> URL: https://issues.apache.org/jira/browse/SPARK-27403
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.1
> Reporter: Sujith Chacko
> Priority: Major
>
> The system shall update the table stats automatically if the user sets
> spark.sql.statistics.size.autoUpdate.enabled to true; currently this property
> has no effect whether it is enabled or disabled. This feature is similar to
> Hive's auto-gather feature, where statistics are automatically computed by
> default when the feature is enabled.
> Reference: [https://cwiki.apache.org/confluence/display/Hive/StatsDev]
> Reproducing steps:
> scala> spark.sql("create table table1 (name string,age int) stored as parquet")
> scala> spark.sql("insert into table1 select 'a',29")
> res2: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("desc extended table1").show(false)
>
[jira] [Commented] (SPARK-27403) Failed to update the table size automatically even though spark.sql.statistics.size.autoUpdate.enabled is set as true
[ https://issues.apache.org/jira/browse/SPARK-27403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811940#comment-16811940 ]

Sujith Chacko commented on SPARK-27403:
---------------------------------------
I will analyze this further and raise a PR to handle the issue. Please let me know if you have any suggestions. Thanks.

> Failed to update the table size automatically even though
> spark.sql.statistics.size.autoUpdate.enabled is set as true
>
> Key: SPARK-27403
> URL: https://issues.apache.org/jira/browse/SPARK-27403
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.1
> Reporter: Sujith Chacko
> Priority: Major
>
> The system shall update the table stats automatically if the user sets
> spark.sql.statistics.size.autoUpdate.enabled to true; currently this property
> has no effect whether it is enabled or disabled. This feature is similar to
> Hive's auto-gather feature, where statistics are automatically computed by
> default when the feature is enabled.
> Reference: [https://cwiki.apache.org/confluence/display/Hive/StatsDev]
> Reproducing steps:
> scala> spark.sql("create table table1 (name string,age int) stored as parquet")
> scala> spark.sql("insert into table1 select 'a',29")
> res2: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("desc extended table1").show(false)
>
> +--------+---------+-------+
> |col_name|data_type|comment|
> +--------+---------+-------+
> |name|string|null|
> |age|int|null|
> | | | |
> |# Detailed Table Information| | |
> |Database|default| |
> |Table|table1| |
> |Owner|Administrator| |
> |Created Time|Sun Apr 07 23:41:56 IST 2019| |
> |Last Access|Thu Jan 01 05:30:00 IST 1970| |
> |Created By|Spark 2.4.1| |
> |Type|MANAGED| |
> |Provider|hive| |
> |Table Properties|[transient_lastDdlTime=1554660716]| |
> |Location|file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1| |
> |Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe| |
> |InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat| |
> |OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
> |Storage Properties|[serialization.format=1]| |
> |Partition Provider|Catalog| |
> +--------+---------+-------+

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27403) Failed to update the table size automatically even though spark.sql.statistics.size.autoUpdate.enabled is set as true
[ https://issues.apache.org/jira/browse/SPARK-27403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sujith Chacko updated SPARK-27403:
----------------------------------
Description:
The system shall update the table stats automatically if the user sets spark.sql.statistics.size.autoUpdate.enabled to true; currently this property has no effect whether it is enabled or disabled. This feature is similar to Hive's auto-gather feature, where statistics are automatically computed by default when the feature is enabled.
Reference: [https://cwiki.apache.org/confluence/display/Hive/StatsDev]
Reproducing steps:
scala> spark.sql("create table table1 (name string,age int) stored as parquet")
scala> spark.sql("insert into table1 select 'a',29")
res2: org.apache.spark.sql.DataFrame = []
scala> spark.sql("desc extended table1").show(false)
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|name|string|null|
|age|int|null|
| | | |
|# Detailed Table Information| | |
|Database|default| |
|Table|table1| |
|Owner|Administrator| |
|Created Time|Sun Apr 07 23:41:56 IST 2019| |
|Last Access|Thu Jan 01 05:30:00 IST 1970| |
|Created By|Spark 2.4.1| |
|Type|MANAGED| |
|Provider|hive| |
|Table Properties|[transient_lastDdlTime=1554660716]| |
|Location|file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1| |
|Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe| |
|InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat| |
|OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
|Storage Properties|[serialization.format=1]| |
|Partition Provider|Catalog| |
+--------+---------+-------+

was:
scala> spark.sql("insert into table1 select 'a',29")
res2: org.apache.spark.sql.DataFrame = []
scala> spark.sql("desc extended table1").show(false)
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|name|string|null|
|age|int|null|
| | | |
|# Detailed Table Information| | |
|Database|default| |
|Table|table1| |
|Owner|Administrator| |
|Created Time|Sun Apr 07 23:41:56 IST 2019| |
|Last Access|Thu Jan 01 05:30:00 IST 1970| |
|Created By|Spark 2.4.1| |
|Type|MANAGED| |
|Provider|hive| |
|Table Properties|[transient_lastDdlTime=1554660716]| |
|Location|file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1| |
|Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe| |
|InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat| |
|OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
|Storage Properties|[serialization.format=1]| |
|Partition Provider|Catalog| |
+--------+---------+-------+

> Failed to update the table size automatically even though
> spark.sql.statistics.size.autoUpdate.enabled is set as true
>
> Key: SPARK-27403
> URL: https://issues.apache.org/jira/browse/SPARK-27403
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.1
> Reporter: Sujith Chacko
> Priority: Major
>
> The system shall update the table stats automatically if the user sets
> spark.sql.statistics.size.autoUpdate.enabled to true; currently this property
> has no effect whether it is enabled or disabled. This feature is similar to
> Hive's auto-gather feature, where statistics are automatically computed by
> default when the feature is enabled.
> Reference: [https://cwiki.apache.org/confluence/display/Hive/StatsDev]
> Reproducing steps:
> scala> spark.sql("create table table1 (name string,age int) stored as parquet")
> scala> spark.sql("insert into table1 select 'a',29")
> res2: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("desc extended table1").show(false)
>
> +--------+---------+-------+
> |col_name|data_type|comment|
> +--------+---------+-------+
> |name|string|null|
> |age|int|null|
> | | | |
> |# Detailed Table Information| | |
> |Database|default| |
> |Table|table1| |
> |Owner|Administrator| |
> |Created Time|Sun Apr 07 23:41:56 IST 2019| |
> |Last Access|Thu Jan 01 05:30:00 IST 1970| |
> |Created By|Spark 2.4.1| |
> |Type|MANAGED| |
>
[jira] [Updated] (SPARK-27403) Failed to update the table size automatically even though spark.sql.statistics.size.autoUpdate.enabled is set as true
[ https://issues.apache.org/jira/browse/SPARK-27403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sujith Chacko updated SPARK-27403:
----------------------------------
Environment: (was: The system shall be able to update the table stats automatically if the user sets spark.sql.statistics.size.autoUpdate.enabled to true. Currently this property doesn't have any significance whether it is enabled or disabled.
Please follow the steps below to reproduce the issue:
scala> spark.sql("create table x222 (name string,age int) stored as parquet")
19/04/07 23:41:56 WARN HiveMetaStore: Location: file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1 specified for non-external table:table1
-chgrp: 'HTIPL-23270\None' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
res0: org.apache.spark.sql.DataFrame = []
scala> spark.sql("insert into table1 select 'a',29")
res2: org.apache.spark.sql.DataFrame = []
scala> spark.sql("desc extended table1").show(false)
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|name|string|null|
|age|int|null|
| | | |
|# Detailed Table Information| | |
|Database|default| |
|Table|table1| |
|Owner|Administrator| |
|Created Time|Sun Apr 07 23:41:56 IST 2019| |
|Last Access|Thu Jan 01 05:30:00 IST 1970| |
|Created By|Spark 2.4.1| |
|Type|MANAGED| |
|Provider|hive| |
|Table Properties|[transient_lastDdlTime=1554660716]| |
|Location|file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1| |
|Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe| |
|InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat| |
|OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
|Storage Properties|[serialization.format=1]| |
|Partition Provider|Catalog| |
+--------+---------+-------+)

> Failed to update the table size automatically even though
> spark.sql.statistics.size.autoUpdate.enabled is set as true
>
> Key: SPARK-27403
> URL: https://issues.apache.org/jira/browse/SPARK-27403
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.1
> Reporter: Sujith Chacko
> Priority: Major
>
> scala> spark.sql("insert into table1 select 'a',29")
> res2: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("desc extended table1").show(false)
> +--------+---------+-------+
> |col_name|data_type|comment|
> +--------+---------+-------+
> |name|string|null|
> |age|int|null|
> | | | |
> |# Detailed Table Information| | |
> |Database|default| |
> |Table|table1| |
> |Owner|Administrator| |
> |Created Time|Sun Apr 07 23:41:56 IST 2019| |
> |Last Access|Thu Jan 01 05:30:00 IST 1970| |
> |Created By|Spark 2.4.1| |
> |Type|MANAGED| |
> |Provider|hive| |
> |Table Properties|[transient_lastDdlTime=1554660716]| |
> |Location|file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1| |
> |Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe| |
> |InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat| |
> |OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
> |Storage Properties|[serialization.format=1]| |
> |Partition Provider|Catalog| |
> +--------+---------+-------+

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27403) Failed to update the table size automatically even though spark.sql.statistics.size.autoUpdate.enabled is set as true
[ https://issues.apache.org/jira/browse/SPARK-27403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sujith Chacko updated SPARK-27403:
----------------------------------
Description:
scala> spark.sql("insert into table1 select 'a',29")
res2: org.apache.spark.sql.DataFrame = []
scala> spark.sql("desc extended table1").show(false)
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|name|string|null|
|age|int|null|
| | | |
|# Detailed Table Information| | |
|Database|default| |
|Table|table1| |
|Owner|Administrator| |
|Created Time|Sun Apr 07 23:41:56 IST 2019| |
|Last Access|Thu Jan 01 05:30:00 IST 1970| |
|Created By|Spark 2.4.1| |
|Type|MANAGED| |
|Provider|hive| |
|Table Properties|[transient_lastDdlTime=1554660716]| |
|Location|file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1| |
|Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe| |
|InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat| |
|OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
|Storage Properties|[serialization.format=1]| |
|Partition Provider|Catalog| |
+--------+---------+-------+

> Failed to update the table size automatically even though
> spark.sql.statistics.size.autoUpdate.enabled is set as true
>
> Key: SPARK-27403
> URL: https://issues.apache.org/jira/browse/SPARK-27403
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.1
> Environment: The system shall be able to update the table stats
> automatically if the user sets spark.sql.statistics.size.autoUpdate.enabled
> to true. Currently this property doesn't have any significance whether it is
> enabled or disabled.
> Please follow the steps below to reproduce the issue:
> scala> spark.sql("create table x222 (name string,age int) stored as parquet")
> 19/04/07 23:41:56 WARN HiveMetaStore: Location:
> file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1 specified for
> non-external table:table1
> -chgrp: 'HTIPL-23270\None' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("insert into table1 select 'a',29")
> res2: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("desc extended table1").show(false)
> +--------+---------+-------+
> |col_name|data_type|comment|
> +--------+---------+-------+
> |name|string|null|
> |age|int|null|
> | | | |
> |# Detailed Table Information| | |
> |Database|default| |
> |Table|table1| |
> |Owner|Administrator| |
> |Created Time|Sun Apr 07 23:41:56 IST 2019| |
> |Last Access|Thu Jan 01 05:30:00 IST 1970| |
> |Created By|Spark 2.4.1| |
> |Type|MANAGED| |
> |Provider|hive| |
> |Table Properties|[transient_lastDdlTime=1554660716]| |
> |Location|file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1| |
> |Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe| |
> |InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat| |
> |OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
> |Storage Properties|[serialization.format=1]| |
> |Partition Provider|Catalog| |
> +--------+---------+-------+
> Reporter: Sujith Chacko
> Priority: Major
>
> scala> spark.sql("insert into table1 select 'a',29")
> res2: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("desc extended table1").show(false)
> +--------+---------+-------+
> |col_name|data_type|comment|
> +--------+---------+-------+
> |name|string|null|
> |age|int|null|
> | | | |
> |# Detailed Table Information| | |
> |Database|default| |
> |Table|table1| |
> |Owner|Administrator| |
> |Created Time|Sun Apr 07 23:41:56 IST 2019| |
> |Last Access|Thu Jan 01 05:30:00 IST 1970| |
> |Created By|Spark 2.4.1| |
> |Type|MANAGED| |
> |Provider|hive| |
> |Table Properties|[transient_lastDdlTime=1554660716]| |
> |Location|file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1| |
> |Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
[jira] [Created] (SPARK-27403) Failed to update the table size automatically even though spark.sql.statistics.size.autoUpdate.enabled is set as true
Sujith Chacko created SPARK-27403:
---------------------------------
Summary: Failed to update the table size automatically even though spark.sql.statistics.size.autoUpdate.enabled is set as true
Key: SPARK-27403
URL: https://issues.apache.org/jira/browse/SPARK-27403
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.1
Environment: The system shall be able to update the table stats automatically if the user sets spark.sql.statistics.size.autoUpdate.enabled to true. Currently this property doesn't have any significance whether it is enabled or disabled.
Please follow the steps below to reproduce the issue:
scala> spark.sql("create table x222 (name string,age int) stored as parquet")
19/04/07 23:41:56 WARN HiveMetaStore: Location: file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1 specified for non-external table:table1
-chgrp: 'HTIPL-23270\None' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
res0: org.apache.spark.sql.DataFrame = []
scala> spark.sql("insert into table1 select 'a',29")
res2: org.apache.spark.sql.DataFrame = []
scala> spark.sql("desc extended table1").show(false)
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|name|string|null|
|age|int|null|
| | | |
|# Detailed Table Information| | |
|Database|default| |
|Table|table1| |
|Owner|Administrator| |
|Created Time|Sun Apr 07 23:41:56 IST 2019| |
|Last Access|Thu Jan 01 05:30:00 IST 1970| |
|Created By|Spark 2.4.1| |
|Type|MANAGED| |
|Provider|hive| |
|Table Properties|[transient_lastDdlTime=1554660716]| |
|Location|file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1| |
|Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe| |
|InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat| |
|OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
|Storage Properties|[serialization.format=1]| |
|Partition Provider|Catalog| |
+--------+---------+-------+
Reporter: Sujith Chacko

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
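For context, the stats auto-update the issue asks for boils down to recomputing the total byte size of the files under the table's location after each write and storing it back in the catalog. A minimal, Spark-free sketch of just that size computation (the class and method names are illustrative, not Spark's actual implementation):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class TableSizeSketch {
    // Sum the sizes of all regular files under the table location, the way
    // an automatic stats update would recompute the table's sizeInBytes.
    static long tableSizeInBytes(Path tableLocation) throws IOException {
        try (Stream<Path> files = Files.walk(tableLocation)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        // Fake a tiny table directory with two "part" files.
        Path dir = Files.createTempDirectory("table1");
        Files.write(dir.resolve("part-00000.parquet"), new byte[128]);
        Files.write(dir.resolve("part-00001.parquet"), new byte[72]);
        System.out.println(tableSizeInBytes(dir)); // prints 200
    }
}
```

In Spark itself this recomputation is triggered per write command and the result is written to the metastore as table statistics; the sketch only shows the filesystem-scan half of that.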
[jira] [Commented] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer
[ https://issues.apache.org/jira/browse/SPARK-26970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811932#comment-16811932 ] Jen Darrouzet commented on SPARK-26970: --- I am a newbie but would very much like to use the interaction transformer in pyspark and am available to help QA test it, when/if it becomes available. > Can't load PipelineModel that was created in Scala with Python due to missing > Interaction transformer > - > > Key: SPARK-26970 > URL: https://issues.apache.org/jira/browse/SPARK-26970 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Andrew Crosby >Priority: Minor > > The Interaction transformer > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala] > is missing from the set of pyspark feature transformers > [https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py] > > This means that it is impossible to create a model that includes an > Interaction transformer with pyspark. It also means that attempting to load a > PipelineModel created in Scala that includes an Interaction transformer with > pyspark fails with the following error: > {code:java} > AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction' > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26910) Re-release SparkR to CRAN
[ https://issues.apache.org/jira/browse/SPARK-26910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811930#comment-16811930 ] Michael Chirico commented on SPARK-26910: - awesome! thanks and congrats :) > Re-release SparkR to CRAN > - > > Key: SPARK-26910 > URL: https://issues.apache.org/jira/browse/SPARK-26910 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Michael Chirico >Assignee: Felix Cheung >Priority: Major > Fix For: 2.4.1 > > > The logical successor to https://issues.apache.org/jira/browse/SPARK-15799 > I don't see anything specifically tracking re-release in the Jira list. It > would be helpful to have an issue tracking this to refer to as an outsider, > as well as to document what the blockers are in case some outside help could > be useful. > * Is there a plan to re-release SparkR to CRAN? > * What are the major blockers to doing so at the moment? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27388) expression encoder for avro like objects
[ https://issues.apache.org/jira/browse/SPARK-27388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Taoufik DACHRAOUI updated SPARK-27388:
--------------------------------------
Description:
*What changes were proposed in this pull request?*
This PR adds expression encoders for beans, java.util.List, java.util.Map and Java enums. Beans are objects defined by properties; a property is defined by a setter and a getter function where the getter return type is equal to the setter's unique parameter type and the getter and setter have the same name; if the getter name is prefixed by "get" then the setter name must be prefixed by "set"; see the tests for bean examples. Avro objects are beans, and thus we can create an expression encoder for Avro objects as follows:
{code:java}
implicit val exprEncoder = ExpressionEncoder[Foo]()
{code}
All Avro types, including fixed types and excluding complex union types, are supported by this addition. The Avro fixed types are beans with exactly one property: bytes. Currently complex Avro unions are not supported, because a complex union is declared as Object and there cannot be an expression encoder for the Object type (a custom serializer such as Kryo would be needed).
*How was this patch tested?*
Currently only one encodeDecodeTest was added to ExpressionEncoderSuite; the test uses an Avro object with map, array and fixed fields, as in test 3 (below). I used the modified spark-sql package in a local project to test it. The tests are as follows (where Barcode is a large Avro object):
1.
test with simple beans {code:java} class Bar { private var bar$: String = _ def bar(value: String): Unit = { bar$ = value } def bar(): String = bar$ override def toString() = { s"Bar($bar)" } } class Foo extends Bar { var a: Int = _ var b: Bar = _ def getA() = a def setA(x: Int) { a = x } def getB(): Bar = b def setB(x: Bar) { b = x } override def toString() = { s"Foo($a,$b, ${bar()})" } } {code} {code:java} implicit val encoderFoo = ExpressionEncoder[Foo] val bar = new Bar bar.bar("ok") val a = new Foo a.setA(55) a.setB(bar) a.bar("BAR") val ds = List(a).toDS() println(ds.collect().toList) val df = List(a).toDF() val r = df.collect().foreach(println) println(df.schema) Result => List(Foo(55,Bar(ok), BAR)) [[ok],55,BAR] StructType(StructField(B,StructType(StructField(bar,StringType,true)),true), StructField(A,IntegerType,false), StructField(bar,StringType,true)) {code} 2. test with a large Avro schema (Barcode) with nested avro objects, java enums and arrays: {code:java} implicit val barcodeEncoder = ExpressionEncoder[Barcode]() val ds: Dataset[Barcode] = 0.until(1).map(i => { val crpAddDesc = new java.util.ArrayList[CrpAddDesc]() crpAddDesc.add(CrpAddDesc.newBuilder() .setCrpAddEngDesc(s"Crp description1 $i") .build()) crpAddDesc.add(CrpAddDesc.newBuilder() .setCrpAddEngDesc(s"Crp description2 $i") .build()) val crpAttrs = new java.util.ArrayList[CrpAttributes]() crpAttrs.add(CrpAttributes.newBuilder() .setCrpCode(s"crp attr1 $i") .setCrpAddDesc(crpAddDesc) .build()) crpAttrs.add(CrpAttributes.newBuilder() .setCrpCode(s"crp attr2 $i") .build()) val barcode = Barcode.newBuilder() .setBarcode(s"Bar$i") .setPrdTaxVal(Money.newBuilder() .setUnscaledAmount(i.toLong) .setScale(0) .setCurrency(Currency.EUR) .setCurrencyAlphaCode("EUR") .build()) .setCrpAttributes(crpAttrs) .build() barcode }) .toDS() val x = ds.map(a => { ( a.getBarcode, a.getPrdTaxVal.getCurrency, a.getCrpAttributes.get(0).getCrpCode, a.getCrpAttributes.get(1).getCrpCode) }) 
println(x.collect().toList.drop(100).head) result => (Bar100,EUR,crp attr1 100,crp attr2 100) {code} 3. test with Avro schema having map, array and fixed types {code:java} implicit val logEncoder = ExpressionEncoder[Log]() val ds: Dataset[Log] = List(Log.newBuilder() .setIps(List("127.0.0.1", "127.0.0.0").asJava) .setAdditional(Map( "foo" -> new java.lang.Integer(1), "bar" -> new java.lang.Integer(2)).asJava) .setTimestamp("12345678") .setMessage("test map") .setMagic(new Magic("magic".getBytes)) .build()) .toDS() println(ds.collect().toList) result: List({"ips": ["127.0.0.1", "127.0.0.0"], "timestamp": "12345678", "message": "test map", "magic": [109, 97, 103, 105, 99], "additional": {"foo": 1, "bar": 2}}) {code} Log Avro schema: {code:java} {"namespace": "example.avro", "type": "record", "name": "Log", "fields": [ {"name": "ips", "type": {"type": "array", "items": "string"}}, {"name": "timestamp", "type": "string"},
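The bean-property rule the PR description relies on (a getter whose return type equals the single parameter type of a matching setter, with paired get/set names) can be checked with plain JDK reflection. A self-contained sketch of that detection under those assumptions (the `Person` class and `properties` helper are illustrative, not part of the patch):

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class BeanProps {
    public static class Person {
        private String name;
        public String getName() { return name; }
        public void setName(String n) { name = n; }
    }

    // Collect property names per the rule in the issue description:
    // getX/setX pairs where the getter's return type equals the setter's
    // unique parameter type and the setter returns void.
    public static List<String> properties(Class<?> cls) {
        List<String> props = new ArrayList<>();
        for (Method g : cls.getMethods()) {
            if (!g.getName().startsWith("get") || g.getParameterCount() != 0
                    || g.getName().equals("getClass")) continue;
            String prop = g.getName().substring(3);
            try {
                Method s = cls.getMethod("set" + prop, g.getReturnType());
                if (s.getReturnType() == void.class) props.add(prop);
            } catch (NoSuchMethodException ignored) {
                // No matching setter: not a bean property under this rule.
            }
        }
        return props;
    }

    public static void main(String[] args) {
        System.out.println(properties(Person.class)); // prints [Name]
    }
}
```

An encoder built on this rule would then map each discovered property to a serializer/deserializer expression, which is the part the PR implements inside Catalyst.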
[jira] [Updated] (SPARK-27402) Support HiveExternalCatalog backward compatibility test
[ https://issues.apache.org/jira/browse/SPARK-27402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27402: Description: When we upgrade the built-in Hive to 2.3.4, the default spark.sql.hive.metastore.version should be 2.3.4. This will not be compatible with spark-2.3.3-bin-hadoop2.7.tgz and spark-2.4.1-bin-hadoop2.7.tgz. > Support HiveExternalCatalog backward compatibility test > --- > > Key: SPARK-27402 > URL: https://issues.apache.org/jira/browse/SPARK-27402 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > When we upgrade the built-in Hive to 2.3.4, the default > spark.sql.hive.metastore.version should be 2.3.4. This will not be compatible > with spark-2.3.3-bin-hadoop2.7.tgz and spark-2.4.1-bin-hadoop2.7.tgz. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
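For reference, the metastore client version described above is selected through Spark's documented configuration keys; a compatibility test would pin them to the older warehouse's version. A sketch of such a configuration (the specific values shown are an assumption for illustration, not from the ticket):

```
# Talk to a 1.2-era Hive metastore even when Spark bundles Hive 2.3.x,
# which is what a HiveExternalCatalog backward-compatibility test exercises.
spark.sql.hive.metastore.version  1.2.1
spark.sql.hive.metastore.jars     maven
```

With `metastore.jars` set to `maven`, Spark downloads the matching Hive client jars instead of using the built-in ones, so older warehouse layouts remain readable.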
[jira] [Created] (SPARK-27402) Support HiveExternalCatalog backward compatibility test
Yuming Wang created SPARK-27402: --- Summary: Support HiveExternalCatalog backward compatibility test Key: SPARK-27402 URL: https://issues.apache.org/jira/browse/SPARK-27402 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27398) Get rid of sun.nio.cs.StreamDecoder in CreateJacksonParser
[ https://issues.apache.org/jira/browse/SPARK-27398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27398. --- Resolution: Not A Problem > Get rid of sun.nio.cs.StreamDecoder in CreateJacksonParser > -- > > Key: SPARK-27398 > URL: https://issues.apache.org/jira/browse/SPARK-27398 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Trivial > > The CreateJacksonParser.getStreamDecoder method creates an instance of > ReadableByteChannel and returns the result as of sun.nio.cs.StreamDecoder. > This is unnecessary and overcomplicates the method. This code can be replaced > by: > {code:scala} > val bais = new ByteArrayInputStream(in, 0, length) > new InputStreamReader(bais, enc) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
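The replacement proposed in the issue is plain JDK I/O, so it can be sketched without Spark at all. The snippet below (in Java rather than the issue's Scala, with an illustrative `decode` helper) wraps a byte-array slice the same way the suggested two-liner does, with no sun.nio.cs.StreamDecoder involved:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class DecoderSketch {
    // Equivalent of the proposed replacement: wrap the byte range in a
    // ByteArrayInputStream and let InputStreamReader handle the charset
    // decoding internally.
    static String decode(byte[] in, int length, String enc) throws IOException {
        ByteArrayInputStream bais = new ByteArrayInputStream(in, 0, length);
        try (InputStreamReader reader = new InputStreamReader(bais, enc)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) sb.append((char) c);
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] json = "{\"a\":1} trailing".getBytes(StandardCharsets.UTF_8);
        // Only the first `length` bytes are decoded, as in the JSON parser path.
        System.out.println(decode(json, 7, "UTF-8")); // prints {"a":1}
    }
}
```

In the real code the resulting Reader would be handed to Jackson's parser factory; the point of the ticket was only that the intermediate Channel/StreamDecoder plumbing is unnecessary.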
[jira] [Commented] (SPARK-27352) Apply for translation of the Chinese version, I hope to get authorization!
[ https://issues.apache.org/jira/browse/SPARK-27352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811835#comment-16811835 ] Teng Peng commented on SPARK-27352: --- I would say go ahead and send a PR to add the link to the doc. This might be a good place for the link [https://spark.apache.org/docs/latest/index.html] > Apply for translation of the Chinese version, I hope to get authorization! > --- > > Key: SPARK-27352 > URL: https://issues.apache.org/jira/browse/SPARK-27352 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Yuan Yifan >Priority: Minor > > Hello everyone, we are [ApacheCN|https://www.apachecn.org/], an open-source > community in China, focusing on Big Data and AI. > Recently, we have been making progress on translating Spark documents. > - [Source Of Document|https://github.com/apachecn/spark-doc-zh] > - [Document Preview|http://spark.apachecn.org/] > There are several reasons: > *1. The English level of many Chinese users is not very good.* > *2. Network problems, you know (China's magic network)!* > *3. Online blogs are very messy.* > We are very willing to do some Chinese localization for your project. If > possible, please give us some authorization. > Yifan Yuan from Apache CN > You may contact me by mail [tsingjyuj...@163.com|mailto:tsingjyuj...@163.com] > for more details -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27352) Apply for translation of the Chinese version, I hope to get authorization!
[ https://issues.apache.org/jira/browse/SPARK-27352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811815#comment-16811815 ] Yuan Yifan commented on SPARK-27352: [~Teng Peng] Thank you for your attention to this wish. Actually, translating the documents into Chinese does not go against the license, and we don't have to get an "authorization". But we're looking forward to deeper cooperation between Apache and us, such as putting a "ZH Doc" link on http://spark.apache.org/; maybe it will bring our documentation to more Chinese users and make Spark easier for them to use. Thank you again for your attention on this; please feel free to mail/reply to me if there are any questions about it. > Apply for translation of the Chinese version, I hope to get authorization! > --- > > Key: SPARK-27352 > URL: https://issues.apache.org/jira/browse/SPARK-27352 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Yuan Yifan >Priority: Minor > > Hello everyone, we are [ApacheCN|https://www.apachecn.org/], an open-source > community in China, focusing on Big Data and AI. > Recently, we have been making progress on translating Spark documents. > - [Source Of Document|https://github.com/apachecn/spark-doc-zh] > - [Document Preview|http://spark.apachecn.org/] > There are several reasons: > *1. The English level of many Chinese users is not very good.* > *2. Network problems, you know (China's magic network)!* > *3. Online blogs are very messy.* > We are very willing to do some Chinese localization for your project. If > possible, please give us some authorization. > Yifan Yuan from Apache CN > You may contact me by mail [tsingjyuj...@163.com|mailto:tsingjyuj...@163.com] > for more details -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27278) Optimize GetMapValue when the map is a foldable and the key is not
[ https://issues.apache.org/jira/browse/SPARK-27278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811808#comment-16811808 ] Marco Gaido commented on SPARK-27278: - [~huonw] I think the point is: in the existing case which is optimized, the {{CreateMap}} operation is removed and replaced with the {{CASE ... WHEN}} syntax: this is fine because the code generated by the {{CreateMap}} operation is linear in size with the number of elements, exactly as for the {{CASE ... WHEN}} approach, so the overall generated code size is similar. By contrast, when the {{CreateMap}} operation is replaced with a {{Literal}}, that code is not there, and we replace the existing for loop with a list of if statements, so the code size is significantly bigger. Although trivial tests show that the second case is faster, generating bigger code can lead to huge perf issues in more complex scenarios (the worst cases are probably when it causes the method size to grow bigger than 4k so no JIT is performed anymore, or it may even cause the failure to use WholeStageCodegen): so the performance effect of introducing it would be highly query-dependent and hard to predict. > Optimize GetMapValue when the map is a foldable and the key is not > -- > > Key: SPARK-27278 > URL: https://issues.apache.org/jira/browse/SPARK-27278 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 >Reporter: Huon Wilson >Priority: Minor > > With a map that isn't constant-foldable, spark will optimise an access to a > series of {{CASE WHEN ... THEN ... WHEN ... THEN ... 
END}}, for instance > {code:none} > scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), 'id)('id) as > "x").explain > == Physical Plan == > *(1) Project [CASE WHEN (cast(id#180L as int) = 1) THEN 1 WHEN (cast(id#180L > as int) = 2) THEN id#180L END AS x#182L] > +- *(1) Range (0, 1000, step=1, splits=12) > {code} > This results in an efficient series of ifs and elses, in the code generation: > {code:java} > /* 037 */ boolean project_isNull_3 = false; > /* 038 */ int project_value_3 = -1; > /* 039 */ if (!false) { > /* 040 */ project_value_3 = (int) project_expr_0_0; > /* 041 */ } > /* 042 */ > /* 043 */ boolean project_value_2 = false; > /* 044 */ project_value_2 = project_value_3 == 1; > /* 045 */ if (!false && project_value_2) { > /* 046 */ project_caseWhenResultState_0 = (byte)(false ? 1 : 0); > /* 047 */ project_project_value_1_0 = 1L; > /* 048 */ continue; > /* 049 */ } > /* 050 */ > /* 051 */ boolean project_isNull_8 = false; > /* 052 */ int project_value_8 = -1; > /* 053 */ if (!false) { > /* 054 */ project_value_8 = (int) project_expr_0_0; > /* 055 */ } > /* 056 */ > /* 057 */ boolean project_value_7 = false; > /* 058 */ project_value_7 = project_value_8 == 2; > /* 059 */ if (!false && project_value_7) { > /* 060 */ project_caseWhenResultState_0 = (byte)(false ? 
1 : 0); > /* 061 */ project_project_value_1_0 = project_expr_0_0; > /* 062 */ continue; > /* 063 */ } > {code} > If the map can be constant folded, the constant folding happens first, and > the {{SimplifyExtractValueOps}} optimisation doesn't trigger, resulting doing > a map traversal and more dynamic checks: > {code:none} > scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), lit(2))('id) as > "x").explain > == Physical Plan == > *(1) Project [keys: [1,2], values: [1,2][cast(id#195L as int)] AS x#197] > +- *(1) Range (0, 1000, step=1, splits=12) > {code} > The {{keys: ..., values: ...}} is from the {{ArrayBasedMapData}} type, which > is what is stored in the {{Literal}} form of the {{map(...)}} expression in > that select. The code generated is less efficient, since it has to do a > manual dynamic traversal of the map's array of keys, with type casts etc.: > {code:java} > /* 099 */ int project_index_0 = 0; > /* 100 */ boolean project_found_0 = false; > /* 101 */ while (project_index_0 < project_length_0 && > !project_found_0) { > /* 102 */ final int project_key_0 = > project_keys_0.getInt(project_index_0); > /* 103 */ if (project_key_0 == project_value_2) { > /* 104 */ project_found_0 = true; > /* 105 */ } else { > /* 106 */ project_index_0++; > /* 107 */ } > /* 108 */ } > /* 109 */ > /* 110 */ if (!project_found_0) { > /* 111 */ project_isNull_0 = true; > /* 112 */ } else { > /* 113 */ project_value_0 = > project_values_0.getInt(project_index_0); > /* 114 */ } > {code} > It
[jira] [Issue Comment Deleted] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] antonkulaga updated SPARK-14220: Comment: was deleted (was: I suggest to use Spark 2.4.1 as there Scala 2.12 is not longer experimental) > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Sean Owen >Priority: Blocker > Labels: release-notes > Fix For: 2.4.0 > > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.12 milestone. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811726#comment-16811726 ] antonkulaga commented on SPARK-14220: - I suggest using Spark 2.4.1, as Scala 2.12 is no longer experimental there > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Sean Owen >Priority: Blocker > Labels: release-notes > Fix For: 2.4.0 > > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.12 milestone. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27399) Spark streaming of kafka 0.10 contains some scattered config
[ https://issues.apache.org/jira/browse/SPARK-27399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27399. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24267 [https://github.com/apache/spark/pull/24267] > Spark streaming of kafka 0.10 contains some scattered config > > > Key: SPARK-27399 > URL: https://issues.apache.org/jira/browse/SPARK-27399 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0, 2.4.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Minor > Fix For: 3.0.0 > > > I found a lot of scattered config in Kafka streaming. > I think these configs should be arranged in a unified place. > There are also some hardcoded values like > {code:java} > spark.network.timeout{code} > that need to change. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27383) Avoid using hard-coded jar names in Hive tests
[ https://issues.apache.org/jira/browse/SPARK-27383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27383. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24294 [https://github.com/apache/spark/pull/24294] > Avoid using hard-coded jar names in Hive tests > -- > > Key: SPARK-27383 > URL: https://issues.apache.org/jira/browse/SPARK-27383 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > Avoid using hard-coded jar names({{hive-contrib-0.13.1.jar}} and > {{hive-hcatalog-core-0.13.1.jar}}) in Hive tests. This makes it easy to > change when upgrading the built-in Hive to 2.3.4. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27399) Spark streaming of kafka 0.10 contains some scattered config
[ https://issues.apache.org/jira/browse/SPARK-27399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27399: - Assignee: jiaan.geng > Spark streaming of kafka 0.10 contains some scattered config > > > Key: SPARK-27399 > URL: https://issues.apache.org/jira/browse/SPARK-27399 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0, 2.4.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Minor > > I found a lot of scattered config in Kafka streaming. > I think these configs should be arranged in a unified place. > There are also some hardcoded values like > {code:java} > spark.network.timeout{code} > that need to change. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27383) Avoid using hard-coded jar names in Hive tests
[ https://issues.apache.org/jira/browse/SPARK-27383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27383: - Assignee: Yuming Wang Priority: Minor (was: Major) > Avoid using hard-coded jar names in Hive tests > -- > > Key: SPARK-27383 > URL: https://issues.apache.org/jira/browse/SPARK-27383 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Minor > Fix For: 3.0.0 > > > Avoid using hard-coded jar names({{hive-contrib-0.13.1.jar}} and > {{hive-hcatalog-core-0.13.1.jar}}) in Hive tests. This makes it easy to > change when upgrading the built-in Hive to 2.3.4. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27278) Optimize GetMapValue when the map is a foldable and the key is not
[ https://issues.apache.org/jira/browse/SPARK-27278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811720#comment-16811720 ] Dongjoon Hyun commented on SPARK-27278: --- The reverting PR should not reuse this JIRA because the purpose is different. This JIRA is dedicated to [~mgaido]'s improvement approach and his PR code. I prefer Marco's way and believe that you do. > Optimize GetMapValue when the map is a foldable and the key is not > -- > > Key: SPARK-27278 > URL: https://issues.apache.org/jira/browse/SPARK-27278 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 >Reporter: Huon Wilson >Priority: Minor > > With a map that isn't constant-foldable, spark will optimise an access to a > series of {{CASE WHEN ... THEN ... WHEN ... THEN ... END}}, for instance > {code:none} > scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), 'id)('id) as > "x").explain > == Physical Plan == > *(1) Project [CASE WHEN (cast(id#180L as int) = 1) THEN 1 WHEN (cast(id#180L > as int) = 2) THEN id#180L END AS x#182L] > +- *(1) Range (0, 1000, step=1, splits=12) > {code} > This results in an efficient series of ifs and elses, in the code generation: > {code:java} > /* 037 */ boolean project_isNull_3 = false; > /* 038 */ int project_value_3 = -1; > /* 039 */ if (!false) { > /* 040 */ project_value_3 = (int) project_expr_0_0; > /* 041 */ } > /* 042 */ > /* 043 */ boolean project_value_2 = false; > /* 044 */ project_value_2 = project_value_3 == 1; > /* 045 */ if (!false && project_value_2) { > /* 046 */ project_caseWhenResultState_0 = (byte)(false ? 
1 : 0); > /* 047 */ project_project_value_1_0 = 1L; > /* 048 */ continue; > /* 049 */ } > /* 050 */ > /* 051 */ boolean project_isNull_8 = false; > /* 052 */ int project_value_8 = -1; > /* 053 */ if (!false) { > /* 054 */ project_value_8 = (int) project_expr_0_0; > /* 055 */ } > /* 056 */ > /* 057 */ boolean project_value_7 = false; > /* 058 */ project_value_7 = project_value_8 == 2; > /* 059 */ if (!false && project_value_7) { > /* 060 */ project_caseWhenResultState_0 = (byte)(false ? 1 : 0); > /* 061 */ project_project_value_1_0 = project_expr_0_0; > /* 062 */ continue; > /* 063 */ } > {code} > If the map can be constant folded, the constant folding happens first, and > the {{SimplifyExtractValueOps}} optimisation doesn't trigger, resulting doing > a map traversal and more dynamic checks: > {code:none} > scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), lit(2))('id) as > "x").explain > == Physical Plan == > *(1) Project [keys: [1,2], values: [1,2][cast(id#195L as int)] AS x#197] > +- *(1) Range (0, 1000, step=1, splits=12) > {code} > The {{keys: ..., values: ...}} is from the {{ArrayBasedMapData}} type, which > is what is stored in the {{Literal}} form of the {{map(...)}} expression in > that select. 
The code generated is less efficient, since it has to do a > manual dynamic traversal of the map's array of keys, with type casts etc.: > {code:java} > /* 099 */ int project_index_0 = 0; > /* 100 */ boolean project_found_0 = false; > /* 101 */ while (project_index_0 < project_length_0 && > !project_found_0) { > /* 102 */ final int project_key_0 = > project_keys_0.getInt(project_index_0); > /* 103 */ if (project_key_0 == project_value_2) { > /* 104 */ project_found_0 = true; > /* 105 */ } else { > /* 106 */ project_index_0++; > /* 107 */ } > /* 108 */ } > /* 109 */ > /* 110 */ if (!project_found_0) { > /* 111 */ project_isNull_0 = true; > /* 112 */ } else { > /* 113 */ project_value_0 = > project_values_0.getInt(project_index_0); > /* 114 */ } > {code} > It looks like the problem is in {{SimplifyExtractValueOps}}, which doesn't > handle {{GetMapValue(Literal(...), key)}}, only the {{CreateMap}} form: > {code:scala} > case GetMapValue(CreateMap(elems), key) => CaseKeyWhen(key, elems) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
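The trade-off discussed above can be sketched outside of Spark's codegen. This is a minimal illustration, not the actual generated code: the first method has the shape of the {{CASE ... WHEN}} rewrite (a chain of constant comparisons), the second has the shape of the runtime scan over a map literal's key array, analogous to the {{ArrayBasedMapData}} traversal. Both compute the same result for the example map {{map(1 -> 1, 2 -> 2)}}.

```java
public class MapLookupShapes {
    // Shape produced when SimplifyExtractValueOps fires: the map access
    // becomes a series of if statements, one per entry.
    static Long caseWhenLookup(long key) {
        if (key == 1L) return 1L;
        if (key == 2L) return 2L;
        return null; // no match -> NULL, as in SQL
    }

    // Shape produced for a constant-folded map literal: a dynamic linear
    // scan over the key array, then an indexed read of the value array.
    static Long arrayScanLookup(long key) {
        int[] keys = {1, 2};
        long[] values = {1L, 2L};
        int i = 0;
        boolean found = false;
        while (i < keys.length && !found) {
            if (keys[i] == key) found = true; else i++;
        }
        return found ? values[i] : null;
    }

    public static void main(String[] args) {
        System.out.println(caseWhenLookup(2L));   // 2
        System.out.println(arrayScanLookup(2L));  // 2
        System.out.println(arrayScanLookup(5L));  // null
    }
}
```

The if-chain grows linearly in source size with the number of entries, which is the code-size concern raised in the comment above; the scan stays constant-size but pays per-lookup traversal cost.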
[jira] [Commented] (SPARK-27278) Optimize GetMapValue when the map is a foldable and the key is not
[ https://issues.apache.org/jira/browse/SPARK-27278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811717#comment-16811717 ] Dongjoon Hyun commented on SPARK-27278: --- [~huonw]. You can make a PR (for reverting the old one) if you are uncomfortable. Your PR will be reviewed in the same review process; Pros and Cons. Something missed is also the current behavior, too. > Optimize GetMapValue when the map is a foldable and the key is not > -- > > Key: SPARK-27278 > URL: https://issues.apache.org/jira/browse/SPARK-27278 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 >Reporter: Huon Wilson >Priority: Minor > > With a map that isn't constant-foldable, spark will optimise an access to a > series of {{CASE WHEN ... THEN ... WHEN ... THEN ... END}}, for instance > {code:none} > scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), 'id)('id) as > "x").explain > == Physical Plan == > *(1) Project [CASE WHEN (cast(id#180L as int) = 1) THEN 1 WHEN (cast(id#180L > as int) = 2) THEN id#180L END AS x#182L] > +- *(1) Range (0, 1000, step=1, splits=12) > {code} > This results in an efficient series of ifs and elses, in the code generation: > {code:java} > /* 037 */ boolean project_isNull_3 = false; > /* 038 */ int project_value_3 = -1; > /* 039 */ if (!false) { > /* 040 */ project_value_3 = (int) project_expr_0_0; > /* 041 */ } > /* 042 */ > /* 043 */ boolean project_value_2 = false; > /* 044 */ project_value_2 = project_value_3 == 1; > /* 045 */ if (!false && project_value_2) { > /* 046 */ project_caseWhenResultState_0 = (byte)(false ? 
1 : 0); > /* 047 */ project_project_value_1_0 = 1L; > /* 048 */ continue; > /* 049 */ } > /* 050 */ > /* 051 */ boolean project_isNull_8 = false; > /* 052 */ int project_value_8 = -1; > /* 053 */ if (!false) { > /* 054 */ project_value_8 = (int) project_expr_0_0; > /* 055 */ } > /* 056 */ > /* 057 */ boolean project_value_7 = false; > /* 058 */ project_value_7 = project_value_8 == 2; > /* 059 */ if (!false && project_value_7) { > /* 060 */ project_caseWhenResultState_0 = (byte)(false ? 1 : 0); > /* 061 */ project_project_value_1_0 = project_expr_0_0; > /* 062 */ continue; > /* 063 */ } > {code} > If the map can be constant folded, the constant folding happens first, and > the {{SimplifyExtractValueOps}} optimisation doesn't trigger, resulting doing > a map traversal and more dynamic checks: > {code:none} > scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), lit(2))('id) as > "x").explain > == Physical Plan == > *(1) Project [keys: [1,2], values: [1,2][cast(id#195L as int)] AS x#197] > +- *(1) Range (0, 1000, step=1, splits=12) > {code} > The {{keys: ..., values: ...}} is from the {{ArrayBasedMapData}} type, which > is what is stored in the {{Literal}} form of the {{map(...)}} expression in > that select. 
The code generated is less efficient, since it has to do a > manual dynamic traversal of the map's array of keys, with type casts etc.: > {code:java} > /* 099 */ int project_index_0 = 0; > /* 100 */ boolean project_found_0 = false; > /* 101 */ while (project_index_0 < project_length_0 && > !project_found_0) { > /* 102 */ final int project_key_0 = > project_keys_0.getInt(project_index_0); > /* 103 */ if (project_key_0 == project_value_2) { > /* 104 */ project_found_0 = true; > /* 105 */ } else { > /* 106 */ project_index_0++; > /* 107 */ } > /* 108 */ } > /* 109 */ > /* 110 */ if (!project_found_0) { > /* 111 */ project_isNull_0 = true; > /* 112 */ } else { > /* 113 */ project_value_0 = > project_values_0.getInt(project_index_0); > /* 114 */ } > {code} > It looks like the problem is in {{SimplifyExtractValueOps}}, which doesn't > handle {{GetMapValue(Literal(...), key)}}, only the {{CreateMap}} form: > {code:scala} > case GetMapValue(CreateMap(elems), key) => CaseKeyWhen(key, elems) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27352) Apply for translation of the Chinese version, I hope to get authorization!
[ https://issues.apache.org/jira/browse/SPARK-27352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811765#comment-16811765 ] Teng Peng commented on SPARK-27352: --- Correct me if I am wrong. I do not think any authorization is required for translation into other languages. > Apply for translation of the Chinese version, I hope to get authorization! > --- > > Key: SPARK-27352 > URL: https://issues.apache.org/jira/browse/SPARK-27352 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Yuan Yifan >Priority: Minor > > Hello everyone, we are [ApacheCN|https://www.apachecn.org/], an open-source > community in China, focusing on Big Data and AI. > Recently, we have been making progress on translating Spark documents. > - [Source Of Document|https://github.com/apachecn/spark-doc-zh] > - [Document Preview|http://spark.apachecn.org/] > There are several reasons: > *1. The English level of many Chinese users is not very good.* > *2. Network problems, you know (China's magic network)!* > *3. Online blogs are very messy.* > We are very willing to do some Chinese localization for your project. If > possible, please give us some authorization. > Yifan Yuan from Apache CN > You may contact me by mail [tsingjyuj...@163.com|mailto:tsingjyuj...@163.com] > for more details -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26992) Fix STS scheduler pool correct delivery
[ https://issues.apache.org/jira/browse/SPARK-26992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26992: - Assignee: dzcxzl > Fix STS scheduler pool correct delivery > --- > > Key: SPARK-26992 > URL: https://issues.apache.org/jira/browse/SPARK-26992 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.4.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Minor > Attachments: error_session.png, error_stage.png > > > The user sets the value of spark.sql.thriftserver.scheduler.pool. > Spark thrift server saves this value in a thread-local LocalProperty, but > does not clean it up after running, causing other sessions to run with the > previously set pool name. > > For example: > The second session does not manually set the pool name. The default pool name > should be used, but the pool name from the previous user's settings is used. > This is incorrect. > !error_session.png! > > !error_stage.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
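The SPARK-26992 bug pattern is a classic thread-local leak: a per-session value is set on a pooled thread and never cleared, so the next session on that thread inherits it. A minimal sketch of the bug and the usual try/finally fix; the class, method, and pool names here are illustrative, not the actual STS code.

```java
public class PoolNameLeak {
    // Stands in for the thread-local LocalProperty holding the pool name.
    private static final ThreadLocal<String> POOL = new ThreadLocal<>();

    // Buggy shape: sets the session's pool but never clears it, so a later
    // session reusing this thread silently inherits the old value.
    static String runBuggy(String poolOrNull) {
        if (poolOrNull != null) POOL.set(poolOrNull);
        return POOL.get(); // null means "use the default pool"
    }

    // Fixed shape: always restore the thread-local when the statement ends.
    static String runFixed(String poolOrNull) {
        if (poolOrNull != null) POOL.set(poolOrNull);
        try {
            return POOL.get();
        } finally {
            POOL.remove();
        }
    }

    public static void main(String[] args) {
        System.out.println(runBuggy("user-pool")); // user-pool
        System.out.println(runBuggy(null));        // user-pool  <- leaked from the previous session
        POOL.remove();                             // reset for the demo
        System.out.println(runFixed("user-pool")); // user-pool
        System.out.println(runFixed(null));        // null (default pool, as expected)
    }
}
```

The fix matters precisely because thrift-server worker threads are pooled and reused across sessions, which is why the second session in the attached screenshots ran in the first user's pool.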
[jira] [Resolved] (SPARK-26992) Fix STS scheduler pool correct delivery
[ https://issues.apache.org/jira/browse/SPARK-26992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26992. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23895 [https://github.com/apache/spark/pull/23895] > Fix STS scheduler pool correct delivery > --- > > Key: SPARK-26992 > URL: https://issues.apache.org/jira/browse/SPARK-26992 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.4.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Minor > Fix For: 3.0.0 > > Attachments: error_session.png, error_stage.png > > > The user sets the value of spark.sql.thriftserver.scheduler.pool. > Spark thrift server saves this value in a thread-local LocalProperty, but > does not clean it up after running, causing other sessions to run with the > previously set pool name. > > For example: > The second session does not manually set the pool name. The default pool name > should be used, but the pool name from the previous user's settings is used. > This is incorrect. > !error_session.png! > > !error_stage.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org