[jira] [Commented] (SPARK-28522) Pass dynamic parameters to custom file input format
[ https://issues.apache.org/jira/browse/SPARK-28522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894942#comment-16894942 ] Ayan Mukherjee commented on SPARK-28522: Thanks for the response, but I am trying to pass some non-Hadoop parameters, which I need to use in my custom file input format. > Pass dynamic parameters to custom file input format > --- > > Key: SPARK-28522 > URL: https://issues.apache.org/jira/browse/SPARK-28522 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.0 >Reporter: Ayan Mukherjee >Priority: Major > > We have developed a custom file input format and call it in PySpark using the > newAPIHadoopFile option. It appears there is no option to pass parameters > dynamically to the custom format. > > rdd2 = sc.newAPIHadoopFile("/abcd/efgh/i1.txt", > "com.test1.TEST2.TESTInputFormat", "org.apache.hadoop.io.Text", > "org.apache.hadoop.io.NullWritable") -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23160) Port window.sql
[ https://issues.apache.org/jira/browse/SPARK-23160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23160: -- Summary: Port window.sql (was: Add window.sql) > Port window.sql > --- > > Key: SPARK-23160 > URL: https://issues.apache.org/jira/browse/SPARK-23160 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Priority: Minor > > In this ticket, we plan to add the regression test cases of > https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/window.sql.
[jira] [Commented] (SPARK-28086) Adds `random()` sql function
[ https://issues.apache.org/jira/browse/SPARK-28086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894922#comment-16894922 ] Dongjoon Hyun commented on SPARK-28086: --- This issue is reported twice, at [~DylanGuedes]'s PR (https://github.com/apache/spark/pull/24881/files#diff-14489bae6b27814d4cde0456a7ae75c8R702) and [~yumwang]'s PR (https://github.com/apache/spark/pull/25163/files#diff-23a3430e0e1ff88830cbb43701da1f2cR402). For me, the PostgreSQL random function is the same as Apache Spark's `rand`: a uniform random returning 0.0 <= x < 1.0. - https://www.postgresql.org/docs/8.2/functions-math.html Spark also accepts `order by rand()`, like the following. {code} spark-sql> SELECT rank() OVER (ORDER BY rank() OVER (ORDER BY rand())); 1 {code} So, let's make an alias and unblock the other issues. I'll make a PR. > Adds `random()` sql function > > > Key: SPARK-28086 > URL: https://issues.apache.org/jira/browse/SPARK-28086 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Priority: Major > > Currently, Spark does not have a `random()` function. Postgres, however, does. > For instance, this one is not valid: > {code:sql} > SELECT rank() OVER (ORDER BY rank() OVER (ORDER BY random())) > {code} > Because of the `random()` call. On the other hand, [Postgres has > it.|https://www.postgresql.org/docs/8.2/functions-math.html]
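The claimed semantics can be illustrated quickly; as a stand-in for Spark's `rand()` / PostgreSQL's `random()` (both uniform over the same half-open interval), this sketch uses Python's `random.random()`:

```python
import random

# Stand-in for Spark's rand() / PostgreSQL's random(): a uniform
# draw over the half-open interval [0.0, 1.0).
samples = [random.random() for _ in range(10_000)]

# Every sample falls in [0.0, 1.0); exactly 1.0 is never returned.
assert all(0.0 <= x < 1.0 for x in samples)
```

Since the two functions share this contract, aliasing `random()` to `rand()` changes nothing observable about the distribution.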
[jira] [Commented] (SPARK-28522) Pass dynamic parameters to custom file input format
[ https://issues.apache.org/jira/browse/SPARK-28522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894909#comment-16894909 ] Hyukjin Kwon commented on SPARK-28522: -- {{sc.hadoopConfiguration.set("my.mapreduce.setting","someVal")}} > Pass dynamic parameters to custom file input format > --- > > Key: SPARK-28522 > URL: https://issues.apache.org/jira/browse/SPARK-28522 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.0 >Reporter: Ayan Mukherjee >Priority: Major > > We have developed a custom file input format and call it in PySpark using the > newAPIHadoopFile option. It appears there is no option to pass parameters > dynamically to the custom format. > > rdd2 = sc.newAPIHadoopFile("/abcd/efgh/i1.txt", > "com.test1.TEST2.TESTInputFormat", "org.apache.hadoop.io.Text", > "org.apache.hadoop.io.NullWritable")
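Besides setting values on `sc.hadoopConfiguration`, PySpark's `newAPIHadoopFile` accepts a `conf=` dictionary whose entries are merged into that job's Hadoop `Configuration`, where the custom `InputFormat` can read them back via `context.getConfiguration().get(...)`. A minimal sketch (the setting names and the helper function are hypothetical; the format class is the one from the report):

```python
# Hypothetical per-job settings that the custom InputFormat would read
# on the JVM side via context.getConfiguration().get(...).
CUSTOM_CONF = {
    "my.custom.delimiter": "|",
    "my.custom.skip.header": "true",
}

def read_with_custom_format(sc, path):
    """Sketch: pass dynamic parameters through the `conf` argument
    of SparkContext.newAPIHadoopFile (per-job, not global)."""
    return sc.newAPIHadoopFile(
        path,
        "com.test1.TEST2.TESTInputFormat",
        "org.apache.hadoop.io.Text",
        "org.apache.hadoop.io.NullWritable",
        conf=CUSTOM_CONF,  # merged into the Hadoop job configuration
    )
```

Unlike mutating `sc.hadoopConfiguration`, the `conf=` route keeps the parameters scoped to the one read, which matters when several reads with different settings share a SparkContext.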
[jira] [Resolved] (SPARK-28471) Formatting dates with negative years
[ https://issues.apache.org/jira/browse/SPARK-28471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28471. --- Resolution: Fixed Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/25230 > Formatting dates with negative years > > > Key: SPARK-28471 > URL: https://issues.apache.org/jira/browse/SPARK-28471 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > While converting dates with negative years to strings, Spark skips the era > sub-field by default. That can confuse users, since years from the BC era are > mirrored into the current era. For example: > {code} > spark-sql> select make_date(-44, 3, 15); > 0045-03-15 > {code} > Even though negative years are outside the range supported by the DATE type, it would be > nice to indicate the era for such dates. > PostgreSQL outputs the era for such inputs: > {code} > # select make_date(-44, 3, 15); >make_date > --- > 0044-03-15 BC > (1 row) > {code}
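The fix being described can be sketched in a few lines. This is a toy Python model, not Spark's formatter, assuming the java.time proleptic convention in which year 0 is 1 BC — which is why proleptic year -44 prints as year-of-era 0045, and why appending the era removes the ambiguity:

```python
def format_date_with_era(proleptic_year, month, day):
    """Format a proleptic-calendar date, appending the era for BC years.

    java.time convention: proleptic year 0 is 1 BC, -1 is 2 BC, so the
    year-of-era for a non-positive proleptic year is (1 - year).
    """
    if proleptic_year <= 0:
        return f"{1 - proleptic_year:04d}-{month:02d}-{day:02d} BC"
    return f"{proleptic_year:04d}-{month:02d}-{day:02d}"

# Without the era suffix, proleptic -44 would print as a bare "0045-03-15"
# and be indistinguishable from the year 45 AD -- the confusion reported above.
assert format_date_with_era(-44, 3, 15) == "0045-03-15 BC"
assert format_date_with_era(2019, 7, 29) == "2019-07-29"
```

Note the "0045" vs PostgreSQL's "0044 BC" in the report is an input-convention difference: PostgreSQL's `make_date(-44, …)` means 44 BC directly, while a proleptic -44 corresponds to 45 BC.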
[jira] [Commented] (SPARK-28546) Why does the File Sink operation of Spark 2.4 Structured Streaming include double-level version validation?
[ https://issues.apache.org/jira/browse/SPARK-28546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894900#comment-16894900 ] Hyukjin Kwon commented on SPARK-28546: -- [~yy3b2007com], questions are better directed to the mailing list. Let's interact there before filing an issue if you're not sure about it. > Why does the File Sink operation of Spark 2.4 Structured Streaming include > double-level version validation? > --- > > Key: SPARK-28546 > URL: https://issues.apache.org/jira/browse/SPARK-28546 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 > Environment: Spark 2.4 > Structured Streaming >Reporter: tommy duan >Priority: Major > > My code is as follows: > {code:java} > Dataset dataset = this.sparkSession.readStream().format("kafka") > .options(this.getSparkKafkaCommonOptions(sparkSession)) > .option("kafka.bootstrap.servers", "192.168.1.1:9092,192.168.1.2:9092") > .option("subscribe", "myTopic1,myTopic2") > .option("startingOffsets", "earliest") > .load();{code} > {code:java} > String mdtTempView = "mybasetemp"; > ExpressionEncoder Rowencoder = this.getSchemaEncoder(new > Schema.Parser().parse(baseschema.getValue())); > Dataset parseVal = dataset.select("value").as(Encoders.BINARY()) > .map(new MapFunction(){ > > }, Rowencoder) > .createOrReplaceGlobalTempView(mdtTempView); > > Dataset queryResult = this.sparkSession.sql("select 。。。 from > global_temp." + mdtTempView + " where start_time<>\"\""); > String savePath= "/user/dx/streaming/data/testapp"; > String checkpointLocation= "/user/dx/streaming/checkpoint/testapp"; > StreamingQuery query = queryResult.writeStream().format("parquet") > .option("path", savePath) > .option("checkpointLocation", checkpointLocation) > .partitionBy("month", "day", "hour") > .outputMode(OutputMode.Append()) > .trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES)) > .start(); > try { > query.awaitTermination(); > } catch (StreamingQueryException e) { > e.printStackTrace(); > } > {code} > > 1) When I first ran it, the app ran normally. > 2) Then, for some reason, I deleted the checkpoint directory of Structured > Streaming but did not delete the savePath of the file sink, which stores the HDFS files. > 3) Then I restarted the app; this time only an executor was assigned after the app > started, and no tasks were assigned. In the log, I found the message: > "INFO streaming.FileStreamSink: Skipping already committed batch 72". Later > I looked at the source code and found that the log was from > [https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L108] > 4) The situation in 3) lasted several hours before the DAGScheduler was > triggered to divide the DAG, submitStages, submitTasks, and tasks were > assigned to the executor. > Later, I read the > [https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala] > code carefully, and realized that FileStreamSink keeps a log > under savePath/_spark_metadata; if the current batchId <= log.getLatest(), it > will skip saving and just output the log: logInfo(s"Skipping already > committed batch $batchId"). > > {code:java} > class FileStreamSink( > sparkSession: SparkSession, > path: String, > fileFormat: FileFormat, > partitionColumnNames: Seq[String], > options: Map[String, String]) extends Sink with Logging { > private val basePath = new Path(path) > private val logPath = new Path(basePath, FileStreamSink.metadataDir) > private val fileLog = > new FileStreamSinkLog(FileStreamSinkLog.VERSION, sparkSession, > logPath.toUri.toString) > > override def addBatch(batchId: Long, data: DataFrame): Unit = { >if (batchId <= fileLog.getLatest().map(_._1).getOrElse(-1L)) { > logInfo(s"Skipping already committed batch $batchId") >} else { > // save file to hdfs >} > } > //... > } > {code} > > I think that since a checkpoint is used, all control over this information should > be given to the checkpoint, and there should not be a separate batchId log > record.
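The guard the reporter traced can be modeled in a few lines. This is a simplified Python stand-in for `FileStreamSink.addBatch` and its `_spark_metadata` log, not Spark's actual classes, but it reproduces the behavior observed in step 3):

```python
class FileSinkCommitLog:
    """Toy stand-in for Spark's FileStreamSinkLog under _spark_metadata."""

    def __init__(self, committed=()):
        self.committed = list(committed)

    def get_latest(self):
        return self.committed[-1] if self.committed else None

    def add(self, batch_id):
        self.committed.append(batch_id)


def add_batch(log, batch_id, write):
    """Mirrors the guard in FileStreamSink.addBatch (simplified)."""
    latest = log.get_latest()
    if latest is not None and batch_id <= latest:
        # "Skipping already committed batch $batchId": this fires even
        # after the checkpoint directory is deleted, because the sink
        # log lives under the output path, not under the checkpoint.
        return False
    write(batch_id)
    log.add(batch_id)
    return True


# After deleting the checkpoint the driver restarts from batch 0, while the
# sink log still records batches 0..72 -- so every early batch is skipped.
log = FileSinkCommitLog(committed=range(73))
written = []
assert add_batch(log, 0, written.append) is False    # skipped
assert add_batch(log, 72, written.append) is False   # skipped
assert add_batch(log, 73, written.append) is True    # finally written
assert written == [73]
```

This is why deleting only the checkpoint is not enough to reprocess data: the sink keeps its own commit record under the output path, and the two logs are intentionally independent (exactly-once output vs. source progress).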
[jira] [Resolved] (SPARK-28546) Why does the File Sink operation of Spark 2.4 Structured Streaming include double-level version validation?
[ https://issues.apache.org/jira/browse/SPARK-28546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28546. -- Resolution: Invalid > Why does the File Sink operation of Spark 2.4 Structured Streaming include > double-level version validation? > --- > > Key: SPARK-28546 > URL: https://issues.apache.org/jira/browse/SPARK-28546 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 > Environment: Spark 2.4 > Structured Streaming >Reporter: tommy duan >Priority: Major > > My code is as follows: > {code:java} > Dataset dataset = this.sparkSession.readStream().format("kafka") > .options(this.getSparkKafkaCommonOptions(sparkSession)) > .option("kafka.bootstrap.servers", "192.168.1.1:9092,192.168.1.2:9092") > .option("subscribe", "myTopic1,myTopic2") > .option("startingOffsets", "earliest") > .load();{code} > {code:java} > String mdtTempView = "mybasetemp"; > ExpressionEncoder Rowencoder = this.getSchemaEncoder(new > Schema.Parser().parse(baseschema.getValue())); > Dataset parseVal = dataset.select("value").as(Encoders.BINARY()) > .map(new MapFunction(){ > > }, Rowencoder) > .createOrReplaceGlobalTempView(mdtTempView); > > Dataset queryResult = this.sparkSession.sql("select 。。。 from > global_temp." + mdtTempView + " where start_time<>\"\""); > String savePath= "/user/dx/streaming/data/testapp"; > String checkpointLocation= "/user/dx/streaming/checkpoint/testapp"; > StreamingQuery query = queryResult.writeStream().format("parquet") > .option("path", savePath) > .option("checkpointLocation", checkpointLocation) > .partitionBy("month", "day", "hour") > .outputMode(OutputMode.Append()) > .trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES)) > .start(); > try { > query.awaitTermination(); > } catch (StreamingQueryException e) { > e.printStackTrace(); > } > {code} > > 1) When I first ran it, the app ran normally. > 2) Then, for some reason, I deleted the checkpoint directory of Structured > Streaming but did not delete the savePath of the file sink, which stores the HDFS files. > 3) Then I restarted the app; this time only an executor was assigned after the app > started, and no tasks were assigned. In the log, I found the message: > "INFO streaming.FileStreamSink: Skipping already committed batch 72". Later > I looked at the source code and found that the log was from > [https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L108] > 4) The situation in 3) lasted several hours before the DAGScheduler was > triggered to divide the DAG, submitStages, submitTasks, and tasks were > assigned to the executor. > Later, I read the > [https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala] > code carefully, and realized that FileStreamSink keeps a log > under savePath/_spark_metadata; if the current batchId <= log.getLatest(), it > will skip saving and just output the log: logInfo(s"Skipping already > committed batch $batchId"). > > {code:java} > class FileStreamSink( > sparkSession: SparkSession, > path: String, > fileFormat: FileFormat, > partitionColumnNames: Seq[String], > options: Map[String, String]) extends Sink with Logging { > private val basePath = new Path(path) > private val logPath = new Path(basePath, FileStreamSink.metadataDir) > private val fileLog = > new FileStreamSinkLog(FileStreamSinkLog.VERSION, sparkSession, > logPath.toUri.toString) > > override def addBatch(batchId: Long, data: DataFrame): Unit = { >if (batchId <= fileLog.getLatest().map(_._1).getOrElse(-1L)) { > logInfo(s"Skipping already committed batch $batchId") >} else { > // save file to hdfs >} > } > //... > } > {code} > > I think that since a checkpoint is used, all control over this information should > be given to the checkpoint, and there should not be a separate batchId log > record.
[jira] [Resolved] (SPARK-28549) Use `text.StringEscapeUtils` instead of `lang3.StringEscapeUtils`
[ https://issues.apache.org/jira/browse/SPARK-28549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28549. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25281 [https://github.com/apache/spark/pull/25281] > Use `text.StringEscapeUtils` instead of `lang3.StringEscapeUtils` > -- > > Key: SPARK-28549 > URL: https://issues.apache.org/jira/browse/SPARK-28549 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > `org.apache.commons.lang3.StringEscapeUtils` was deprecated over two years ago > in LANG-1316. > {code} > /** > * Escapes and unescapes {@code String}s for > * Java, Java Script, HTML and XML. > * > * #ThreadSafe# > * @since 2.0 > * @deprecated as of 3.6, use commons-text > * <a href="https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html"> > * StringEscapeUtils</a> instead > */ > @Deprecated > public class StringEscapeUtils { > {code} > This issue aims to use the latest one from the `commons-text` module, which has > more bug fixes, like > TEXT-100, TEXT-118 and TEXT-120. > {code} > -import org.apache.commons.lang3.StringEscapeUtils > +import org.apache.commons.text.StringEscapeUtils > {code} > This will add a new dependency to the `hadoop-2.7` profile distribution. > {code} > +commons-text-1.6.jar > {code}
[jira] [Assigned] (SPARK-28549) Use `text.StringEscapeUtils` instead of `lang3.StringEscapeUtils`
[ https://issues.apache.org/jira/browse/SPARK-28549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28549: Assignee: Dongjoon Hyun > Use `text.StringEscapeUtils` instead of `lang3.StringEscapeUtils` > -- > > Key: SPARK-28549 > URL: https://issues.apache.org/jira/browse/SPARK-28549 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > `org.apache.commons.lang3.StringEscapeUtils` was deprecated over two years ago > in LANG-1316. > {code} > /** > * Escapes and unescapes {@code String}s for > * Java, Java Script, HTML and XML. > * > * #ThreadSafe# > * @since 2.0 > * @deprecated as of 3.6, use commons-text > * <a href="https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html"> > * StringEscapeUtils</a> instead > */ > @Deprecated > public class StringEscapeUtils { > {code} > This issue aims to use the latest one from the `commons-text` module, which has > more bug fixes, like > TEXT-100, TEXT-118 and TEXT-120. > {code} > -import org.apache.commons.lang3.StringEscapeUtils > +import org.apache.commons.text.StringEscapeUtils > {code} > This will add a new dependency to the `hadoop-2.7` profile distribution. > {code} > +commons-text-1.6.jar > {code}
[jira] [Commented] (SPARK-28519) Tests failed on aarch64 due to the values of math.log and the power function being different
[ https://issues.apache.org/jira/browse/SPARK-28519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894876#comment-16894876 ] huangtianhua commented on SPARK-28519: -- Sorry, I didn't see that you had already proposed a PR; thank you very much. > Tests failed on aarch64 due to the values of math.log and the power function > being different > - > > Key: SPARK-28519 > URL: https://issues.apache.org/jira/browse/SPARK-28519 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Major > > Sorry to disturb again: we ran unit tests on an arm64 instance, and other > SQL tests failed: > {code} > - pgSQL/float8.sql *** FAILED *** > Expected "{color:#f691b2}0.549306144334054[9]{color}", but got > "{color:#f691b2}0.549306144334054[8]{color}" Result did not match for query > #56 > SELECT atanh(double('0.5')) (SQLQueryTestSuite.scala:362) > - pgSQL/numeric.sql *** FAILED *** > Expected "2 {color:#59afe1}2247902679199174[72{color} > 224790267919917955.1326161858 > 4 7405685069595001 7405685069594999.0773399947 > 5 5068226527.321263 5068226527.3212726541 > 6 281839893606.99365 281839893606.9937234336 > 7 {color:#d04437}1716699575118595840{color} 1716699575118597095.4233081991 > 8 167361463828.0749 167361463828.0749132007 > 9 {color:#14892c}107511333880051856]{color} 107511333880052007", but got > "2 {color:#59afe1}2247902679199174[40{color} 224790267919917955.1326161858 > 4 7405685069595001 7405685069594999.0773399947 > 5 5068226527.321263 5068226527.3212726541 > 6 281839893606.99365 281839893606.9937234336 > 7 {color:#d04437}1716699575118595580{color} 1716699575118597095.4233081991 > 8 167361463828.0749 167361463828.0749132007 > 9 {color:#14892c}107511333880051872]{color} 107511333880052007" Result > did not match for query #496 > SELECT t1.id1, t1.result, t2.expected > FROM num_result t1, num_exp_power_10_ln t2 > WHERE t1.id1 = t2.id > AND t1.result != t2.expected (SQLQueryTestSuite.scala:362) > {code} > The first test failed because the value of math.log(3.0) is different on > aarch64: > # on x86_64: > {code} > scala> math.log(3.0) > res50: Double = 1.0986122886681098 > {code} > # on aarch64: > {code} > scala> math.log(3.0) > res19: Double = 1.0986122886681096 > {code} > I tried {{math.log(4.0)}} and {{math.log(5.0)}} and they are the same; I don't > know why {{math.log(3.0)}} is so special, but the result is indeed different > on aarch64. > The second test failed because some values of pow() are different on aarch64. > Following the test, I ran checks on aarch64 and x86_64; take '-83028485' > as an example: > # on x86_64: > {code} > scala> import java.lang.Math._ > import java.lang.Math._ > scala> abs(-83028485) > res3: Int = 83028485 > scala> var a = -83028485 > a: Int = -83028485 > scala> abs(a) > res4: Int = 83028485 > scala> math.log(abs(a)) > res5: Double = 18.234694299654787 > scala> pow(10, math.log(abs(a))) > res6: Double ={color:#d04437} 1.71669957511859584E18{color} > {code} > # on aarch64: > {code} > scala> var a = -83028485 > a: Int = -83028485 > scala> abs(a) > res38: Int = 83028485 > scala> math.log(abs(a)) > res39: Double = 18.234694299654787 > scala> pow(10, math.log(abs(a))) > res40: Double = 1.71669957511859558E18 > {code}
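Why a last-bit discrepancy in `log` matters can be sketched in Python (the two `log(3.0)` values are taken from the report above; `math.nextafter` requires Python 3.9+):

```python
import math

# Values reported for math.log(3.0) on x86_64 and aarch64: they differ
# only in the last bit of the mantissa (1 ulp), i.e. both are correctly
# rounded to within the usual libm error bound.
x86_log3 = 1.0986122886681098
aarch_log3 = 1.0986122886681096
assert abs(x86_log3 - aarch_log3) < 1e-15

# pow(10, x) amplifies a 1-ulp difference in the exponent: near
# x = log(83028485) ≈ 18.23 the result is ~1.7e18, so nudging x by a
# single representable step moves the result by thousands of units --
# which is why the numeric.sql expectations diverge in the low digits.
e = 18.234694299654787
amplified = abs(math.pow(10.0, e) - math.pow(10.0, math.nextafter(e, 0.0)))
assert amplified > 1000.0
```

This is the usual argument for comparing floating-point test results with a relative tolerance rather than digit-for-digit, since IEEE 754 does not require transcendental functions like `log` and `pow` to be correctly rounded across platforms.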
[jira] [Commented] (SPARK-28519) Tests failed on aarch64 due to the values of math.log and the power function being different
[ https://issues.apache.org/jira/browse/SPARK-28519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894854#comment-16894854 ] huangtianhua commented on SPARK-28519: -- Thank you all. I will test with modification and to see whether there are other similar tests fail, and will address them togother in one pull request. > Tests failed on aarch64 due to the values of math.log and the power function > being different > - > > Key: SPARK-28519 > URL: https://issues.apache.org/jira/browse/SPARK-28519 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Major > > Sorry to disturb again: we ran unit tests on an arm64 instance, and other > SQL tests failed: > {code} > - pgSQL/float8.sql *** FAILED *** > Expected "{color:#f691b2}0.549306144334054[9]{color}", but got > "{color:#f691b2}0.549306144334054[8]{color}" Result did not match for query > #56 > SELECT atanh(double('0.5')) (SQLQueryTestSuite.scala:362) > - pgSQL/numeric.sql *** FAILED *** > Expected "2 {color:#59afe1}2247902679199174[72{color} > 224790267919917955.1326161858 > 4 7405685069595001 7405685069594999.0773399947 > 5 5068226527.321263 5068226527.3212726541 > 6 281839893606.99365 281839893606.9937234336 > 7 {color:#d04437}1716699575118595840{color} 1716699575118597095.4233081991 > 8 167361463828.0749 167361463828.0749132007 > 9 {color:#14892c}107511333880051856]{color} 107511333880052007", but got > "2 {color:#59afe1}2247902679199174[40{color} 224790267919917955.1326161858 > 4 7405685069595001 7405685069594999.0773399947 > 5 5068226527.321263 5068226527.3212726541 > 6 281839893606.99365 281839893606.9937234336 > 7 {color:#d04437}1716699575118595580{color} 1716699575118597095.4233081991 > 8 167361463828.0749 167361463828.0749132007 > 9 {color:#14892c}107511333880051872]{color} 107511333880052007" Result > did not match for query #496 > SELECT t1.id1, t1.result, t2.expected > FROM num_result t1, num_exp_power_10_ln t2 > WHERE t1.id1 = t2.id > AND t1.result != t2.expected (SQLQueryTestSuite.scala:362) > {code} > The first test failed because the value of math.log(3.0) is different on > aarch64: > # on x86_64: > {code} > scala> math.log(3.0) > res50: Double = 1.0986122886681098 > {code} > # on aarch64: > {code} > scala> math.log(3.0) > res19: Double = 1.0986122886681096 > {code} > I tried {{math.log(4.0)}} and {{math.log(5.0)}} and they are the same; I don't > know why {{math.log(3.0)}} is so special, but the result is indeed different > on aarch64. > The second test failed because some values of pow() are different on aarch64. > Following the test, I ran checks on aarch64 and x86_64; take '-83028485' > as an example: > # on x86_64: > {code} > scala> import java.lang.Math._ > import java.lang.Math._ > scala> abs(-83028485) > res3: Int = 83028485 > scala> var a = -83028485 > a: Int = -83028485 > scala> abs(a) > res4: Int = 83028485 > scala> math.log(abs(a)) > res5: Double = 18.234694299654787 > scala> pow(10, math.log(abs(a))) > res6: Double ={color:#d04437} 1.71669957511859584E18{color} > {code} > # on aarch64: > {code} > scala> var a = -83028485 > a: Int = -83028485 > scala> abs(a) > res38: Int = 83028485 > scala> math.log(abs(a)) > res39: Double = 18.234694299654787 > scala> pow(10, math.log(abs(a))) > res40: Double = 1.71669957511859558E18 > {code}
[jira] [Comment Edited] (SPARK-28519) Tests failed on aarch64 due to the values of math.log and the power function being different
[ https://issues.apache.org/jira/browse/SPARK-28519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894854#comment-16894854 ] huangtianhua edited comment on SPARK-28519 at 7/29/19 1:40 AM: --- Thank you all. I will test with modification and to see whether there are other similar tests fail, and will address them together in one pull request. was (Author: huangtianhua): Thank you all. I will test with modification and to see whether there are other similar tests fail, and will address them togother in one pull request. > Tests failed on aarch64 due to the values of math.log and the power function > being different > - > > Key: SPARK-28519 > URL: https://issues.apache.org/jira/browse/SPARK-28519 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Major > > Sorry to disturb again: we ran unit tests on an arm64 instance, and other > SQL tests failed: > {code} > - pgSQL/float8.sql *** FAILED *** > Expected "{color:#f691b2}0.549306144334054[9]{color}", but got > "{color:#f691b2}0.549306144334054[8]{color}" Result did not match for query > #56 > SELECT atanh(double('0.5')) (SQLQueryTestSuite.scala:362) > - pgSQL/numeric.sql *** FAILED *** > Expected "2 {color:#59afe1}2247902679199174[72{color} > 224790267919917955.1326161858 > 4 7405685069595001 7405685069594999.0773399947 > 5 5068226527.321263 5068226527.3212726541 > 6 281839893606.99365 281839893606.9937234336 > 7 {color:#d04437}1716699575118595840{color} 1716699575118597095.4233081991 > 8 167361463828.0749 167361463828.0749132007 > 9 {color:#14892c}107511333880051856]{color} 107511333880052007", but got > "2 {color:#59afe1}2247902679199174[40{color} 224790267919917955.1326161858 > 4 7405685069595001 7405685069594999.0773399947 > 5 5068226527.321263 5068226527.3212726541 > 6 281839893606.99365 281839893606.9937234336 > 7 {color:#d04437}1716699575118595580{color} 1716699575118597095.4233081991 > 8 167361463828.0749 167361463828.0749132007 > 9 {color:#14892c}107511333880051872]{color} 107511333880052007" Result > did not match for query #496 > SELECT t1.id1, t1.result, t2.expected > FROM num_result t1, num_exp_power_10_ln t2 > WHERE t1.id1 = t2.id > AND t1.result != t2.expected (SQLQueryTestSuite.scala:362) > {code} > The first test failed because the value of math.log(3.0) is different on > aarch64: > # on x86_64: > {code} > scala> math.log(3.0) > res50: Double = 1.0986122886681098 > {code} > # on aarch64: > {code} > scala> math.log(3.0) > res19: Double = 1.0986122886681096 > {code} > I tried {{math.log(4.0)}} and {{math.log(5.0)}} and they are the same; I don't > know why {{math.log(3.0)}} is so special, but the result is indeed different > on aarch64. > The second test failed because some values of pow() are different on aarch64. > Following the test, I ran checks on aarch64 and x86_64; take '-83028485' > as an example: > # on x86_64: > {code} > scala> import java.lang.Math._ > import java.lang.Math._ > scala> abs(-83028485) > res3: Int = 83028485 > scala> var a = -83028485 > a: Int = -83028485 > scala> abs(a) > res4: Int = 83028485 > scala> math.log(abs(a)) > res5: Double = 18.234694299654787 > scala> pow(10, math.log(abs(a))) > res6: Double ={color:#d04437} 1.71669957511859584E18{color} > {code} > # on aarch64: > {code} > scala> var a = -83028485 > a: Int = -83028485 > scala> abs(a) > res38: Int = 83028485 > scala> math.log(abs(a)) > res39: Double = 18.234694299654787 > scala> pow(10, math.log(abs(a))) > res40: Double = 1.71669957511859558E18 > {code}
[jira] [Resolved] (SPARK-28547) Make it work for wide (> 10K columns data)
[ https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-28547. -- Resolution: Invalid > Make it work for wide (> 10K columns data) > -- > > Key: SPARK-28547 > URL: https://issues.apache.org/jira/browse/SPARK-28547 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4, 2.4.3 > Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per > node, 32 cores (tried different configurations of executors) >Reporter: antonkulaga >Priority: Critical > > Spark is super-slow for all wide data (when there are >15k columns and >15k > rows). Most genomics/transcriptomics data is wide, because the number of > genes is usually >20k and the number of samples is as well. The very popular GTEX > dataset is a good example (see for instance the RNA-Seq data at > https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is > just a .tsv file with two comment lines at the beginning). Everything done on wide > tables (even simple "describe" functions applied to all the gene columns) > either takes hours or gets frozen (because of lost executors), irrespective of > memory and number of cores, while the same operations work fast (minutes) > and well with pure pandas (without any Spark involved).
[jira] [Commented] (SPARK-28547) Make it work for wide (> 10K columns data)
[ https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894839#comment-16894839 ] Takeshi Yamamuro commented on SPARK-28547: -- You need to ask on the dev mailing list first to narrow down the issue. We can do nothing based on the current description. > Make it work for wide (> 10K columns data) > -- > > Key: SPARK-28547 > URL: https://issues.apache.org/jira/browse/SPARK-28547 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4, 2.4.3 > Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per > node, 32 cores (tried different configurations of executors) >Reporter: antonkulaga >Priority: Critical > > Spark is super-slow for all wide data (when there are >15k columns and >15k > rows). Most genomics/transcriptomics data is wide, because the number of > genes is usually >20k and the number of samples is as well. The very popular GTEX > dataset is a good example (see for instance the RNA-Seq data at > https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is > just a .tsv file with two comment lines at the beginning). Everything done on wide > tables (even simple "describe" functions applied to all the gene columns) > either takes hours or gets frozen (because of lost executors), irrespective of > memory and number of cores, while the same operations work fast (minutes) > and well with pure pandas (without any Spark involved).
[jira] [Resolved] (SPARK-28520) WholeStageCodegen does not work properly for LocalTableScanExec
[ https://issues.apache.org/jira/browse/SPARK-28520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-28520. -- Resolution: Fixed Fix Version/s: 3.0.0 Target Version/s: (was: 3.0.0) Resolved by [https://github.com/apache/spark/pull/25260|https://github.com/apache/spark/pull/25260#issuecomment-515752501] > WholeStageCodegen does not work properly for LocalTableScanExec > --- > > Key: SPARK-28520 > URL: https://issues.apache.org/jira/browse/SPARK-28520 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.0.0 > > > Code is not generated for LocalTableScanExec even in situations where it should be. > If a LocalTableScanExec plan has a direct parent plan which supports > WholeStageCodegen, > the LocalTableScanExec plan should also be within the WholeStageCodegen domain. > But for now, code is not generated for LocalTableScanExec and an InputAdapter is inserted > instead. > {code} > val df1 = spark.createDataset(1 to 10).toDF > val df2 = spark.createDataset(1 to 10).toDF > val df3 = df1.join(df2, df1("value") === df2("value")) > df3.explain(true) > ... > == Physical Plan == > *(1) BroadcastHashJoin [value#1], [value#6], Inner, BuildRight > :- LocalTableScan [value#1] // > LocalTableScanExec is not within a WholeStageCodegen domain > +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, > false] as bigint))) >+- LocalTableScan [value#6] > {code} > {code} > scala> df3.queryExecution.executedPlan.children.head.children.head.getClass > res4: Class[_ <: org.apache.spark.sql.execution.SparkPlan] = class > org.apache.spark.sql.execution.InputAdapter > {code} > In the current implementation of LocalTableScanExec, codegen is enabled in > case `parent` is not null, > but `parent` is set in `consume`, which is called after `insertInputAdapter`, > so it doesn't work as intended. 
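The ordering problem described in the ticket can be sketched abstractly. The following is an illustrative Python model, not Spark's actual internals: all names (`Node`, `supports_codegen`, `insert_input_adapter`) are invented, but the logic mirrors the report, which is that the codegen check depends on `parent`, and `parent` is still unset when the adapter-insertion pass runs.

```python
# Hypothetical, simplified model of the ordering bug: the codegen-support
# check reads `parent`, but `parent` is only assigned later (in consume()),
# after insert_input_adapter() has already wrapped the scan.

class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.parent = None  # set later, during consume()

    def supports_codegen(self):
        # Buggy check: `parent` is still None when the adapter pass runs.
        return self.parent is not None

def insert_input_adapter(node):
    # Runs BEFORE any consume() call, so the check above always fails here
    # and the local scan gets wrapped in an adapter.
    if not node.supports_codegen():
        return Node("InputAdapter", [node])
    return node

scan = Node("LocalTableScan")
wrapped = insert_input_adapter(scan)
print(wrapped.name)  # InputAdapter: codegen is skipped even though the scan
                     # could have joined its parent's codegen domain
```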
-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25474) Support `spark.sql.statistics.fallBackToHdfs` in data source tables
[ https://issues.apache.org/jira/browse/SPARK-25474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25474: - Assignee: shahid > Support `spark.sql.statistics.fallBackToHdfs` in data source tables > --- > > Key: SPARK-25474 > URL: https://issues.apache.org/jira/browse/SPARK-25474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.3 > Environment: Spark 2.3.1 > Hadoop 2.7.2 >Reporter: Ayush Anubhava >Assignee: shahid >Priority: Major > Fix For: 3.0.0 > > > *Description :* Size in bytes of the query is coming in EB in case of parquet > datasource. this would impact the performance , since join queries would > always go as Sort Merge Join. > *Precondition :* spark.sql.statistics.fallBackToHdfs = true > Steps: > {code:java} > 0: jdbc:hive2://10.xx:23040/default> create table t1110 (a int, b string) > using parquet PARTITIONED BY (b) ; > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (2,'b'); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (1,'a'); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.xx.xx:23040/default> select * from t1110; > +++--+ > | a | b | > +++--+ > | 1 | a | > | 2 | b | > +++--+ > {code} > *{color:#d04437}Cost of the query shows sizeInBytes in EB{color}* > {code:java} > explain cost select * from t1110; > | == Optimized Logical Plan == > Relation[a#23,b#24] parquet, Statistics(sizeInBytes=8.0 EB, hints=none) > == Physical Plan == > *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: Parquet, > Location: > CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], > PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: > struct | > {code} > *{color:#d04437}This would lead to Sort Merge Join in case of join > query{color}* > {code:java} > 0: jdbc:hive2://10.xx.xx:23040/default> create table t110 (a 
int, b string) > using parquet PARTITIONED BY (b) ; > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.xx.xx:23040/default> insert into t110 values (1,'a'); > +-+--+ > | Result | > +-+--+ > +-+--+ > explain select * from t1110 t1 join t110 t2 on t1.a=t2.a; > | == Physical Plan == > *(5) SortMergeJoin [a#23], [a#55], Inner > :- *(2) Sort [a#23 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(a#23, 200) > : +- *(1) Project [a#23, b#24] > : +- *(1) Filter isnotnull(a#23) > : +- *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: > Parquet, Location: > CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], > PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(a)], > ReadSchema: struct > +- *(4) Sort [a#55 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(a#55, 200) > +- *(3) Project [a#55, b#56] > +- *(3) Filter isnotnull(a#55) > +- *(3) FileScan parquet open.t110[a#55,b#56] Batched: true, Format: Parquet, > Location: > CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t110], > PartitionCount: 1, PartitionFilters: [], PushedFilters: [IsNotNull(a)], > ReadSchema: struct | > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25474) Support `spark.sql.statistics.fallBackToHdfs` in data source tables
[ https://issues.apache.org/jira/browse/SPARK-25474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25474. --- Resolution: Fixed Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/22502. > Support `spark.sql.statistics.fallBackToHdfs` in data source tables > --- > > Key: SPARK-25474 > URL: https://issues.apache.org/jira/browse/SPARK-25474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.3 > Environment: Spark 2.3.1 > Hadoop 2.7.2 >Reporter: Ayush Anubhava >Priority: Major > Fix For: 3.0.0 > > > *Description :* Size in bytes of the query is coming in EB in case of parquet > datasource. this would impact the performance , since join queries would > always go as Sort Merge Join. > *Precondition :* spark.sql.statistics.fallBackToHdfs = true > Steps: > {code:java} > 0: jdbc:hive2://10.xx:23040/default> create table t1110 (a int, b string) > using parquet PARTITIONED BY (b) ; > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (2,'b'); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (1,'a'); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.xx.xx:23040/default> select * from t1110; > +++--+ > | a | b | > +++--+ > | 1 | a | > | 2 | b | > +++--+ > {code} > *{color:#d04437}Cost of the query shows sizeInBytes in EB{color}* > {code:java} > explain cost select * from t1110; > | == Optimized Logical Plan == > Relation[a#23,b#24] parquet, Statistics(sizeInBytes=8.0 EB, hints=none) > == Physical Plan == > *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: Parquet, > Location: > CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], > PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: > struct | > {code} > *{color:#d04437}This would lead to Sort Merge Join in case of join > query{color}* > 
{code:java} > 0: jdbc:hive2://10.xx.xx:23040/default> create table t110 (a int, b string) > using parquet PARTITIONED BY (b) ; > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.xx.xx:23040/default> insert into t110 values (1,'a'); > +-+--+ > | Result | > +-+--+ > +-+--+ > explain select * from t1110 t1 join t110 t2 on t1.a=t2.a; > | == Physical Plan == > *(5) SortMergeJoin [a#23], [a#55], Inner > :- *(2) Sort [a#23 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(a#23, 200) > : +- *(1) Project [a#23, b#24] > : +- *(1) Filter isnotnull(a#23) > : +- *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: > Parquet, Location: > CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], > PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(a)], > ReadSchema: struct > +- *(4) Sort [a#55 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(a#55, 200) > +- *(3) Project [a#55, b#56] > +- *(3) Filter isnotnull(a#55) > +- *(3) FileScan parquet open.t110[a#55,b#56] Batched: true, Format: Parquet, > Location: > CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t110], > PartitionCount: 1, PartitionFilters: [], PushedFilters: [IsNotNull(a)], > ReadSchema: struct | > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25474) Support `spark.sql.statistics.fallBackToHdfs` in data source tables
[ https://issues.apache.org/jira/browse/SPARK-25474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25474: -- Summary: Support `spark.sql.statistics.fallBackToHdfs` in data source tables (was: Size in bytes of the query is coming in EB in case of parquet datasource) > Support `spark.sql.statistics.fallBackToHdfs` in data source tables > --- > > Key: SPARK-25474 > URL: https://issues.apache.org/jira/browse/SPARK-25474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.3 > Environment: Spark 2.3.1 > Hadoop 2.7.2 >Reporter: Ayush Anubhava >Priority: Major > > *Description :* Size in bytes of the query is coming in EB in case of parquet > datasource. this would impact the performance , since join queries would > always go as Sort Merge Join. > *Precondition :* spark.sql.statistics.fallBackToHdfs = true > Steps: > {code:java} > 0: jdbc:hive2://10.xx:23040/default> create table t1110 (a int, b string) > using parquet PARTITIONED BY (b) ; > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (2,'b'); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (1,'a'); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.xx.xx:23040/default> select * from t1110; > +++--+ > | a | b | > +++--+ > | 1 | a | > | 2 | b | > +++--+ > {code} > *{color:#d04437}Cost of the query shows sizeInBytes in EB{color}* > {code:java} > explain cost select * from t1110; > | == Optimized Logical Plan == > Relation[a#23,b#24] parquet, Statistics(sizeInBytes=8.0 EB, hints=none) > == Physical Plan == > *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: Parquet, > Location: > CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], > PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: > struct | > {code} > *{color:#d04437}This would lead to Sort Merge Join in case 
of join > query{color}* > {code:java} > 0: jdbc:hive2://10.xx.xx:23040/default> create table t110 (a int, b string) > using parquet PARTITIONED BY (b) ; > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.xx.xx:23040/default> insert into t110 values (1,'a'); > +-+--+ > | Result | > +-+--+ > +-+--+ > explain select * from t1110 t1 join t110 t2 on t1.a=t2.a; > | == Physical Plan == > *(5) SortMergeJoin [a#23], [a#55], Inner > :- *(2) Sort [a#23 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(a#23, 200) > : +- *(1) Project [a#23, b#24] > : +- *(1) Filter isnotnull(a#23) > : +- *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: > Parquet, Location: > CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], > PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(a)], > ReadSchema: struct > +- *(4) Sort [a#55 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(a#55, 200) > +- *(3) Project [a#55, b#56] > +- *(3) Filter isnotnull(a#55) > +- *(3) FileScan parquet open.t110[a#55,b#56] Batched: true, Format: Parquet, > Location: > CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t110], > PartitionCount: 1, PartitionFilters: [], PushedFilters: [IsNotNull(a)], > ReadSchema: struct | > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
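The size-estimation behavior behind SPARK-25474 can be sketched in a few lines. This is a hedged, illustrative model (names like `estimated_size` are invented, not Spark's API): when catalog statistics are absent, the planner either falls back to summing file sizes on HDFS or assumes a pessimistic maximum, which is why the plan above reports 8.0 EB and a sort-merge join is chosen instead of a broadcast join.

```python
# Illustrative sketch of the statistics fallback, not Spark's real code.

DEFAULT_SIZE_IN_BYTES = 2**63 - 1  # "unknown" sentinel: roughly 8.0 EB

def estimated_size(catalog_stats_bytes, hdfs_file_bytes, fall_back_to_hdfs):
    if catalog_stats_bytes is not None:
        return catalog_stats_bytes       # stats exist (e.g. ANALYZE TABLE ran)
    if fall_back_to_hdfs:
        return sum(hdfs_file_bytes)      # fall back to file sizes on HDFS
    return DEFAULT_SIZE_IN_BYTES         # pessimistic default

broadcast_threshold = 10 * 1024 * 1024   # 10 MB, the usual default threshold

# Without the fallback, a tiny table looks enormous -> sort-merge join.
size = estimated_size(None, [4096, 4096], fall_back_to_hdfs=False)
print(size > broadcast_threshold)        # True

# With the fallback, the real size is used -> broadcast join is possible.
size = estimated_size(None, [4096, 4096], fall_back_to_hdfs=True)
print(size <= broadcast_threshold)       # True
```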
[jira] [Updated] (SPARK-28306) Once optimizer rule NormalizeFloatingNumbers is not idempotent
[ https://issues.apache.org/jira/browse/SPARK-28306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28306: Issue Type: Sub-task (was: Improvement) Parent: SPARK-28528 > Once optimizer rule NormalizeFloatingNumbers is not idempotent > -- > > Key: SPARK-28306 > URL: https://issues.apache.org/jira/browse/SPARK-28306 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yesheng Ma >Assignee: Yesheng Ma >Priority: Major > Fix For: 3.0.0 > > > When the rule NormalizeFloatingNumbers is called multiple times, it will add > additional transform operator to an expression, which is not appropriate. To > fix it, we have to make it idempotent, i.e. yield the same logical plan > regardless of multiple runs. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
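The non-idempotence described above can be illustrated with a toy rule. This is a hedged sketch (the tuple-based "expression tree" is invented for illustration): a rule that unconditionally wraps an expression adds another `normalize` layer on every run, whereas an idempotent rule checks first and yields the same tree no matter how many times it is applied.

```python
# Toy model of an optimizer rule that wraps expressions in a normalize node.

def naive_rule(expr):
    # Non-idempotent: always adds another layer.
    return ("normalize", expr)

def idempotent_rule(expr):
    # Idempotent: a no-op if the expression is already normalized.
    if isinstance(expr, tuple) and expr[0] == "normalize":
        return expr
    return ("normalize", expr)

e = "float_col"
print(naive_rule(naive_rule(e)))            # ('normalize', ('normalize', 'float_col'))
print(idempotent_rule(idempotent_rule(e)))  # ('normalize', 'float_col')
```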
[jira] [Updated] (SPARK-28237) Idempotence checker for Idempotent batches in RuleExecutors
[ https://issues.apache.org/jira/browse/SPARK-28237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28237: Issue Type: Sub-task (was: Improvement) Parent: SPARK-28528 > Idempotence checker for Idempotent batches in RuleExecutors > --- > > Key: SPARK-28237 > URL: https://issues.apache.org/jira/browse/SPARK-28237 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yesheng Ma >Assignee: Yesheng Ma >Priority: Major > Fix For: 3.0.0 > > > The current {{RuleExecutor}} system contains two kinds of strategies: > {{Once}} and {{FixedPoint}}. The {{Once}} strategy is supposed to run once. > However, for particular rules (e.g. PullOutNondeterministic), they are > designed to be idempotent, but Spark currently lacks corresponding mechanism > to prevent such kind of non-idempotent behavior from happening. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
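The checker the ticket asks for can be sketched simply: run a Once batch, then run it a second time and fail if the result changed. The code below is an illustrative Python model with invented names (`run_batch`, `run_once_with_check`), not the actual `RuleExecutor` implementation.

```python
# Sketch of an idempotence checker for Once batches: apply the batch twice
# and raise if the second application changes the plan.

def run_batch(rules, plan):
    for rule in rules:
        plan = rule(plan)
    return plan

def run_once_with_check(rules, plan):
    result = run_batch(rules, plan)
    recheck = run_batch(rules, result)
    if recheck != result:
        raise AssertionError("Once batch is not idempotent")
    return result

strip_dups = lambda p: tuple(dict.fromkeys(p))  # idempotent rule
add_marker = lambda p: p + ("marker",)          # non-idempotent rule

print(run_once_with_check([strip_dups], ("a", "a", "b")))  # ('a', 'b')
try:
    run_once_with_check([add_marker], ("a",))
except AssertionError as err:
    print(err)                                  # Once batch is not idempotent
```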
[jira] [Commented] (SPARK-28377) Fully support correlation names in the FROM clause
[ https://issues.apache.org/jira/browse/SPARK-28377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894788#comment-16894788 ] Dongjoon Hyun commented on SPARK-28377: --- You can increase the priority if you want, [~yumwang]. > Fully support correlation names in the FROM clause > -- > > Key: SPARK-28377 > URL: https://issues.apache.org/jira/browse/SPARK-28377 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Specifying a list of column names is not fully support. Example: > {code:sql} > create or replace temporary view J1_TBL as select * from > (values (1, 4, 'one'), (2, 3, 'two')) > as v(i, j, t); > create or replace temporary view J2_TBL as select * from > (values (1, -1), (2, 2)) > as v(i, k); > SELECT '' AS xxx, t1.a, t2.e > FROM J1_TBL t1 (a, b, c), J2_TBL t2 (d, e) > WHERE t1.a = t2.d; > {code} > PostgreSQL: > {noformat} > postgres=# SELECT '' AS xxx, t1.a, t2.e > postgres-# FROM J1_TBL t1 (a, b, c), J2_TBL t2 (d, e) > postgres-# WHERE t1.a = t2.d; > xxx | a | e > -+---+ > | 1 | -1 > | 2 | 2 > (2 rows) > {noformat} > Spark SQL: > {noformat} > spark-sql> SELECT '' AS xxx, t1.a, t2.e > > FROM J1_TBL t1 (a, b, c), J2_TBL t2 (d, e) > > WHERE t1.a = t2.d; > Error in query: cannot resolve '`t1.a`' given input columns: [a, b, c, d, e]; > line 3 pos 8; > 'Project [ AS xxx#21, 't1.a, 't2.e] > +- 'Filter ('t1.a = 't2.d) >+- Join Inner > :- Project [i#14 AS a#22, j#15 AS b#23, t#16 AS c#24] > : +- SubqueryAlias `t1` > : +- SubqueryAlias `j1_tbl` > :+- Project [i#14, j#15, t#16] > : +- Project [col1#11 AS i#14, col2#12 AS j#15, col3#13 AS > t#16] > : +- SubqueryAlias `v` > : +- LocalRelation [col1#11, col2#12, col3#13] > +- Project [i#19 AS d#25, k#20 AS e#26] > +- SubqueryAlias `t2` > +- SubqueryAlias `j2_tbl` >+- Project [i#19, k#20] > +- Project [col1#17 AS i#19, col2#18 AS k#20] > +- SubqueryAlias `v` > +- LocalRelation [col1#17, col2#18] > {noformat} > > *Feature ID*: 
E051-08 > [https://www.postgresql.org/docs/11/sql-expressions.html] > [https://www.ibm.com/support/knowledgecenter/en/SSEPEK_10.0.0/sqlref/src/tpc/db2z_correlationnames.html] -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
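What the FROM-clause column list is supposed to do can be modeled outside SQL: a correlation name with a column list renames the relation and its columns before the rest of the query resolves against them. The sketch below reproduces the PostgreSQL result from the ticket using plain Python dicts (`with_correlation` is an invented helper, purely illustrative).

```python
# Model of `FROM J1_TBL t1 (a, b, c), J2_TBL t2 (d, e) WHERE t1.a = t2.d`:
# first rename the columns, then join on the renamed names.

def with_correlation(rows, old_cols, new_cols):
    mapping = dict(zip(old_cols, new_cols))
    return [{mapping[k]: v for k, v in row.items()} for row in rows]

j1 = [{"i": 1, "j": 4, "t": "one"}, {"i": 2, "j": 3, "t": "two"}]
j2 = [{"i": 1, "k": -1}, {"i": 2, "k": 2}]

t1 = with_correlation(j1, ["i", "j", "t"], ["a", "b", "c"])
t2 = with_correlation(j2, ["i", "k"], ["d", "e"])

result = [(r1["a"], r2["e"]) for r1 in t1 for r2 in t2 if r1["a"] == r2["d"]]
print(result)  # [(1, -1), (2, 2)] -- matches the PostgreSQL output above
```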
[jira] [Updated] (SPARK-28549) Use `text.StringEscapeUtils` instead `lang3.StringEscapeUtils`
[ https://issues.apache.org/jira/browse/SPARK-28549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28549: -- Description: `org.apache.commons.lang3.StringEscapeUtils` is deprecated over two years ago at LANG-1316. {code} /** * Escapes and unescapes {@code String}s for * Java, Java Script, HTML and XML. * * #ThreadSafe# * @since 2.0 * @deprecated as of 3.6, use commons-text * https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html;> * StringEscapeUtils instead */ @Deprecated public class StringEscapeUtils { {code} This issue aims to use the latest one from `commons-text` module which has more bug fixes like TEXT-100, TEXT-118 and TEXT-120. {code} -import org.apache.commons.lang3.StringEscapeUtils +import org.apache.commons.text.StringEscapeUtils {code} This will add a new dependency to `hadoop-2.7` profile distribution. {code} +commons-text-1.6.jar {code} was: `org.apache.commons.lang3.StringEscapeUtils` is deprecated over two years ago at LANG-1316. {code} /** * Escapes and unescapes {@code String}s for * Java, Java Script, HTML and XML. * * #ThreadSafe# * @since 2.0 * @deprecated as of 3.6, use commons-text * https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html;> * StringEscapeUtils instead */ @Deprecated public class StringEscapeUtils { {code} This issue aims to use the latest one from `commons-text` module which has more bug fixes like TEXT-100, TEXT-118 and TEXT-120. 
{code} -import org.apache.commons.lang3.StringEscapeUtils +import org.apache.commons.text.StringEscapeUtils {code} > Use `text.StringEscapeUtils` instead `lang3.StringEscapeUtils` > -- > > Key: SPARK-28549 > URL: https://issues.apache.org/jira/browse/SPARK-28549 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > `org.apache.commons.lang3.StringEscapeUtils` is deprecated over two years ago > at LANG-1316. > {code} > /** > * Escapes and unescapes {@code String}s for > * Java, Java Script, HTML and XML. > * > * #ThreadSafe# > * @since 2.0 > * @deprecated as of 3.6, use commons-text > * href="https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html;> > * StringEscapeUtils instead > */ > @Deprecated > public class StringEscapeUtils { > {code} > This issue aims to use the latest one from `commons-text` module which has > more bug fixes like > TEXT-100, TEXT-118 and TEXT-120. > {code} > -import org.apache.commons.lang3.StringEscapeUtils > +import org.apache.commons.text.StringEscapeUtils > {code} > This will add a new dependency to `hadoop-2.7` profile distribution. > {code} > +commons-text-1.6.jar > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28549) Use `text.StringEscapeUtils` instead `lang3.StringEscapeUtils`
Dongjoon Hyun created SPARK-28549: - Summary: Use `text.StringEscapeUtils` instead `lang3.StringEscapeUtils` Key: SPARK-28549 URL: https://issues.apache.org/jira/browse/SPARK-28549 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 3.0.0 Reporter: Dongjoon Hyun `org.apache.commons.lang3.StringEscapeUtils` is deprecated over two years ago at LANG-1316. {code} /** * Escapes and unescapes {@code String}s for * Java, Java Script, HTML and XML. * * #ThreadSafe# * @since 2.0 * @deprecated as of 3.6, use commons-text * https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html;> * StringEscapeUtils instead */ @Deprecated public class StringEscapeUtils { {code} This issue aims to use the latest one from `commons-text` module which has more bug fixes like TEXT-100, TEXT-118 and TEXT-120. {code} -import org.apache.commons.lang3.StringEscapeUtils +import org.apache.commons.text.StringEscapeUtils {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28549) Use `text.StringEscapeUtils` instead `lang3.StringEscapeUtils`
[ https://issues.apache.org/jira/browse/SPARK-28549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28549: -- Component/s: Build > Use `text.StringEscapeUtils` instead `lang3.StringEscapeUtils` > -- > > Key: SPARK-28549 > URL: https://issues.apache.org/jira/browse/SPARK-28549 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > `org.apache.commons.lang3.StringEscapeUtils` is deprecated over two years ago > at LANG-1316. > {code} > /** > * Escapes and unescapes {@code String}s for > * Java, Java Script, HTML and XML. > * > * #ThreadSafe# > * @since 2.0 > * @deprecated as of 3.6, use commons-text > * href="https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html;> > * StringEscapeUtils instead > */ > @Deprecated > public class StringEscapeUtils { > {code} > This issue aims to use the latest one from `commons-text` module which has > more bug fixes like > TEXT-100, TEXT-118 and TEXT-120. > {code} > -import org.apache.commons.lang3.StringEscapeUtils > +import org.apache.commons.text.StringEscapeUtils > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28548) explain() shows wrong result for persisted DataFrames after some operations
Kousuke Saruta created SPARK-28548: -- Summary: explain() shows wrong result for persisted DataFrames after some operations Key: SPARK-28548 URL: https://issues.apache.org/jira/browse/SPARK-28548 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta After some operations against a Dataset followed by persisting it, Dataset.explain shows a wrong result. One of those operations is explain() itself. Here is an example. {code} val df = spark.range(10) df.explain df.persist df.explain {code} The expected result is as follows. {code} == Physical Plan == *(1) ColumnarToRow +- InMemoryTableScan [id#7L] +- InMemoryRelation [id#7L], StorageLevel(disk, memory, deserialized, 1 replicas) +- *(1) Range (0, 10, step=1, splits=12) {code} But I got this. {code} == Physical Plan == *(1) Range (0, 10, step=1, splits=12) {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
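The expected behavior can be sketched abstractly: `explain` should consult the cache on every call, so a `persist` that happens after an earlier `explain` is still reflected. This is a hypothetical toy model (the `persist`/`explain` functions and plan strings below are invented, not Spark's API); the bug amounts to a stale, memoized plan being shown instead.

```python
# Toy model: explain() must look up the cache manager at call time, not
# reuse a plan computed before persist() was called.

cached_plans = set()

def persist(plan):
    cached_plans.add(plan)

def explain(plan):
    # Correct behavior: check the cache on every call.
    if plan in cached_plans:
        return f"InMemoryTableScan(+- {plan})"
    return plan

plan = "Range(0, 10)"
print(explain(plan))   # Range(0, 10)
persist(plan)
print(explain(plan))   # InMemoryTableScan(+- Range(0, 10))
```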
[jira] [Comment Edited] (SPARK-28036) Built-in udf left/right has inconsistent behavior
[ https://issues.apache.org/jira/browse/SPARK-28036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877765#comment-16877765 ] ShuMing Li edited comment on SPARK-28036 at 7/28/19 2:28 PM: - [~shivuson...@gmail.com] It's not the same with `Postgres`: * In Postgres: ** left({{str}} {{text}}, {{n}} {{int}}): Return first {{n}} characters in the string. When {{n is negative, return all but last |n}}| characters; ** right({{str}} {{text}}, {{n}} {{int}}): Return last {{n}} characters in the string. When {{n}} is negative, return all but first |{{n}}| characters; * In Spark: ** left(str, len) - Returns the leftmost `len`(`len` can be string type) characters from the string `str`,if `len` is less or equal than 0 the result is an empty string; ** right(str, len) - Returns the rightmost `len`(`len` can be string type) characters from the string `str`,if `len` is less or equal than 0 the result is an empty string. They are different when `n`/`len` is negative. So maybe need to change Spark to adapt to Postgres's meaning. was (Author: lishuming): [~shivuson...@gmail.com] It's not the same with `Postgres`: * In Postgres: ** left({{str}} {{text}}, {{n}} {{int}}): Return first {{n}} characters in the string. When {{n }}is negative, return all but last |{{n}}| characters; ** right({{str}} {{text}}, {{n}} {{int}}): Return last {{n}} characters in the string. When {{n}} is negative, return all but first |{{n}}| characters; * In Spark: ** left(str, len) - Returns the leftmost `len`(`len` can be string type) characters from the string `str`,if `len` is less or equal than 0 the result is an empty string; ** right(str, len) - Returns the rightmost `len`(`len` can be string type) characters from the string `str`,if `len` is less or equal than 0 the result is an empty string. They are different when `n`/`len` is negative. So maybe need to change Spark to adapt to Postgres's meaning. 
> Built-in udf left/right has inconsistent behavior > - > > Key: SPARK-28036 > URL: https://issues.apache.org/jira/browse/SPARK-28036 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > PostgreSQL: > {code:sql} > postgres=# select left('ahoj', -2), right('ahoj', -2); > left | right > --+--- > ah | oj > (1 row) > {code} > Spark SQL: > {code:sql} > spark-sql> select left('ahoj', -2), right('ahoj', -2); > spark-sql> > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
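The PostgreSQL semantics the ticket compares against can be written down precisely in a few lines of Python (a reference sketch, not Spark's implementation): a negative length keeps everything except that many characters from the other end, while length 0 yields an empty string.

```python
# PostgreSQL-style left/right, as described in the comment above.

def pg_left(s, n):
    if n == 0:
        return ""
    return s[:n]    # n < 0 drops the last |n| characters

def pg_right(s, n):
    if n == 0:
        return ""
    return s[-n:]   # n > 0 keeps the last n chars; n < 0 drops the first |n|

print(pg_left("ahoj", -2), pg_right("ahoj", -2))  # ah oj
```

Spark, by contrast, returns an empty string for any `len <= 0`, which is the inconsistency being reported.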
[jira] [Assigned] (SPARK-21481) Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF
[ https://issues.apache.org/jira/browse/SPARK-21481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-21481: - Assignee: Huaxin Gao > Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF > - > > Key: SPARK-21481 > URL: https://issues.apache.org/jira/browse/SPARK-21481 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0, 2.2.0 >Reporter: Aseem Bansal >Assignee: Huaxin Gao >Priority: Major > > If we want to find the index of any input based on hashing trick then it is > possible in > https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.mllib.feature.HashingTF > but not in > https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.feature.HashingTF. > Should allow that for feature parity -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21481) Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF
[ https://issues.apache.org/jira/browse/SPARK-21481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21481. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25250 [https://github.com/apache/spark/pull/25250] > Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF > - > > Key: SPARK-21481 > URL: https://issues.apache.org/jira/browse/SPARK-21481 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0, 2.2.0 >Reporter: Aseem Bansal >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > > If we want to find the index of any input based on hashing trick then it is > possible in > https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.mllib.feature.HashingTF > but not in > https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.feature.HashingTF. > Should allow that for feature parity -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
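The `indexOf` being requested is a direct consequence of the hashing trick: a term's index is just its hash reduced modulo the feature dimension, so it can be computed without any fitted vocabulary. The sketch below uses CRC32 purely for a deterministic illustration (Spark's HashingTF actually uses a MurmurHash3 variant); `index_of` and `NUM_FEATURES` are names chosen here, not the library's.

```python
import zlib

# Sketch of the hashing trick behind indexOf: hash the term, then reduce
# modulo the feature dimension. Same term -> same slot, every time.

NUM_FEATURES = 1 << 18  # 262144, HashingTF's default dimension

def index_of(term, num_features=NUM_FEATURES):
    h = zlib.crc32(term.encode("utf-8"))
    return h % num_features

idx = index_of("spark")
print(0 <= idx < NUM_FEATURES)   # True
print(index_of("spark") == idx)  # True: deterministic, no vocabulary needed
```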
[jira] [Updated] (SPARK-28547) Make it work for wide (> 10K columns data)
[ https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] antonkulaga updated SPARK-28547: Description: Spark is super-slow for all wide data (when there are >15kb columns and >15kb rows). Most of the genomics/transcriptomic data is wide because number of genes is usually >20kb and number of samples ass well. Very popular GTEX dataset is a good example ( see for instance RNA-Seq data at https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is just a .tsv file with two comments in the beginning). Everything done in wide tables (even simple "describe" functions applied to all the genes-columns) either takes ours or gets frozen (because of lost executors) irrespective of memory and numbers of cores. While the same operations work well with pure pandas (without any spark involved). f was: Spark is super-slow for all wide data (when there are >15kb columns and >15kb rows). Most of the genomics/transcriptomic data is wide because number of genes is usually >20kb and number of samples ass well. Very popular GTEX dataset is a good example ( see for instance RNA-Seq data at https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is just a .tsv file with two comments in the beginning). Everything done in wide tables either takes ours or gets frozen (because of lost executors) irrespective of memory and numbers of cores. While the same operations work well with pure pandas (without any spark involved). 
f > Make it work for wide (> 10K columns data) > -- > > Key: SPARK-28547 > URL: https://issues.apache.org/jira/browse/SPARK-28547 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4, 2.4.3 > Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per > node, 32 cores (tried different configurations of executors) >Reporter: antonkulaga >Priority: Critical > > Spark is super-slow for all wide data (when there are >15kb columns and >15kb > rows). Most of the genomics/transcriptomic data is wide because number of > genes is usually >20kb and number of samples ass well. Very popular GTEX > dataset is a good example ( see for instance RNA-Seq data at > https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is > just a .tsv file with two comments in the beginning). Everything done in wide > tables (even simple "describe" functions applied to all the genes-columns) > either takes ours or gets frozen (because of lost executors) irrespective of > memory and numbers of cores. While the same operations work well with pure > pandas (without any spark involved). > f -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28547) Make it work for wide (> 10K columns data)
[ https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] antonkulaga updated SPARK-28547: Description: Spark is super-slow for all wide data (when there are >15k columns and >15k rows). Most genomics/transcriptomics data is wide because the number of genes is usually >20k, and the number of samples as well. The very popular GTEx dataset is a good example (see for instance the RNA-Seq data at https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is just a .tsv file with two comment lines at the beginning). Everything done on wide tables (even a simple "describe" function applied to all the gene columns) either takes hours or gets frozen (because of lost executors), irrespective of memory and number of cores, while the same operations work fast (minutes) and well with pure pandas (without any Spark involved). was: Spark is super-slow for all wide data (when there are >15k columns and >15k rows). Most genomics/transcriptomics data is wide because the number of genes is usually >20k, and the number of samples as well. The very popular GTEx dataset is a good example (see for instance the RNA-Seq data at https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is just a .tsv file with two comment lines at the beginning). Everything done on wide tables (even a simple "describe" function applied to all the gene columns) either takes hours or gets frozen (because of lost executors), irrespective of memory and number of cores, while the same operations work well with pure pandas (without any Spark involved).
> Make it work for wide (> 10K columns data) > --- > > Key: SPARK-28547 > URL: https://issues.apache.org/jira/browse/SPARK-28547 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4, 2.4.3 > Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per > node, 32 cores (tried different configurations of executors) >Reporter: antonkulaga >Priority: Critical > > Spark is super-slow for all wide data (when there are >15k columns and >15k > rows). Most genomics/transcriptomics data is wide because the number of > genes is usually >20k, and the number of samples as well. The very popular GTEx > dataset is a good example (see for instance the RNA-Seq data at > https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is > just a .tsv file with two comment lines at the beginning). Everything done on wide > tables (even a simple "describe" function applied to all the gene columns) > either takes hours or gets frozen (because of lost executors), irrespective of > memory and number of cores, while the same operations work fast (minutes) > and well with pure pandas (without any Spark involved).
[jira] [Created] (SPARK-28547) Make it work for wide (> 10K columns data)
antonkulaga created SPARK-28547: --- Summary: Make it work for wide (> 10K columns data) Key: SPARK-28547 URL: https://issues.apache.org/jira/browse/SPARK-28547 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.3, 2.4.4 Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per node, 32 cores (tried different configurations of executors) Reporter: antonkulaga Spark is super-slow for all wide data (when there are >15k columns and >15k rows). Most genomics/transcriptomics data is wide because the number of genes is usually >20k, and the number of samples as well. The very popular GTEx dataset is a good example (see for instance the RNA-Seq data at https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is just a .tsv file with two comment lines at the beginning). Everything done on wide tables either takes hours or gets frozen (because of lost executors), irrespective of memory and number of cores, while the same operations work well with pure pandas (without any Spark involved).
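The pandas baseline the reporter compares against can be sketched in miniature. The following is a hypothetical illustration, not code from the ticket: it builds a wide table of random values standing in for a GTEx-style expression matrix (the real data is roughly 15k samples by 20k gene columns; a smaller shape is used here), then runs `describe()` over every column, the operation the ticket says hangs or takes hours on a Spark DataFrame of the same shape.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a wide expression matrix: rows are samples,
# columns are genes. Real GTEx data is far larger (~15k x ~20k).
n_rows, n_cols = 200, 1000
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.random((n_rows, n_cols)),
    columns=[f"gene_{i}" for i in range(n_cols)],
)

# Summary statistics over all 1000 columns; in pandas this completes in
# well under a second even at much larger widths. describe() returns one
# row each for count, mean, std, min, 25%, 50%, 75%, and max.
stats = df.describe()
print(stats.shape)  # (8, 1000)
```

The contrast drawn in the ticket is that the equivalent per-column summary on a Spark DataFrame with this many columns (e.g. via `DataFrame.describe()`) loses executors or runs for hours, rather than any difference in what the statistics mean.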