[jira] [Commented] (SPARK-28522) Pass dynamic parameters to custom file input format

2019-07-28 Thread Ayan Mukherjee (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894942#comment-16894942
 ] 

Ayan Mukherjee commented on SPARK-28522:


Thanks for the response, but I am trying to pass some non-Hadoop parameters, 
which I need to use in my custom file input format.
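Editorial aside: any string key/value pair can travel through the Hadoop Configuration, whether or not Hadoop itself recognizes the key, so "non-Hadoop" parameters can still be passed that way. Below is a minimal Scala (spark-shell) sketch; "my.custom.param" and the InputFormat-side lookup are hypothetical names. In PySpark, newAPIHadoopFile also accepts a conf dict whose entries end up in the same Configuration.

{code}
// Driver side: put an arbitrary (non-Hadoop) key into the Hadoop Configuration.
sc.hadoopConfiguration.set("my.custom.param", "someValue")

// InputFormat side (inside the custom file input format): the same Configuration
// is available from the task attempt context, e.g.
//   val value = context.getConfiguration.get("my.custom.param")
{code}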

> Pass dynamic parameters to custom file input format
> ---
>
> Key: SPARK-28522
> URL: https://issues.apache.org/jira/browse/SPARK-28522
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.0
>Reporter: Ayan Mukherjee
>Priority: Major
>
> We have developed a custom file input format and are calling it in PySpark using 
> the newAPIHadoopFile option. It appears there is no option to pass parameters 
> dynamically to the custom format.
>  
> rdd2 = sc.newAPIHadoopFile("/abcd/efgh/i1.txt", 
> "com.test1.TEST2.TESTInputFormat", "org.apache.hadoop.io.Text", 
> "org.apache.hadoop.io.NullWritable")



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23160) Port window.sql

2019-07-28 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23160:
--
Summary: Port window.sql  (was: Add window.sql)

> Port window.sql
> ---
>
> Key: SPARK-23160
> URL: https://issues.apache.org/jira/browse/SPARK-23160
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Xingbo Jiang
>Priority: Minor
>
> In this ticket, we plan to add the regression test cases of 
> https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/window.sql.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28086) Adds `random()` sql function

2019-07-28 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894922#comment-16894922
 ] 

Dongjoon Hyun commented on SPARK-28086:
---

This issue is reported twice at [~DylanGuedes]'s PR 
(https://github.com/apache/spark/pull/24881/files#diff-14489bae6b27814d4cde0456a7ae75c8R702)
 and [~yumwang]'s PR 
(https://github.com/apache/spark/pull/25163/files#diff-23a3430e0e1ff88830cbb43701da1f2cR402).

For me, the PostgreSQL random function is the same as Apache Spark's `rand`: a 
uniform random value with 0.0 <= x < 1.0.
- https://www.postgresql.org/docs/8.2/functions-math.html

Spark also accepts `order by rand()`, as in the following.
{code}
spark-sql> SELECT rank() OVER (ORDER BY rank() OVER (ORDER BY rand()));
1
{code}

So, let's make an alias and unblock the other issues. I'll make a PR.
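A quick, hedged sanity check of the uniform-range claim above (assuming a running SparkSession named spark in spark-shell):

{code}
import org.apache.spark.sql.functions.{min, max}

// Sample rand() many times and confirm the values stay within [0.0, 1.0),
// which is the same contract as PostgreSQL's random().
val sample = spark.range(100000).selectExpr("rand() AS r")
sample.agg(min("r"), max("r")).show()
// expect min(r) >= 0.0 and max(r) < 1.0
{code}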

> Adds `random()` sql function
> 
>
> Key: SPARK-28086
> URL: https://issues.apache.org/jira/browse/SPARK-28086
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dylan Guedes
>Priority: Major
>
> Currently, Spark does not have a `random()` function. Postgres, however, does.
> For instance, this one is not valid:
> {code:sql}
> SELECT rank() OVER (ORDER BY rank() OVER (ORDER BY random()))
> {code}
> Because of the `random()` call. On the other hand, [Postgres has 
> it.|https://www.postgresql.org/docs/8.2/functions-math.html]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28522) Pass dynamic parameters to custom file input format

2019-07-28 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894909#comment-16894909
 ] 

Hyukjin Kwon commented on SPARK-28522:
--

{{sc.hadoopConfiguration.set("my.mapreduce.setting","someVal")}}

> Pass dynamic parameters to custom file input format
> ---
>
> Key: SPARK-28522
> URL: https://issues.apache.org/jira/browse/SPARK-28522
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.0
>Reporter: Ayan Mukherjee
>Priority: Major
>
> We have developed a custom file input format and are calling it in PySpark using 
> the newAPIHadoopFile option. It appears there is no option to pass parameters 
> dynamically to the custom format.
>  
> rdd2 = sc.newAPIHadoopFile("/abcd/efgh/i1.txt", 
> "com.test1.TEST2.TESTInputFormat", "org.apache.hadoop.io.Text", 
> "org.apache.hadoop.io.NullWritable")



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28471) Formatting dates with negative years

2019-07-28 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28471.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/25230

> Formatting dates with negative years
> 
>
> Key: SPARK-28471
> URL: https://issues.apache.org/jira/browse/SPARK-28471
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> While converting dates with negative years to strings, Spark skips the era 
> sub-field by default. That can confuse users, since years from the BC era are 
> mirrored into the current era. For example:
> {code}
> spark-sql> select make_date(-44, 3, 15);
> 0045-03-15
> {code}
> Even though negative years are outside the supported range of the DATE type, it 
> would be nice to indicate the era for such dates.
> PostgreSQL outputs the era for such inputs:
> {code}
> # select make_date(-44, 3, 15);
>make_date   
> ---
>  0044-03-15 BC
> (1 row)
> {code}
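Editorial aside (not the change made in the PR above): the era handling can be illustrated directly with java.time; this is only a sketch of the proleptic-year vs era numbering involved, not the exact formatter Spark ended up using.

{code}
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// java.time counts years prolepticly: year 0 is 1 BC, so proleptic year -43 is 44 BC.
// The pattern letter G exposes the era when formatting.
val d = LocalDate.of(-43, 3, 15)
val f = DateTimeFormatter.ofPattern("yyyy-MM-dd G")
println(f.format(d)) // prints something like: 0044-03-15 BC (era text is locale-dependent)
{code}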



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28546) Why does the File Sink operation of Spark 2.4 Structured Streaming include double-level version validation?

2019-07-28 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894900#comment-16894900
 ] 

Hyukjin Kwon commented on SPARK-28546:
--

[~yy3b2007com], questions are better directed to the mailing list. Let's discuss 
there before filing an issue if you're not sure about it.

> Why does the File Sink operation of Spark 2.4 Structured Streaming include 
> double-level version validation?
> ---
>
> Key: SPARK-28546
> URL: https://issues.apache.org/jira/browse/SPARK-28546
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark 2.4
> Structured Streaming
>Reporter: tommy duan
>Priority: Major
>
> My code is as follows:
> {code:java}
> Dataset dataset = this.sparkSession.readStream().format("kafka")
>  .options(this.getSparkKafkaCommonOptions(sparkSession)) 
>  .option("kafka.bootstrap.servers", "192.168.1.1:9092,192.168.1.2:9092")
>  .option("subscribe", "myTopic1,myTopic2")
>  .option("startingOffsets", "earliest")
>  .load();{code}
> {code:java}
> String mdtTempView = "mybasetemp";
>  ExpressionEncoder Rowencoder = this.getSchemaEncoder(new 
> Schema.Parser().parse(baseschema.getValue())); 
>  Dataset parseVal = dataset.select("value").as(Encoders.BINARY())
>  .map(new MapFunction(){
>  
>  }, Rowencoder)
>  .createOrReplaceGlobalTempView(mdtTempView);
>  
>  Dataset queryResult = this.sparkSession.sql("select 。。。 from 
> global_temp." + mdtTempView + " where start_time<>\"\"");
>  String savePath= "/user/dx/streaming/data/testapp"; 
>  String checkpointLocation= "/user/dx/streaming/checkpoint/testapp";
>  StreamingQuery query = queryResult.writeStream().format("parquet")
>  .option("path", savePath)
>  .option("checkpointLocation", checkpointLocation)
>  .partitionBy("month", "day", "hour")
>  .outputMode(OutputMode.Append())
>  .trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES))
>  .start();
> try {
>  query.awaitTermination();
>  } catch (StreamingQueryException e) {
>  e.printStackTrace();
>  }
> {code}
>  
> 1) When I first ran it, the app ran normally.
> 2) Then, for some reason, I deleted the checkpoint directory of the structured 
> streaming query but did not delete the save path of the file sink, which stores the HDFS files.
> 3) Then I restarted the app. This time only the executors were assigned after the app 
> started, and no tasks were assigned. In the log, I found the message: 
> "INFO streaming.FileStreamSink: Skipping already committed batch 72". Later 
> I looked at the source code and found that the log was from 
> [https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L108]
> 4) The situation in 3) lasted for several hours before the DAGScheduler was 
> triggered to divide the DAG, submit stages and tasks, and assign tasks to the 
> executors.
> Later, I read the 
> [https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala]
>  code carefully and realized that FileStreamSink keeps a log under 
> savePath/_spark_metadata; if the current batchId <= log.getLatest(), it skips 
> saving and only writes the log message: logInfo(s"Skipping already 
> committed batch $batchId").
>  
> {code:java}
> class FileStreamSink(
>  sparkSession: SparkSession,
>  path: String,
>  fileFormat: FileFormat,
>  partitionColumnNames: Seq[String],
>  options: Map[String, String]) extends Sink with Logging {
>  private val basePath = new Path(path)
>  private val logPath = new Path(basePath, FileStreamSink.metadataDir)
>  private val fileLog =
>  new FileStreamSinkLog(FileStreamSinkLog.VERSION, sparkSession, 
> logPath.toUri.toString)
>  
>  override def addBatch(batchId: Long, data: DataFrame): Unit = {
>if (batchId <= fileLog.getLatest().map(_._1).getOrElse(-1L)) {
>  logInfo(s"Skipping already committed batch $batchId")
>} else {
>  // save file to hdfs
>}
>  }
>  //...
> }
> {code}
>  
> I think that since a checkpoint is used, all control over this information should 
> be given to the checkpoint, and there should not be a separate batchId log 
> record.
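Editorial note on recovery (not from the thread): the file sink keeps its own commit log under <savePath>/_spark_metadata, so if the checkpoint is deleted on purpose to restart from scratch, that directory has to be removed as well, otherwise addBatch() keeps skipping the old batch ids. A hedged Scala sketch, reusing the paths from the report:

{code}
import org.apache.hadoop.fs.Path

// Remove the file sink's commit log along with the checkpoint when intentionally
// resetting the query. This discards the sink's record of committed batches; use with care.
val savePath = new Path("/user/dx/streaming/data/testapp")
val metadataDir = new Path(savePath, "_spark_metadata")
val fs = metadataDir.getFileSystem(spark.sparkContext.hadoopConfiguration)
if (fs.exists(metadataDir)) fs.delete(metadataDir, true) // recursive delete
{code}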



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28546) Why does the File Sink operation of Spark 2.4 Structured Streaming include double-level version validation?

2019-07-28 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28546.
--
Resolution: Invalid

> Why does the File Sink operation of Spark 2.4 Structured Streaming include 
> double-level version validation?
> ---
>
> Key: SPARK-28546
> URL: https://issues.apache.org/jira/browse/SPARK-28546
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark 2.4
> Structured Streaming
>Reporter: tommy duan
>Priority: Major
>
> My code is as follows:
> {code:java}
> Dataset dataset = this.sparkSession.readStream().format("kafka")
>  .options(this.getSparkKafkaCommonOptions(sparkSession)) 
>  .option("kafka.bootstrap.servers", "192.168.1.1:9092,192.168.1.2:9092")
>  .option("subscribe", "myTopic1,myTopic2")
>  .option("startingOffsets", "earliest")
>  .load();{code}
> {code:java}
> String mdtTempView = "mybasetemp";
>  ExpressionEncoder Rowencoder = this.getSchemaEncoder(new 
> Schema.Parser().parse(baseschema.getValue())); 
>  Dataset parseVal = dataset.select("value").as(Encoders.BINARY())
>  .map(new MapFunction(){
>  
>  }, Rowencoder)
>  .createOrReplaceGlobalTempView(mdtTempView);
>  
>  Dataset queryResult = this.sparkSession.sql("select 。。。 from 
> global_temp." + mdtTempView + " where start_time<>\"\"");
>  String savePath= "/user/dx/streaming/data/testapp"; 
>  String checkpointLocation= "/user/dx/streaming/checkpoint/testapp";
>  StreamingQuery query = queryResult.writeStream().format("parquet")
>  .option("path", savePath)
>  .option("checkpointLocation", checkpointLocation)
>  .partitionBy("month", "day", "hour")
>  .outputMode(OutputMode.Append())
>  .trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES))
>  .start();
> try {
>  query.awaitTermination();
>  } catch (StreamingQueryException e) {
>  e.printStackTrace();
>  }
> {code}
>  
> 1) When I first ran it, the app ran normally.
> 2) Then, for some reason, I deleted the checkpoint directory of the structured 
> streaming query but did not delete the save path of the file sink, which stores the HDFS files.
> 3) Then I restarted the app. This time only the executors were assigned after the app 
> started, and no tasks were assigned. In the log, I found the message: 
> "INFO streaming.FileStreamSink: Skipping already committed batch 72". Later 
> I looked at the source code and found that the log was from 
> [https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L108]
> 4) The situation in 3) lasted for several hours before the DAGScheduler was 
> triggered to divide the DAG, submit stages and tasks, and assign tasks to the 
> executors.
> Later, I read the 
> [https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala]
>  code carefully and realized that FileStreamSink keeps a log under 
> savePath/_spark_metadata; if the current batchId <= log.getLatest(), it skips 
> saving and only writes the log message: logInfo(s"Skipping already 
> committed batch $batchId").
>  
> {code:java}
> class FileStreamSink(
>  sparkSession: SparkSession,
>  path: String,
>  fileFormat: FileFormat,
>  partitionColumnNames: Seq[String],
>  options: Map[String, String]) extends Sink with Logging {
>  private val basePath = new Path(path)
>  private val logPath = new Path(basePath, FileStreamSink.metadataDir)
>  private val fileLog =
>  new FileStreamSinkLog(FileStreamSinkLog.VERSION, sparkSession, 
> logPath.toUri.toString)
>  
>  override def addBatch(batchId: Long, data: DataFrame): Unit = {
>if (batchId <= fileLog.getLatest().map(_._1).getOrElse(-1L)) {
>  logInfo(s"Skipping already committed batch $batchId")
>} else {
>  // save file to hdfs
>}
>  }
>  //...
> }
> {code}
>  
> I think that since a checkpoint is used, all control over this information should 
> be given to the checkpoint, and there should not be a separate batchId log 
> record.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28549) Use `text.StringEscapeUtils` instead `lang3.StringEscapeUtils`

2019-07-28 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28549.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25281
[https://github.com/apache/spark/pull/25281]

> Use `text.StringEscapeUtils` instead `lang3.StringEscapeUtils`
> --
>
> Key: SPARK-28549
> URL: https://issues.apache.org/jira/browse/SPARK-28549
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> `org.apache.commons.lang3.StringEscapeUtils` was deprecated over two years ago 
> in LANG-1316.
> {code}
> /**
>  * Escapes and unescapes {@code String}s for
>  * Java, Java Script, HTML and XML.
>  *
>  * #ThreadSafe#
>  * @since 2.0
>  * @deprecated as of 3.6, use commons-text
>  * <a href="https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html">
>  * StringEscapeUtils</a> instead
>  */
> @Deprecated
> public class StringEscapeUtils {
> {code}
> This issue aims to use the latest one from `commons-text` module which has 
> more bug fixes like 
> TEXT-100, TEXT-118 and TEXT-120.
> {code}
> -import org.apache.commons.lang3.StringEscapeUtils
> +import org.apache.commons.text.StringEscapeUtils
> {code}
> This will add a new dependency to `hadoop-2.7` profile distribution.
> {code}
> +commons-text-1.6.jar
> {code}
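A small, hedged usage check of the replacement class (assuming commons-text is on the classpath, as the added commons-text-1.6.jar above provides):

{code}
import org.apache.commons.text.StringEscapeUtils

// The commons-text class offers the same static escape/unescape helpers for the
// common cases, so the import swap should be behavior-preserving here.
val escaped = StringEscapeUtils.escapeJava("a\tb\"c")
println(escaped)                                  // a\tb\"c
println(StringEscapeUtils.unescapeJava(escaped))  // round-trips to the original
{code}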



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28549) Use `text.StringEscapeUtils` instead `lang3.StringEscapeUtils`

2019-07-28 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28549:


Assignee: Dongjoon Hyun

> Use `text.StringEscapeUtils` instead `lang3.StringEscapeUtils`
> --
>
> Key: SPARK-28549
> URL: https://issues.apache.org/jira/browse/SPARK-28549
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> `org.apache.commons.lang3.StringEscapeUtils` was deprecated over two years ago 
> in LANG-1316.
> {code}
> /**
>  * Escapes and unescapes {@code String}s for
>  * Java, Java Script, HTML and XML.
>  *
>  * #ThreadSafe#
>  * @since 2.0
>  * @deprecated as of 3.6, use commons-text
>  * <a href="https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html">
>  * StringEscapeUtils</a> instead
>  */
> @Deprecated
> public class StringEscapeUtils {
> {code}
> This issue aims to use the latest one from `commons-text` module which has 
> more bug fixes like 
> TEXT-100, TEXT-118 and TEXT-120.
> {code}
> -import org.apache.commons.lang3.StringEscapeUtils
> +import org.apache.commons.text.StringEscapeUtils
> {code}
> This will add a new dependency to `hadoop-2.7` profile distribution.
> {code}
> +commons-text-1.6.jar
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28519) Tests failed on aarch64 due the value of math.log and power function is different

2019-07-28 Thread huangtianhua (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894876#comment-16894876
 ] 

huangtianhua commented on SPARK-28519:
--

Sorry, I didn't see that you had already proposed a PR. Thank you very much.

> Tests failed on aarch64 due the value of math.log and power function is 
> different
> -
>
> Key: SPARK-28519
> URL: https://issues.apache.org/jira/browse/SPARK-28519
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Major
>
> Sorry to disturb again; we ran the unit tests on an arm64 instance, and some 
> other SQL tests failed:
> {code}
>  - pgSQL/float8.sql *** FAILED ***
>  Expected "{color:#f691b2}0.549306144334054[9]{color}", but got 
> "{color:#f691b2}0.549306144334054[8]{color}" Result did not match for query 
> #56
>  SELECT atanh(double('0.5')) (SQLQueryTestSuite.scala:362)
>  - pgSQL/numeric.sql *** FAILED ***
>  Expected "2 {color:#59afe1}2247902679199174[72{color} 
> 224790267919917955.1326161858
>  4 7405685069595001 7405685069594999.0773399947
>  5 5068226527.321263 5068226527.3212726541
>  6 281839893606.99365 281839893606.9937234336
>  7 {color:#d04437}1716699575118595840{color} 1716699575118597095.4233081991
>  8 167361463828.0749 167361463828.0749132007
>  9 {color:#14892c}107511333880051856]{color} 107511333880052007", but got 
> "2 {color:#59afe1}2247902679199174[40{color} 224790267919917955.1326161858
>  4 7405685069595001 7405685069594999.0773399947
>  5 5068226527.321263 5068226527.3212726541
>  6 281839893606.99365 281839893606.9937234336
>  7 {color:#d04437}1716699575118595580{color} 1716699575118597095.4233081991
>  8 167361463828.0749 167361463828.0749132007
>  9 {color:#14892c}107511333880051872]{color} 107511333880052007" Result 
> did not match for query #496
>  SELECT t1.id1, t1.result, t2.expected
>  FROM num_result t1, num_exp_power_10_ln t2
>  WHERE t1.id1 = t2.id
>  AND t1.result != t2.expected (SQLQueryTestSuite.scala:362)
> {code}
> The first test failed, because the value of math.log(3.0) is different on 
> aarch64:
> # on x86_64:
> {code}
> scala> math.log(3.0)
> res50: Double = 1.0986122886681098
> {code}
> # on aarch64:
> {code}
> scala> math.log(3.0)
> res19: Double = 1.0986122886681096
> {code}
> I also tried {{math.log(4.0)}} and {{math.log(5.0)}}, and they are the same; I 
> don't know why {{math.log(3.0)}} is so special, but the result is indeed 
> different on aarch64.
> The second test failed because some values of pow() are different on aarch64. 
> Following the test, I ran checks on aarch64 and x86_64; take '-83028485' 
> as an example:
> # on x86_64:
> {code}
> scala> import java.lang.Math._
> import java.lang.Math._
> scala> abs(-83028485)
> res3: Int = 83028485
> scala> var a = -83028485
> a: Int = -83028485
> scala> abs(a)
> res4: Int = 83028485
> scala> math.log(abs(a))
> res5: Double = 18.234694299654787
> scala> pow(10, math.log(abs(a)))
> res6: Double ={color:#d04437} 1.71669957511859584E18{color}
> {code}
> # on aarch64:
> {code}
> scala> var a = -83028485
> a: Int = -83028485
> scala> abs(a)
> res38: Int = 83028485
> scala> math.log(abs(a))
> res39: Double = 18.234694299654787
> scala> pow(10, math.log(abs(a)))
> res40: Double = 1.71669957511859558E18
> {code}
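Editorial aside (not the fix proposed in the linked PR): java.lang.StrictMath is specified to produce bit-identical results on every platform, whereas Math.log may be replaced by platform intrinsics, so comparing the two is one way to confirm that a last-bit difference comes from the platform rather than from Spark:

{code}
val viaMath   = java.lang.Math.log(3.0)
val viaStrict = java.lang.StrictMath.log(3.0)
println(viaMath)    // may differ in the last bit across architectures
println(viaStrict)  // defined to be reproducible everywhere
// true iff the two results are bit-identical on this platform
println(java.lang.Double.doubleToLongBits(viaMath) == java.lang.Double.doubleToLongBits(viaStrict))
{code}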



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28519) Tests failed on aarch64 due the value of math.log and power function is different

2019-07-28 Thread huangtianhua (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894854#comment-16894854
 ] 

huangtianhua commented on SPARK-28519:
--

Thank you all. I will test with the modification to see whether other similar 
tests fail, and will address them together in one pull request.

> Tests failed on aarch64 due the value of math.log and power function is 
> different
> -
>
> Key: SPARK-28519
> URL: https://issues.apache.org/jira/browse/SPARK-28519
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Major
>
> Sorry to disturb again; we ran the unit tests on an arm64 instance, and some 
> other SQL tests failed:
> {code}
>  - pgSQL/float8.sql *** FAILED ***
>  Expected "{color:#f691b2}0.549306144334054[9]{color}", but got 
> "{color:#f691b2}0.549306144334054[8]{color}" Result did not match for query 
> #56
>  SELECT atanh(double('0.5')) (SQLQueryTestSuite.scala:362)
>  - pgSQL/numeric.sql *** FAILED ***
>  Expected "2 {color:#59afe1}2247902679199174[72{color} 
> 224790267919917955.1326161858
>  4 7405685069595001 7405685069594999.0773399947
>  5 5068226527.321263 5068226527.3212726541
>  6 281839893606.99365 281839893606.9937234336
>  7 {color:#d04437}1716699575118595840{color} 1716699575118597095.4233081991
>  8 167361463828.0749 167361463828.0749132007
>  9 {color:#14892c}107511333880051856]{color} 107511333880052007", but got 
> "2 {color:#59afe1}2247902679199174[40{color} 224790267919917955.1326161858
>  4 7405685069595001 7405685069594999.0773399947
>  5 5068226527.321263 5068226527.3212726541
>  6 281839893606.99365 281839893606.9937234336
>  7 {color:#d04437}1716699575118595580{color} 1716699575118597095.4233081991
>  8 167361463828.0749 167361463828.0749132007
>  9 {color:#14892c}107511333880051872]{color} 107511333880052007" Result 
> did not match for query #496
>  SELECT t1.id1, t1.result, t2.expected
>  FROM num_result t1, num_exp_power_10_ln t2
>  WHERE t1.id1 = t2.id
>  AND t1.result != t2.expected (SQLQueryTestSuite.scala:362)
> {code}
> The first test failed, because the value of math.log(3.0) is different on 
> aarch64:
> # on x86_64:
> {code}
> scala> math.log(3.0)
> res50: Double = 1.0986122886681098
> {code}
> # on aarch64:
> {code}
> scala> math.log(3.0)
> res19: Double = 1.0986122886681096
> {code}
> I also tried {{math.log(4.0)}} and {{math.log(5.0)}}, and they are the same; I 
> don't know why {{math.log(3.0)}} is so special, but the result is indeed 
> different on aarch64.
> The second test failed because some values of pow() are different on aarch64. 
> Following the test, I ran checks on aarch64 and x86_64; take '-83028485' 
> as an example:
> # on x86_64:
> {code}
> scala> import java.lang.Math._
> import java.lang.Math._
> scala> abs(-83028485)
> res3: Int = 83028485
> scala> var a = -83028485
> a: Int = -83028485
> scala> abs(a)
> res4: Int = 83028485
> scala> math.log(abs(a))
> res5: Double = 18.234694299654787
> scala> pow(10, math.log(abs(a)))
> res6: Double ={color:#d04437} 1.71669957511859584E18{color}
> {code}
> # on aarch64:
> {code}
> scala> var a = -83028485
> a: Int = -83028485
> scala> abs(a)
> res38: Int = 83028485
> scala> math.log(abs(a))
> res39: Double = 18.234694299654787
> scala> pow(10, math.log(abs(a)))
> res40: Double = 1.71669957511859558E18
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28519) Tests failed on aarch64 due the value of math.log and power function is different

2019-07-28 Thread huangtianhua (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894854#comment-16894854
 ] 

huangtianhua edited comment on SPARK-28519 at 7/29/19 1:40 AM:
---

Thank you all. I will test with the modification to see whether other similar 
tests fail, and will address them together in one pull request.


was (Author: huangtianhua):
Thank you all. I will test with modification and to see whether there are other 
similar tests fail, and will address them togother in one pull request.

> Tests failed on aarch64 due the value of math.log and power function is 
> different
> -
>
> Key: SPARK-28519
> URL: https://issues.apache.org/jira/browse/SPARK-28519
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Major
>
> Sorry to disturb again; we ran the unit tests on an arm64 instance, and some 
> other SQL tests failed:
> {code}
>  - pgSQL/float8.sql *** FAILED ***
>  Expected "{color:#f691b2}0.549306144334054[9]{color}", but got 
> "{color:#f691b2}0.549306144334054[8]{color}" Result did not match for query 
> #56
>  SELECT atanh(double('0.5')) (SQLQueryTestSuite.scala:362)
>  - pgSQL/numeric.sql *** FAILED ***
>  Expected "2 {color:#59afe1}2247902679199174[72{color} 
> 224790267919917955.1326161858
>  4 7405685069595001 7405685069594999.0773399947
>  5 5068226527.321263 5068226527.3212726541
>  6 281839893606.99365 281839893606.9937234336
>  7 {color:#d04437}1716699575118595840{color} 1716699575118597095.4233081991
>  8 167361463828.0749 167361463828.0749132007
>  9 {color:#14892c}107511333880051856]{color} 107511333880052007", but got 
> "2 {color:#59afe1}2247902679199174[40{color} 224790267919917955.1326161858
>  4 7405685069595001 7405685069594999.0773399947
>  5 5068226527.321263 5068226527.3212726541
>  6 281839893606.99365 281839893606.9937234336
>  7 {color:#d04437}1716699575118595580{color} 1716699575118597095.4233081991
>  8 167361463828.0749 167361463828.0749132007
>  9 {color:#14892c}107511333880051872]{color} 107511333880052007" Result 
> did not match for query #496
>  SELECT t1.id1, t1.result, t2.expected
>  FROM num_result t1, num_exp_power_10_ln t2
>  WHERE t1.id1 = t2.id
>  AND t1.result != t2.expected (SQLQueryTestSuite.scala:362)
> {code}
> The first test failed, because the value of math.log(3.0) is different on 
> aarch64:
> # on x86_64:
> {code}
> scala> math.log(3.0)
> res50: Double = 1.0986122886681098
> {code}
> # on aarch64:
> {code}
> scala> math.log(3.0)
> res19: Double = 1.0986122886681096
> {code}
> I also tried {{math.log(4.0)}} and {{math.log(5.0)}}, and they are the same; I 
> don't know why {{math.log(3.0)}} is so special, but the result is indeed 
> different on aarch64.
> The second test failed because some values of pow() are different on aarch64. 
> Following the test, I ran checks on aarch64 and x86_64; take '-83028485' 
> as an example:
> # on x86_64:
> {code}
> scala> import java.lang.Math._
> import java.lang.Math._
> scala> abs(-83028485)
> res3: Int = 83028485
> scala> var a = -83028485
> a: Int = -83028485
> scala> abs(a)
> res4: Int = 83028485
> scala> math.log(abs(a))
> res5: Double = 18.234694299654787
> scala> pow(10, math.log(abs(a)))
> res6: Double ={color:#d04437} 1.71669957511859584E18{color}
> {code}
> # on aarch64:
> {code}
> scala> var a = -83028485
> a: Int = -83028485
> scala> abs(a)
> res38: Int = 83028485
> scala> math.log(abs(a))
> res39: Double = 18.234694299654787
> scala> pow(10, math.log(abs(a)))
> res40: Double = 1.71669957511859558E18
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-07-28 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-28547.
--
Resolution: Invalid

> Make it work for wide (> 10K columns data)
> --
>
> Key: SPARK-28547
> URL: https://issues.apache.org/jira/browse/SPARK-28547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.3
> Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per 
> node, 32 cores (tried different configurations of executors)
>Reporter: antonkulaga
>Priority: Critical
>
> Spark is super-slow for all wide data (when there are >15k columns and >15k 
> rows). Most genomics/transcriptomics data is wide because the number of 
> genes is usually >20k, and the number of samples as well. The very popular GTEx 
> dataset is a good example (see for instance the RNA-Seq data at 
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data, where a gct file is 
> just a .tsv file with two comment lines at the beginning). Everything done on wide 
> tables (even simple "describe" functions applied to all the gene columns) 
> either takes hours or gets frozen (because of lost executors), irrespective of 
> memory and number of cores, while the same operations work fast (minutes) 
> and well with pure pandas (without any Spark involved).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-07-28 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894839#comment-16894839
 ] 

Takeshi Yamamuro commented on SPARK-28547:
--

You need to ask on the dev mailing list first to narrow down the issue. We can 
do nothing based on the current description.

> Make it work for wide (> 10K columns data)
> --
>
> Key: SPARK-28547
> URL: https://issues.apache.org/jira/browse/SPARK-28547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.3
> Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per 
> node, 32 cores (tried different configurations of executors)
>Reporter: antonkulaga
>Priority: Critical
>
> Spark is super-slow for all wide data (when there are >15k columns and >15k 
> rows). Most genomics/transcriptomics data is wide because the number of 
> genes is usually >20k, and the number of samples as well. The very popular GTEx 
> dataset is a good example (see for instance the RNA-Seq data at 
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data, where a gct file is 
> just a .tsv file with two comment lines at the beginning). Everything done on wide 
> tables (even simple "describe" functions applied to all the gene columns) 
> either takes hours or gets frozen (because of lost executors), irrespective of 
> memory and number of cores, while the same operations work fast (minutes) 
> and well with pure pandas (without any Spark involved).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28520) WholeStageCodegen does not work property for LocalTableScanExec

2019-07-28 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-28520.
--
  Resolution: Fixed
   Fix Version/s: 3.0.0
Target Version/s:   (was: 3.0.0)

Resolved by 
[https://github.com/apache/spark/pull/25260|https://github.com/apache/spark/pull/25260#issuecomment-515752501]

> WholeStageCodegen does not work property for LocalTableScanExec
> ---
>
> Key: SPARK-28520
> URL: https://issues.apache.org/jira/browse/SPARK-28520
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.0.0
>
>
> Code is not generated for LocalTableScanExec even in situations where it should be.
> If a LocalTableScanExec plan has a direct parent plan that supports 
> WholeStageCodegen,
> the LocalTableScanExec plan should also be within the WholeStageCodegen domain.
> But for now, code is not generated for LocalTableScanExec and an InputAdapter is 
> inserted instead.
> {code}
> val df1 = spark.createDataset(1 to 10).toDF
> val df2 = spark.createDataset(1 to 10).toDF
> val df3 = df1.join(df2, df1("value") === df2("value"))
> df3.explain(true)
> ...
> == Physical Plan ==
> *(1) BroadcastHashJoin [value#1], [value#6], Inner, BuildRight
> :- LocalTableScan [value#1] // 
> LocalTableScanExec is not within a WholeStageCodegen domain
> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, 
> false] as bigint)))
>+- LocalTableScan [value#6]
> {code}
> {code}
> scala> df3.queryExecution.executedPlan.children.head.children.head.getClass
> res4: Class[_ <: org.apache.spark.sql.execution.SparkPlan] = class 
> org.apache.spark.sql.execution.InputAdapter
> {code}
> In the current implementation of LocalTableScanExec, codegen is enabled when 
> `parent` is not null,
> but `parent` is set in `consume`, which is called after `insertInputAdapter`, 
> so it doesn't work as intended.
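A hedged way to inspect this (reusing the df3 from the snippet above) is to collect the WholeStageCodegen subtrees of the executed plan and check whether the LocalTableScan appears inside one of them:

{code}
import org.apache.spark.sql.execution.WholeStageCodegenExec

// Print every whole-stage-codegen subtree; before the fix the streamed-side
// LocalTableScan shows up outside of them, behind an InputAdapter.
val codegenSubtrees = df3.queryExecution.executedPlan.collect {
  case w: WholeStageCodegenExec => w
}
codegenSubtrees.foreach(w => println(w.treeString))
{code}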



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25474) Support `spark.sql.statistics.fallBackToHdfs` in data source tables

2019-07-28 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25474:
-

Assignee: shahid

> Support `spark.sql.statistics.fallBackToHdfs` in data source tables
> ---
>
> Key: SPARK-25474
> URL: https://issues.apache.org/jira/browse/SPARK-25474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
> Environment: Spark 2.3.1
> Hadoop 2.7.2
>Reporter: Ayush Anubhava
>Assignee: shahid
>Priority: Major
> Fix For: 3.0.0
>
>
> *Description:* The size in bytes of the query comes out in EB (exabytes) for the 
> parquet data source. This impacts performance, since join queries would 
> always go as Sort Merge Join.
> *Precondition :* spark.sql.statistics.fallBackToHdfs = true
> Steps:
> {code:java}
> 0: jdbc:hive2://10.xx:23040/default> create table t1110 (a int, b string) 
> using parquet PARTITIONED BY (b) ;
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (2,'b');
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (1,'a');
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.xx.xx:23040/default> select * from t1110;
> +++--+
> | a | b |
> +++--+
> | 1 | a |
> | 2 | b |
> +++--+
> {code}
> *{color:#d04437}Cost of the query shows sizeInBytes in EB{color}*
> {code:java}
>  explain cost select * from t1110;
> | == Optimized Logical Plan ==
> Relation[a#23,b#24] parquet, Statistics(sizeInBytes=8.0 EB, hints=none)
> == Physical Plan ==
> *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: Parquet, 
> Location: 
> CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], 
> PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct |
> {code}
> *{color:#d04437}This would lead to Sort Merge Join in case of join 
> query{color}*
> {code:java}
> 0: jdbc:hive2://10.xx.xx:23040/default> create table t110 (a int, b string) 
> using parquet PARTITIONED BY (b) ;
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.xx.xx:23040/default> insert into t110 values (1,'a');
> +-+--+
> | Result |
> +-+--+
> +-+--+
>  explain select * from t1110 t1 join t110 t2 on t1.a=t2.a;
> | == Physical Plan ==
> *(5) SortMergeJoin [a#23], [a#55], Inner
> :- *(2) Sort [a#23 ASC NULLS FIRST], false, 0
> : +- Exchange hashpartitioning(a#23, 200)
> : +- *(1) Project [a#23, b#24]
> : +- *(1) Filter isnotnull(a#23)
> : +- *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: 
> Parquet, Location: 
> CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], 
> PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(a)], 
> ReadSchema: struct
> +- *(4) Sort [a#55 ASC NULLS FIRST], false, 0
> +- Exchange hashpartitioning(a#55, 200)
> +- *(3) Project [a#55, b#56]
> +- *(3) Filter isnotnull(a#55)
> +- *(3) FileScan parquet open.t110[a#55,b#56] Batched: true, Format: Parquet, 
> Location: 
> CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t110], 
> PartitionCount: 1, PartitionFilters: [], PushedFilters: [IsNotNull(a)], 
> ReadSchema: struct |
> {code}
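A hedged sketch of exercising the configuration named in the title against the tables created above (spark.sql.statistics.fallBackToHdfs is an existing SQL conf; with it enabled, and with the fix from this ticket, the relation size should be estimated from HDFS instead of defaulting to 8.0 EB):

{code}
// Enable the fallback and re-check the estimated statistics for the parquet table.
spark.conf.set("spark.sql.statistics.fallBackToHdfs", "true")
spark.sql("EXPLAIN COST SELECT * FROM t1110").show(truncate = false)
// The Optimized Logical Plan should now report a realistic sizeInBytes,
// allowing a broadcast join instead of a Sort Merge Join for small tables.
{code}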



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25474) Support `spark.sql.statistics.fallBackToHdfs` in data source tables

2019-07-28 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25474.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/22502.

> Support `spark.sql.statistics.fallBackToHdfs` in data source tables
> ---
>
> Key: SPARK-25474
> URL: https://issues.apache.org/jira/browse/SPARK-25474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
> Environment: Spark 2.3.1
> Hadoop 2.7.2
>Reporter: Ayush Anubhava
>Priority: Major
> Fix For: 3.0.0
>
>
> *Description:* The size in bytes of the query comes out in EB (exabytes) for the 
> parquet data source. This impacts performance, since join queries would 
> always go as Sort Merge Join.
> *Precondition :* spark.sql.statistics.fallBackToHdfs = true
> Steps:
> {code:java}
> 0: jdbc:hive2://10.xx:23040/default> create table t1110 (a int, b string) 
> using parquet PARTITIONED BY (b) ;
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (2,'b');
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (1,'a');
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.xx.xx:23040/default> select * from t1110;
> +++--+
> | a | b |
> +++--+
> | 1 | a |
> | 2 | b |
> +++--+
> {code}
> *{color:#d04437}Cost of the query shows sizeInBytes in EB{color}*
> {code:java}
>  explain cost select * from t1110;
> | == Optimized Logical Plan ==
> Relation[a#23,b#24] parquet, Statistics(sizeInBytes=8.0 EB, hints=none)
> == Physical Plan ==
> *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: Parquet, 
> Location: 
> CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], 
> PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct |
> {code}
> *{color:#d04437}This would lead to Sort Merge Join in case of join 
> query{color}*
> {code:java}
> 0: jdbc:hive2://10.xx.xx:23040/default> create table t110 (a int, b string) 
> using parquet PARTITIONED BY (b) ;
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.xx.xx:23040/default> insert into t110 values (1,'a');
> +-+--+
> | Result |
> +-+--+
> +-+--+
>  explain select * from t1110 t1 join t110 t2 on t1.a=t2.a;
> | == Physical Plan ==
> *(5) SortMergeJoin [a#23], [a#55], Inner
> :- *(2) Sort [a#23 ASC NULLS FIRST], false, 0
> : +- Exchange hashpartitioning(a#23, 200)
> : +- *(1) Project [a#23, b#24]
> : +- *(1) Filter isnotnull(a#23)
> : +- *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: 
> Parquet, Location: 
> CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], 
> PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(a)], 
> ReadSchema: struct
> +- *(4) Sort [a#55 ASC NULLS FIRST], false, 0
> +- Exchange hashpartitioning(a#55, 200)
> +- *(3) Project [a#55, b#56]
> +- *(3) Filter isnotnull(a#55)
> +- *(3) FileScan parquet open.t110[a#55,b#56] Batched: true, Format: Parquet, 
> Location: 
> CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t110], 
> PartitionCount: 1, PartitionFilters: [], PushedFilters: [IsNotNull(a)], 
> ReadSchema: struct |
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25474) Support `spark.sql.statistics.fallBackToHdfs` in data source tables

2019-07-28 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25474:
--
Summary: Support `spark.sql.statistics.fallBackToHdfs` in data source 
tables  (was: Size in bytes of the query is coming in EB in case of parquet 
datasource)

> Support `spark.sql.statistics.fallBackToHdfs` in data source tables
> ---
>
> Key: SPARK-25474
> URL: https://issues.apache.org/jira/browse/SPARK-25474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
> Environment: Spark 2.3.1
> Hadoop 2.7.2
>Reporter: Ayush Anubhava
>Priority: Major
>
> *Description:* The size in bytes of the query comes out in EB (exabytes) for the 
> parquet data source. This impacts performance, since join queries would 
> always go as Sort Merge Join.
> *Precondition :* spark.sql.statistics.fallBackToHdfs = true
> Steps:
> {code:java}
> 0: jdbc:hive2://10.xx:23040/default> create table t1110 (a int, b string) 
> using parquet PARTITIONED BY (b) ;
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (2,'b');
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.1xx:23040/default> insert into t1110 values (1,'a');
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.xx.xx:23040/default> select * from t1110;
> +++--+
> | a | b |
> +++--+
> | 1 | a |
> | 2 | b |
> +++--+
> {code}
> *{color:#d04437}Cost of the query shows sizeInBytes in EB{color}*
> {code:java}
>  explain cost select * from t1110;
> | == Optimized Logical Plan ==
> Relation[a#23,b#24] parquet, Statistics(sizeInBytes=8.0 EB, hints=none)
> == Physical Plan ==
> *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: Parquet, 
> Location: 
> CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], 
> PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct |
> {code}
> *{color:#d04437}This would lead to Sort Merge Join in case of join 
> query{color}*
> {code:java}
> 0: jdbc:hive2://10.xx.xx:23040/default> create table t110 (a int, b string) 
> using parquet PARTITIONED BY (b) ;
> +-+--+
> | Result |
> +-+--+
> +-+--+
> 0: jdbc:hive2://10.xx.xx:23040/default> insert into t110 values (1,'a');
> +-+--+
> | Result |
> +-+--+
> +-+--+
>  explain select * from t1110 t1 join t110 t2 on t1.a=t2.a;
> | == Physical Plan ==
> *(5) SortMergeJoin [a#23], [a#55], Inner
> :- *(2) Sort [a#23 ASC NULLS FIRST], false, 0
> : +- Exchange hashpartitioning(a#23, 200)
> : +- *(1) Project [a#23, b#24]
> : +- *(1) Filter isnotnull(a#23)
> : +- *(1) FileScan parquet open.t1110[a#23,b#24] Batched: true, Format: 
> Parquet, Location: 
> CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t1110], 
> PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(a)], 
> ReadSchema: struct
> +- *(4) Sort [a#55 ASC NULLS FIRST], false, 0
> +- Exchange hashpartitioning(a#55, 200)
> +- *(3) Project [a#55, b#56]
> +- *(3) Filter isnotnull(a#55)
> +- *(3) FileScan parquet open.t110[a#55,b#56] Batched: true, Format: Parquet, 
> Location: 
> CatalogFileIndex[hdfs://hacluster/user/sparkhive/warehouse/open.db/t110], 
> PartitionCount: 1, PartitionFilters: [], PushedFilters: [IsNotNull(a)], 
> ReadSchema: struct |
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28306) Once optimizer rule NormalizeFloatingNumbers is not idempotent

2019-07-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28306:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-28528

> Once optimizer rule NormalizeFloatingNumbers is not idempotent
> --
>
> Key: SPARK-28306
> URL: https://issues.apache.org/jira/browse/SPARK-28306
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Assignee: Yesheng Ma
>Priority: Major
> Fix For: 3.0.0
>
>
> When the rule NormalizeFloatingNumbers is called multiple times, it will add an 
> additional transform operator to an expression, which is not appropriate. To 
> fix it, we have to make it idempotent, i.e. yield the same logical plan 
> regardless of how many times it runs.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28237) Idempotence checker for Idempotent batches in RuleExecutors

2019-07-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28237:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-28528

> Idempotence checker for Idempotent batches in RuleExecutors
> ---
>
> Key: SPARK-28237
> URL: https://issues.apache.org/jira/browse/SPARK-28237
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Assignee: Yesheng Ma
>Priority: Major
> Fix For: 3.0.0
>
>
> The current {{RuleExecutor}} system contains two kinds of strategies: 
> {{Once}} and {{FixedPoint}}. The {{Once}} strategy is supposed to run once. 
> However, particular rules (e.g. PullOutNondeterministic) are 
> designed to be idempotent, but Spark currently lacks a corresponding mechanism 
> to prevent this kind of non-idempotent behavior from happening.
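A hedged sketch of the kind of check being described, not the actual checker added by this ticket: a rule meant for a Once batch is idempotent if applying it a second time leaves the plan unchanged.

{code}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Returns true if a second application of `rule` is a no-op on `plan`.
def isIdempotent(rule: Rule[LogicalPlan], plan: LogicalPlan): Boolean = {
  val once  = rule(plan)
  val twice = rule(once)
  once.fastEquals(twice)
}
{code}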



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28377) Fully support correlation names in the FROM clause

2019-07-28 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894788#comment-16894788
 ] 

Dongjoon Hyun commented on SPARK-28377:
---

You can increase the priority if you want, [~yumwang].

> Fully support correlation names in the FROM clause
> --
>
> Key: SPARK-28377
> URL: https://issues.apache.org/jira/browse/SPARK-28377
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Specifying a list of column names is not fully supported. Example:
> {code:sql}
> create or replace temporary view J1_TBL as select * from
>  (values (1, 4, 'one'), (2, 3, 'two'))
>  as v(i, j, t);
> create or replace temporary view J2_TBL as select * from
>  (values (1, -1), (2, 2))
>  as v(i, k);
> SELECT '' AS xxx, t1.a, t2.e
>   FROM J1_TBL t1 (a, b, c), J2_TBL t2 (d, e)
>   WHERE t1.a = t2.d;
> {code}
> PostgreSQL:
> {noformat}
> postgres=# SELECT '' AS xxx, t1.a, t2.e
> postgres-#   FROM J1_TBL t1 (a, b, c), J2_TBL t2 (d, e)
> postgres-#   WHERE t1.a = t2.d;
>  xxx | a | e
> -+---+
>  | 1 | -1
>  | 2 |  2
> (2 rows)
> {noformat}
> Spark SQL:
> {noformat}
> spark-sql> SELECT '' AS xxx, t1.a, t2.e
>  >   FROM J1_TBL t1 (a, b, c), J2_TBL t2 (d, e)
>  >   WHERE t1.a = t2.d;
> Error in query: cannot resolve '`t1.a`' given input columns: [a, b, c, d, e]; 
> line 3 pos 8;
> 'Project [ AS xxx#21, 't1.a, 't2.e]
> +- 'Filter ('t1.a = 't2.d)
>+- Join Inner
>   :- Project [i#14 AS a#22, j#15 AS b#23, t#16 AS c#24]
>   :  +- SubqueryAlias `t1`
>   : +- SubqueryAlias `j1_tbl`
>   :+- Project [i#14, j#15, t#16]
>   :   +- Project [col1#11 AS i#14, col2#12 AS j#15, col3#13 AS 
> t#16]
>   :  +- SubqueryAlias `v`
>   : +- LocalRelation [col1#11, col2#12, col3#13]
>   +- Project [i#19 AS d#25, k#20 AS e#26]
>  +- SubqueryAlias `t2`
> +- SubqueryAlias `j2_tbl`
>+- Project [i#19, k#20]
>   +- Project [col1#17 AS i#19, col2#18 AS k#20]
>  +- SubqueryAlias `v`
> +- LocalRelation [col1#17, col2#18]
> {noformat}
>  
> *Feature ID*: E051-08
> [https://www.postgresql.org/docs/11/sql-expressions.html]
> [https://www.ibm.com/support/knowledgecenter/en/SSEPEK_10.0.0/sqlref/src/tpc/db2z_correlationnames.html]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28549) Use `text.StringEscapeUtils` instead `lang3.StringEscapeUtils`

2019-07-28 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28549:
--
Description: 
`org.apache.commons.lang3.StringEscapeUtils` was deprecated over two years ago 
in LANG-1316.
{code}
/**
 * Escapes and unescapes {@code String}s for
 * Java, Java Script, HTML and XML.
 *
 * #ThreadSafe#
 * @since 2.0
 * @deprecated as of 3.6, use commons-text
 * <a href="https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html">
 * StringEscapeUtils</a> instead
 */
@Deprecated
public class StringEscapeUtils {
{code}

This issue aims to use the latest one from `commons-text` module which has more 
bug fixes like 
TEXT-100, TEXT-118 and TEXT-120.
{code}
-import org.apache.commons.lang3.StringEscapeUtils
+import org.apache.commons.text.StringEscapeUtils
{code}

This will add a new dependency to `hadoop-2.7` profile distribution.
{code}
+commons-text-1.6.jar
{code}

  was:
`org.apache.commons.lang3.StringEscapeUtils` was deprecated over two years ago 
in LANG-1316.
{code}
/**
 * Escapes and unescapes {@code String}s for
 * Java, Java Script, HTML and XML.
 *
 * #ThreadSafe#
 * @since 2.0
 * @deprecated as of 3.6, use commons-text
 * <a href="https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html">
 * StringEscapeUtils</a> instead
 */
@Deprecated
public class StringEscapeUtils {
{code}

This issue aims to use the latest one from `commons-text` module which has more 
bug fixes like 
TEXT-100, TEXT-118 and TEXT-120.
{code}
-import org.apache.commons.lang3.StringEscapeUtils
+import org.apache.commons.text.StringEscapeUtils
{code}




> Use `text.StringEscapeUtils` instead `lang3.StringEscapeUtils`
> --
>
> Key: SPARK-28549
> URL: https://issues.apache.org/jira/browse/SPARK-28549
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> `org.apache.commons.lang3.StringEscapeUtils` was deprecated over two years ago 
> in LANG-1316.
> {code}
> /**
>  * Escapes and unescapes {@code String}s for
>  * Java, Java Script, HTML and XML.
>  *
>  * #ThreadSafe#
>  * @since 2.0
>  * @deprecated as of 3.6, use commons-text
>  * <a href="https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html">
>  * StringEscapeUtils</a> instead
>  */
> @Deprecated
> public class StringEscapeUtils {
> {code}
> This issue aims to use the latest one from `commons-text` module which has 
> more bug fixes like 
> TEXT-100, TEXT-118 and TEXT-120.
> {code}
> -import org.apache.commons.lang3.StringEscapeUtils
> +import org.apache.commons.text.StringEscapeUtils
> {code}
> This will add a new dependency to `hadoop-2.7` profile distribution.
> {code}
> +commons-text-1.6.jar
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28549) Use `text.StringEscapeUtils` instead `lang3.StringEscapeUtils`

2019-07-28 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-28549:
-

 Summary: Use `text.StringEscapeUtils` instead 
`lang3.StringEscapeUtils`
 Key: SPARK-28549
 URL: https://issues.apache.org/jira/browse/SPARK-28549
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


`org.apache.commons.lang3.StringEscapeUtils` was deprecated over two years ago 
in LANG-1316.
{code}
/**
 * Escapes and unescapes {@code String}s for
 * Java, Java Script, HTML and XML.
 *
 * #ThreadSafe#
 * @since 2.0
 * @deprecated as of 3.6, use commons-text
 * <a href="https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html">
 * StringEscapeUtils</a> instead
 */
@Deprecated
public class StringEscapeUtils {
{code}

This issue aims to use the latest one from `commons-text` module which has more 
bug fixes like 
TEXT-100, TEXT-118 and TEXT-120.
{code}
-import org.apache.commons.lang3.StringEscapeUtils
+import org.apache.commons.text.StringEscapeUtils
{code}





--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28549) Use `text.StringEscapeUtils` instead `lang3.StringEscapeUtils`

2019-07-28 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28549:
--
Component/s: Build

> Use `text.StringEscapeUtils` instead `lang3.StringEscapeUtils`
> --
>
> Key: SPARK-28549
> URL: https://issues.apache.org/jira/browse/SPARK-28549
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> `org.apache.commons.lang3.StringEscapeUtils` was deprecated over two years ago 
> in LANG-1316.
> {code}
> /**
>  * Escapes and unescapes {@code String}s for
>  * Java, Java Script, HTML and XML.
>  *
>  * #ThreadSafe#
>  * @since 2.0
>  * @deprecated as of 3.6, use commons-text
>  * <a href="https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html">
>  * StringEscapeUtils</a> instead
>  */
> @Deprecated
> public class StringEscapeUtils {
> {code}
> This issue aims to use the latest one from the `commons-text` module, which has 
> more bug fixes such as TEXT-100, TEXT-118 and TEXT-120.
> {code}
> -import org.apache.commons.lang3.StringEscapeUtils
> +import org.apache.commons.text.StringEscapeUtils
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28548) explain() shows wrong result for persisted DataFrames after some operations

2019-07-28 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-28548:
--

 Summary: explain() shows wrong result for persisted DataFrames 
after some operations
 Key: SPARK-28548
 URL: https://issues.apache.org/jira/browse/SPARK-28548
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


After performing certain operations on a Dataset and then persisting it, 
Dataset.explain shows the wrong result.
One of those operations is explain() itself.
Here is an example.

{code}
val df = spark.range(10)
df.explain
df.persist
df.explain
{code}

Expected result is like as follows.
{code}
== Physical Plan ==
*(1) ColumnarToRow
+- InMemoryTableScan [id#7L]
      +- InMemoryRelation [id#7L], StorageLevel(disk, memory, deserialized, 1 replicas)
            +- *(1) Range (0, 10, step=1, splits=12)
{code}

But I got this.
{code}
== Physical Plan ==
*(1) Range (0, 10, step=1, splits=12)
{code}
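
A quick way to see that the stale plan is tied to the already materialized Dataset (a sketch, not from the report): after persisting, a freshly built Dataset over the same range should consult the cache manager and pick up the InMemoryRelation, while the original Dataset keeps printing its pre-persist plan.
{code}
val df = spark.range(10)
df.explain()            // materializes and memoizes the physical plan
df.persist()
df.explain()            // bug: still prints the old plan, e.g. *(1) Range (0, 10, ...)

// a new Dataset is re-planned, so its explain output is expected to show
// InMemoryTableScan / InMemoryRelation over the cached data
spark.range(10).explain()
{code}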



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28036) Built-in udf left/right has inconsistent behavior

2019-07-28 Thread ShuMing Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877765#comment-16877765
 ] 

ShuMing Li edited comment on SPARK-28036 at 7/28/19 2:28 PM:
-

[~shivuson...@gmail.com] It's not the same as in Postgres:
 * In Postgres:
 ** left({{str}} {{text}}, {{n}} {{int}}): returns the first {{n}} characters of the 
string. When {{n}} is negative, returns all but the last |{{n}}| characters;
 ** right({{str}} {{text}}, {{n}} {{int}}): returns the last {{n}} characters of the 
string. When {{n}} is negative, returns all but the first |{{n}}| characters;
 * In Spark:
 ** left(str, len) - returns the leftmost `len` characters (`len` can be a string 
type) from the string `str`; if `len` is less than or equal to 0, the result is 
an empty string;
 ** right(str, len) - returns the rightmost `len` characters (`len` can be a string 
type) from the string `str`; if `len` is less than or equal to 0, the result is 
an empty string.

They differ when `n`/`len` is negative, so Spark may need to change to match 
Postgres's behavior (see the sketch below).
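
To make those Postgres semantics concrete, here is a plain-Scala sketch (pgLeft/pgRight are hypothetical helper names, not existing Spark functions):
{code}
// Hypothetical helpers illustrating PostgreSQL's negative-length semantics;
// Spark's built-in left/right currently return "" whenever len <= 0.
def pgLeft(str: String, n: Int): String =
  if (n >= 0) str.take(n) else str.dropRight(-n)   // drop the last |n| chars

def pgRight(str: String, n: Int): String =
  if (n >= 0) str.takeRight(n) else str.drop(-n)   // drop the first |n| chars

assert(pgLeft("ahoj", -2) == "ah")    // matches postgres: left('ahoj', -2)
assert(pgRight("ahoj", -2) == "oj")   // matches postgres: right('ahoj', -2)
{code}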


was (Author: lishuming):
[~shivuson...@gmail.com] It's not the same with `Postgres`:
 * In Postgres:
 ** left({{str}} {{text}}, {{n}} {{int}}): Return first {{n}} characters in the 
string. When {{n }}is negative, return all but last |{{n}}| characters;
 ** right({{str}} {{text}}, {{n}} {{int}}): Return last {{n}} characters in the 
string. When {{n}} is negative, return all but first |{{n}}| characters;
 * In Spark:
 ** left(str, len) - Returns the leftmost `len`(`len` can be string type) 
characters from the string `str`,if `len` is less or equal than 0 the result is 
an empty string;
 ** right(str, len) - Returns the rightmost `len`(`len` can be string type) 
characters from the string `str`,if `len` is less or equal than 0 the result is 
an empty string. 

They are different when `n`/`len` is negative. So maybe need to change Spark to 
adapt  to Postgres's meaning.

> Built-in udf left/right has inconsistent behavior
> -
>
> Key: SPARK-28036
> URL: https://issues.apache.org/jira/browse/SPARK-28036
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> PostgreSQL:
> {code:sql}
> postgres=# select left('ahoj', -2), right('ahoj', -2);
>  left | right 
> --+---
>  ah   | oj
> (1 row)
> {code}
> Spark SQL:
> {code:sql}
> spark-sql> select left('ahoj', -2), right('ahoj', -2);
> spark-sql>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21481) Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF

2019-07-28 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-21481:
-

Assignee: Huaxin Gao

> Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF
> -
>
> Key: SPARK-21481
> URL: https://issues.apache.org/jira/browse/SPARK-21481
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Aseem Bansal
>Assignee: Huaxin Gao
>Priority: Major
>
> If we want to find the index of any input based on the hashing trick, it is 
> possible in 
> https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.mllib.feature.HashingTF
> but not in 
> https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.feature.HashingTF.
> We should allow that for feature parity.
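
For context, a minimal sketch of the mllib-side call that this issue asks to mirror in ml.feature.HashingTF (the feature count below is an arbitrary assumption):
{code}
import org.apache.spark.mllib.feature.HashingTF

// mllib already exposes the term -> bucket mapping directly:
val tf = new HashingTF(1 << 18)      // 2^18 feature buckets (assumed size)
println(tf.indexOf("spark"))         // index of the bucket the term hashes to
{code}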



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21481) Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF

2019-07-28 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21481.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25250
[https://github.com/apache/spark/pull/25250]

> Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF
> -
>
> Key: SPARK-21481
> URL: https://issues.apache.org/jira/browse/SPARK-21481
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Aseem Bansal
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>
> If we want to find the index of any input based on the hashing trick, it is 
> possible in 
> https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.mllib.feature.HashingTF
> but not in 
> https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.feature.HashingTF.
> We should allow that for feature parity.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-07-28 Thread antonkulaga (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

antonkulaga updated SPARK-28547:

Description: 
Spark is super-slow for all wide data (when there are >15k columns and >15k 
rows). Most genomics/transcriptomics data is wide because the number of genes 
is usually >20k and the number of samples as well. The very popular GTEx 
dataset is a good example (see for instance the RNA-Seq data at 
https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is just 
a .tsv file with two comment lines at the beginning). Everything done on wide 
tables (even simple "describe" functions applied to all the gene columns) 
either takes hours or gets frozen (because of lost executors), irrespective of 
memory and number of cores, while the same operations work well with pure 
pandas (without any Spark involved).

  was:
Spark is super-slow for all wide data (when there are >15k columns and >15k 
rows). Most genomics/transcriptomics data is wide because the number of genes 
is usually >20k and the number of samples as well. The very popular GTEx 
dataset is a good example (see for instance the RNA-Seq data at 
https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is just 
a .tsv file with two comment lines at the beginning). Everything done on wide 
tables either takes hours or gets frozen (because of lost executors), 
irrespective of memory and number of cores, while the same operations work well 
with pure pandas (without any Spark involved).


> Make it work for wide (> 10K columns data)
> --
>
> Key: SPARK-28547
> URL: https://issues.apache.org/jira/browse/SPARK-28547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.3
> Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per 
> node, 32 cores (tried different configurations of executors)
>Reporter: antonkulaga
>Priority: Critical
>
> Spark is super-slow for all wide data (when there are >15k columns and >15k 
> rows). Most genomics/transcriptomics data is wide because the number of 
> genes is usually >20k and the number of samples as well. The very popular 
> GTEx dataset is a good example (see for instance the RNA-Seq data at 
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is 
> just a .tsv file with two comment lines at the beginning). Everything done 
> on wide tables (even simple "describe" functions applied to all the gene 
> columns) either takes hours or gets frozen (because of lost executors), 
> irrespective of memory and number of cores, while the same operations work 
> well with pure pandas (without any Spark involved).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-07-28 Thread antonkulaga (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

antonkulaga updated SPARK-28547:

Description: 
Spark is super-slow for all wide data (when there are >15k columns and >15k 
rows). Most genomics/transcriptomics data is wide because the number of genes 
is usually >20k and the number of samples as well. The very popular GTEx 
dataset is a good example (see for instance the RNA-Seq data at 
https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is just 
a .tsv file with two comment lines at the beginning). Everything done on wide 
tables (even simple "describe" functions applied to all the gene columns) 
either takes hours or gets frozen (because of lost executors), irrespective of 
memory and number of cores, while the same operations run fast (in minutes) 
with pure pandas (without any Spark involved).

  was:
Spark is super-slow for all wide data (when there are >15k columns and >15k 
rows). Most genomics/transcriptomics data is wide because the number of genes 
is usually >20k and the number of samples as well. The very popular GTEx 
dataset is a good example (see for instance the RNA-Seq data at 
https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is just 
a .tsv file with two comment lines at the beginning). Everything done on wide 
tables (even simple "describe" functions applied to all the gene columns) 
either takes hours or gets frozen (because of lost executors), irrespective of 
memory and number of cores, while the same operations work well with pure 
pandas (without any Spark involved).


> Make it work for wide (> 10K columns data)
> --
>
> Key: SPARK-28547
> URL: https://issues.apache.org/jira/browse/SPARK-28547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4, 2.4.3
> Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per 
> node, 32 cores (tried different configurations of executors)
>Reporter: antonkulaga
>Priority: Critical
>
> Spark is super-slow for all wide data (when there are >15k columns and >15k 
> rows). Most genomics/transcriptomics data is wide because the number of 
> genes is usually >20k and the number of samples as well. The very popular 
> GTEx dataset is a good example (see for instance the RNA-Seq data at 
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is 
> just a .tsv file with two comment lines at the beginning). Everything done 
> on wide tables (even simple "describe" functions applied to all the gene 
> columns) either takes hours or gets frozen (because of lost executors), 
> irrespective of memory and number of cores, while the same operations run 
> fast (in minutes) with pure pandas (without any Spark involved).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28547) Make it work for wide (> 10K columns data)

2019-07-28 Thread antonkulaga (JIRA)
antonkulaga created SPARK-28547:
---

 Summary: Make it work for wide (> 10K columns data)
 Key: SPARK-28547
 URL: https://issues.apache.org/jira/browse/SPARK-28547
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.3, 2.4.4
 Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per node, 
32 cores (tried different configurations of executors)
Reporter: antonkulaga


Spark is super-slow for all wide data (when there are >15k columns and >15k 
rows). Most genomics/transcriptomics data is wide because the number of genes 
is usually >20k and the number of samples as well. The very popular GTEx 
dataset is a good example (see for instance the RNA-Seq data at 
https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where gct is just 
a .tsv file with two comment lines at the beginning). Everything done on wide 
tables either takes hours or gets frozen (because of lost executors), 
irrespective of memory and number of cores, while the same operations work well 
with pure pandas (without any Spark involved).
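
A rough reproduction sketch (the path, separator and matrix dimensions are assumptions, not taken from this report):
{code}
// Load a wide TSV matrix (e.g. ~20k gene columns x ~15k sample rows) and run a
// simple per-column summary, which is where wide jobs reportedly stall or lose
// executors.
val wide = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/wide_expression_matrix.tsv")

wide.describe().show()
{code}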



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org