[jira] [Updated] (SPARK-25462) hive on spark - got a weird output when count(*) from this script
[ https://issues.apache.org/jira/browse/SPARK-25462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gu Yuchen updated SPARK-25462: -- Environment: spark 1.6.2 hive 1.2.2 hadoop 2.7.1 was: spark 1.6.1 hive 1.2.2 hadoop 2.7.1 > hive on spark - got a weird output when count(*) from this script > -- > > Key: SPARK-25462 > URL: https://issues.apache.org/jira/browse/SPARK-25462 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 1.6.2 > Environment: spark 1.6.2 > hive 1.2.2 > hadoop 2.7.1 >Reporter: Gu Yuchen >Priority: Major > Attachments: jira.png, test.gz.parquet > > > > I use hiveContext to execute the script below: > with nt as (select label, score from (select * from (select label, score, > row_number() over (order by score desc) as position from t1)t_1 join (select > count(*) as countall from t1)t_2 )ta where position <= countall * 0.4) select > count(*) as c_positive from nt where label = 1 > and I got this result. > !jira.png! > It is weird when calling the 'count()' function on the RDD and the DataFrame; > as the picture shows, the outputs differ. > Can someone help me out? Thanks a lot. > > PS: the parquet file I used is the 'test.gz.parquet' in Attachments. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
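A minimal sketch of the comparison described above, written against the Spark 1.6 hiveContext API; the parquet file and table name t1 come from the report, while the surrounding code is assumed rather than taken from the reporter's script:
{code:scala}
// Hypothetical reconstruction of the reported setup (Spark 1.6).
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.read.parquet("test.gz.parquet").registerTempTable("t1")

// The CTE body from the report: keep the top 40% of rows by score.
val nt = hiveContext.sql(
  """select label, score from (
    |  select * from
    |    (select label, score, row_number() over (order by score desc) as position from t1) t_1
    |    join (select count(*) as countall from t1) t_2
    |) ta where position <= countall * 0.4""".stripMargin)

// The reported oddity: counting the same result through the DataFrame API
// and through the underlying RDD gives different numbers.
println(nt.filter("label = 1").count())
println(nt.filter("label = 1").rdd.count())
{code}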
[jira] [Reopened] (SPARK-25462) hive on spark - got a weird output when count(*) from this script
[ https://issues.apache.org/jira/browse/SPARK-25462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gu Yuchen reopened SPARK-25462: --- please help me out with this. thanks a lot > hive on spark - got a weird output when count(*) from this script > -- > > Key: SPARK-25462 > URL: https://issues.apache.org/jira/browse/SPARK-25462 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 1.6.2 > Environment: spark 1.6.1 > hive 1.2.2 > hadoop 2.7.1 >Reporter: Gu Yuchen >Priority: Major > Attachments: jira.png, test.gz.parquet > > > > use hiveContext to exec a script below: > with nt as (select label, score from (select * from (select label, score, > row_number() over (order by score desc) as position from t1)t_1 join (select > count(*) as countall from t1)t_2 )ta where position <= countall * 0.4) select > count(*) as c_positive from nt where label = 1 > and i got this result. > !jira.png! > it is weird when call the 'count()' func on rdd and dataframe, > as the pic says: different output here > can someone help me out? thanks a lot > > PS: the parquet file i used is the 'test.gz.parquet' in Attachments. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25454) Division between operands with negative scale can cause precision loss
[ https://issues.apache.org/jira/browse/SPARK-25454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620192#comment-16620192 ] Marco Gaido commented on SPARK-25454: - [~bersprockets] you're right, the only "wrong" thing of your statements is that the problem is not about 1000 but about 1e6, which in 2.2 was considered a decimal(10, 0) and now it is parsed as a decimal(1, -6). You could reproduce the same issue using {{lit(BigDecimal(1e6))}} in 2.2. So the problem is that we are not handling properly decimals with negative scale, but we are not forbidding their existence either, hence the issue. Making more common the presence of negative scale numbers made the issue more evident. Hope this is clear. Thanks. > Division between operands with negative scale can cause precision loss > -- > > Key: SPARK-25454 > URL: https://issues.apache.org/jira/browse/SPARK-25454 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Marco Gaido >Priority: Major > > The issue was originally reported by [~bersprockets] here: > https://issues.apache.org/jira/browse/SPARK-22036?focusedCommentId=16618104&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16618104. > The problem consist in a precision loss when the second operand of the > division is a decimal with a negative scale. It was present also before 2.3 > but it was harder to reproduce: you had to do something like > {{lit(BigDecimal(100e6))}}, while now this can happen more frequently with > SQL constants. > The problem is that our logic is taken from Hive and SQLServer where decimals > with negative scales are not allowed. We might also consider enforcing this > too in 3.0 eventually. Meanwhile we can fix the logic for computing the > result type for a division. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
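A hedged sketch of how one could observe the type promotion described in the comment above; the expected schemas are taken from the comment (decimal(1,-6) on 2.3.x, decimal(10,0) on 2.2), not from a captured run:
{code:scala}
// How is the SQL constant 1e6 typed?
spark.sql("select 1e6 as m").printSchema()

// 2.2-style reproduction mentioned in the comment: build the operand with an
// explicit literal instead of relying on how SQL constants are parsed.
import org.apache.spark.sql.functions.lit
spark.range(1).select(lit(BigDecimal(1e6)).as("m")).printSchema()
{code}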
[jira] [Moved] (SPARK-25462) hive on spark - got a weird output when count(*) from this script
[ https://issues.apache.org/jira/browse/SPARK-25462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gu Yuchen moved HIVE-20592 to SPARK-25462: -- Shepherd: Jeremy Affects Version/s: 1.6.2 Component/s: SQL Workflow: no-reopen-closed (was: no-reopen-closed, patch-avail) Issue Type: Question (was: Bug) Key: SPARK-25462 (was: HIVE-20592) Project: Spark (was: Hive) > hive on spark - got a weird output when count(*) from this script > -- > > Key: SPARK-25462 > URL: https://issues.apache.org/jira/browse/SPARK-25462 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 1.6.2 > Environment: spark 1.6.1 > hive 1.2.2 > hadoop 2.7.1 >Reporter: Gu Yuchen >Priority: Major > Attachments: jira.png, test.gz.parquet > > > > use hiveContext to exec a script below: > with nt as (select label, score from (select * from (select label, score, > row_number() over (order by score desc) as position from t1)t_1 join (select > count(*) as countall from t1)t_2 )ta where position <= countall * 0.4) select > count(*) as c_positive from nt where label = 1 > and i got this result. > !jira.png! > it is weird when call the 'count()' func on rdd and dataframe, > as the pic says: different output here > can someone help me out? thanks a lot > > PS: the parquet file i used is the 'test.gz.parquet' in Attachments. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25452) Query with where clause is giving unexpected result in case of float column
[ https://issues.apache.org/jira/browse/SPARK-25452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620156#comment-16620156 ] Hyukjin Kwon commented on SPARK-25452: -- Thanks. I would appreciate it if this can be identified as a duplicate or not. > Query with where clause is giving unexpected result in case of float column > --- > > Key: SPARK-25452 > URL: https://issues.apache.org/jira/browse/SPARK-25452 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: *Spark 2.3.1* > *Hadoop 2.7.2* >Reporter: Ayush Anubhava >Priority: Major > > *Description*: a query with a where clause gives an unexpected result when the filter column is a float.
> > {color:#d04437}*Query with filter less than or equal to gives an inappropriate result*{color}
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float);
> +---------+--+
> | Result  |
> +---------+--+
> +---------+--+
> 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (0,0.0);
> +---------+--+
> | Result  |
> +---------+--+
> +---------+--+
> 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (1,1.1);
> +---------+--+
> | Result  |
> +---------+--+
> +---------+--+
> 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >=0.0;
> +----+--------------------+--+
> | a  | b                  |
> +----+--------------------+--+
> | 0  | 0.0                |
> | 1  | 1.100000023841858  |
> +----+--------------------+--+
> Query with filter less than or equal to gives an inappropriate result:
> 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <=1.1;
> +----+------+--+
> | a  | b    |
> +----+------+--+
> | 0  | 0.0  |
> +----+------+--+
> 1 row selected (0.299 seconds)
> {code}
> -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25460) DataSourceV2: Structured Streaming does not respect SessionConfigSupport
[ https://issues.apache.org/jira/browse/SPARK-25460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620138#comment-16620138 ] Hyukjin Kwon commented on SPARK-25460: -- PR https://github.com/apache/spark/pull/22462 > DataSourceV2: Structured Streaming does not respect SessionConfigSupport > > > Key: SPARK-25460 > URL: https://issues.apache.org/jira/browse/SPARK-25460 > Project: Spark > Issue Type: Sub-task > Components: SQL, Structured Streaming >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > {{SessionConfigSupport}} allows configurations to be passed as options: > {code} > `spark.datasource.$keyPrefix.xxx` into `xxx`, > {code} > Currently, structured streaming does not seem to support this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23200) Reset configuration when restarting from checkpoints
[ https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yinan Li resolved SPARK-23200. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22392 [https://github.com/apache/spark/pull/22392] > Reset configuration when restarting from checkpoints > > > Key: SPARK-23200 > URL: https://issues.apache.org/jira/browse/SPARK-23200 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Major > Fix For: 2.4.0 > > > Streaming workloads and restarting from checkpoints may need additional > changes, i.e. resetting properties - see > https://github.com/apache-spark-on-k8s/spark/pull/516 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None
Chongyuan Xiang created SPARK-25461: --- Summary: PySpark Pandas UDF outputs incorrect results when input columns contain None Key: SPARK-25461 URL: https://issues.apache.org/jira/browse/SPARK-25461 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.1 Environment: I reproduced this issue by running pyspark locally on mac: Spark version: 2.3.1 pre-built with Hadoop 2.7 Python library versions: pyarrow==0.10.0, pandas==0.20.2 Reporter: Chongyuan Xiang The following PySpark script uses a simple pandas UDF to calculate a column given column 'A'. When column 'A' contains None, the results look incorrect. Script:
{code:java}
import pandas as pd
import random
import pyspark
from pyspark.sql.functions import col, lit, pandas_udf

values = [None] * 3 + [1.0] * 17 + [2.0] * 600
random.shuffle(values)
pdf = pd.DataFrame({'A': values})
df = spark.createDataFrame(pdf)

@pandas_udf(returnType=pyspark.sql.types.BooleanType())
def gt_2(column):
    return (column >= 2).where(column.notnull())

calculated_df = (df.select(['A'])
                 .withColumn('potential_bad_col', gt_2('A'))
                 )

calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) | (col("A").isNull()))
calculated_df.show()
{code}
Output:
{code:java}
+---+-----------------+-----------+
|  A|potential_bad_col|correct_col|
+---+-----------------+-----------+
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|1.0|            false|      false|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
|2.0|            false|       true|
+---+-----------------+-----------+
only showing top 20 rows
{code}
This problem disappears when the number of rows is small or when the input column does not contain None. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25460) DataSourceV2: Structured Streaming does not respect SessionConfigSupport
Hyukjin Kwon created SPARK-25460: Summary: DataSourceV2: Structured Streaming does not respect SessionConfigSupport Key: SPARK-25460 URL: https://issues.apache.org/jira/browse/SPARK-25460 Project: Spark Issue Type: Sub-task Components: SQL, Structured Streaming Affects Versions: 2.4.0 Reporter: Hyukjin Kwon {{SessionConfigSupport}} allows configurations to be passed as options: {code} `spark.datasource.$keyPrefix.xxx` into `xxx`, {code} Currently, structured streaming does not seem to support this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
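A small illustrative sketch of the prefix-stripping contract quoted above; the helper function and the "mysource" prefix are hypothetical, not Spark's actual implementation:
{code:scala}
// Session confs of the form spark.datasource.<keyPrefix>.<xxx> should reach the
// source as the option <xxx>; the ticket's point is that streaming currently
// skips this translation.
def extractSessionConfigs(keyPrefix: String, sessionConfs: Map[String, String]): Map[String, String] = {
  val prefix = s"spark.datasource.$keyPrefix."
  sessionConfs.collect { case (k, v) if k.startsWith(prefix) => k.stripPrefix(prefix) -> v }
}

extractSessionConfigs("mysource", Map("spark.datasource.mysource.path" -> "/tmp/data"))
// => Map("path" -> "/tmp/data")
{code}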
[jira] [Commented] (SPARK-25453) OracleIntegrationSuite IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
[ https://issues.apache.org/jira/browse/SPARK-25453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620058#comment-16620058 ] Chenxiao Mao commented on SPARK-25453: -- User 'seancxmao' has created a pull request for this issue: [https://github.com/apache/spark/pull/22461] > OracleIntegrationSuite IllegalArgumentException: Timestamp format must be > -mm-dd hh:mm:ss[.f] > - > > Key: SPARK-25453 > URL: https://issues.apache.org/jira/browse/SPARK-25453 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > {noformat} > - SPARK-22814 support date/timestamp types in partitionColumn *** FAILED *** > java.lang.IllegalArgumentException: Timestamp format must be -mm-dd > hh:mm:ss[.f] > at java.sql.Timestamp.valueOf(Timestamp.java:204) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.toInternalBoundValue(JDBCRelation.scala:183) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.columnPartition(JDBCRelation.scala:88) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:36) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167) > at > org.apache.spark.sql.jdbc.OracleIntegrationSuite$$anonfun$18.apply(OracleIntegrationSuite.scala:445) > at > org.apache.spark.sql.jdbc.OracleIntegrationSuite$$anonfun$18.apply(OracleIntegrationSuite.scala:427) > ...{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25459) Add viewOriginalText back to CatalogTable
Zheyuan Zhao created SPARK-25459: Summary: Add viewOriginalText back to CatalogTable Key: SPARK-25459 URL: https://issues.apache.org/jira/browse/SPARK-25459 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1, 2.3.0, 2.2.2, 2.2.1, 2.2.0 Reporter: Zheyuan Zhao The {{show create table}} will show a lot of generated attributes for views that created by older Spark version. See this test suite https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveSQLViewSuite.scala#L115. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19724) create a managed table with an existed default location should throw an exception
[ https://issues.apache.org/jira/browse/SPARK-19724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-19724. - Resolution: Fixed Assignee: Gengliang Wang Fix Version/s: 2.4.0 > create a managed table with an existed default location should throw an > exception > - > > Key: SPARK-19724 > URL: https://issues.apache.org/jira/browse/SPARK-19724 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > > This JIRA is a follow up work after > [SPARK-19583](https://issues.apache.org/jira/browse/SPARK-19583) > As we discussed in that [PR](https://github.com/apache/spark/pull/16938) > The following DDL for a managed table with an existed default location should > throw an exception: > {code} > CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... > CREATE TABLE ... (PARTITIONED BY ...) > {code} > Currently there are some situations which are not consist with above logic: > 1. CREATE TABLE ... (PARTITIONED BY ...) succeed with an existed default > location > situation: for both hive/datasource(with HiveExternalCatalog/InMemoryCatalog) > 2. CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... > situation: hive table succeed with an existed default location -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25454) Division between operands with negative scale can cause precision loss
[ https://issues.apache.org/jira/browse/SPARK-25454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619911#comment-16619911 ] Bruce Robbins commented on SPARK-25454: --- Thanks [~mgaido], OK, so the way I understand it - negative scales are the problem - they were also a problem in 2.2, but it was more difficult to reproduce. For my example case (in the referenced Jira), the change to the promotion of literal 1000 in 2.3 exposed an existing issue with the handling of 1e6. This bears out, in that when I replace 1e6 with 1000000, the issue goes away (at least for my example case):
{noformat}
scala> sql("select 26393499451/(1000000 * 1000) as c1").show
+------------+
|          c1|
+------------+
|26.393499451|
+------------+
{noformat}
> Division between operands with negative scale can cause precision loss > -- > > Key: SPARK-25454 > URL: https://issues.apache.org/jira/browse/SPARK-25454 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Marco Gaido >Priority: Major > > The issue was originally reported by [~bersprockets] here: > https://issues.apache.org/jira/browse/SPARK-22036?focusedCommentId=16618104&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16618104. > The problem consist in a precision loss when the second operand of the > division is a decimal with a negative scale. It was present also before 2.3 > but it was harder to reproduce: you had to do something like > {{lit(BigDecimal(100e6))}}, while now this can happen more frequently with > SQL constants. > The problem is that our logic is taken from Hive and SQLServer where decimals > with negative scales are not allowed. We might also consider enforcing this > too in 3.0 eventually. Meanwhile we can fix the logic for computing the > result type for a division. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24572) "eager execution" for R shell, IDE
[ https://issues.apache.org/jira/browse/SPARK-24572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619872#comment-16619872 ] Weiqiang Zhuang commented on SPARK-24572: - thanks [~felixcheung], raised PR https://github.com/apache/spark/pull/22455. > "eager execution" for R shell, IDE > -- > > Key: SPARK-24572 > URL: https://issues.apache.org/jira/browse/SPARK-24572 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Felix Cheung >Priority: Major > > like python in SPARK-24215 > we could also have eager execution when SparkDataFrame is returned to the R > shell -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24626) Parallelize size calculation in Analyze Table command
[ https://issues.apache.org/jira/browse/SPARK-24626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-24626. - Resolution: Fixed Assignee: Reynold Xin Fix Version/s: 2.4.0 > Parallelize size calculation in Analyze Table command > - > > Key: SPARK-24626 > URL: https://issues.apache.org/jira/browse/SPARK-24626 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Achuth Narayan Rajagopal >Assignee: Reynold Xin >Priority: Major > Fix For: 2.4.0 > > > Currently, Analyze table calculates table size sequentially for each > partition. We can parallelize size calculations over partitions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24626) Parallelize size calculation in Analyze Table command
[ https://issues.apache.org/jira/browse/SPARK-24626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-24626: Summary: Parallelize size calculation in Analyze Table command (was: Improve Analyze Table command) > Parallelize size calculation in Analyze Table command > - > > Key: SPARK-24626 > URL: https://issues.apache.org/jira/browse/SPARK-24626 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Achuth Narayan Rajagopal >Priority: Major > Fix For: 2.4.0 > > > Currently, Analyze table calculates table size sequentially for each > partition. We can parallelize size calculations over partitions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25419) Parquet predicate pushdown improvement
[ https://issues.apache.org/jira/browse/SPARK-25419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-25419: --- Assignee: Yuming Wang > Parquet predicate pushdown improvement > -- > > Key: SPARK-25419 > URL: https://issues.apache.org/jira/browse/SPARK-25419 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 2.4.0 > > > Parquet predicate pushdown support: ByteType, ShortType, DecimalType, > DateType, TimestampType. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25456) PythonForeachWriterSuite failing
[ https://issues.apache.org/jira/browse/SPARK-25456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-25456: Assignee: Imran Rashid > PythonForeachWriterSuite failing > > > Key: SPARK-25456 > URL: https://issues.apache.org/jira/browse/SPARK-25456 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Blocker > Fix For: 2.4.0 > > > This is failing regularly, see eg. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96164/testReport/junit/org.apache.spark.sql.execution.python/PythonForeachWriterSuite/UnsafeRowBuffer__iterator_blocks_when_no_data_is_available/ > I will post a fix shortly -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25456) PythonForeachWriterSuite failing
[ https://issues.apache.org/jira/browse/SPARK-25456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-25456. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22452 [https://github.com/apache/spark/pull/22452] > PythonForeachWriterSuite failing > > > Key: SPARK-25456 > URL: https://issues.apache.org/jira/browse/SPARK-25456 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Blocker > Fix For: 2.4.0 > > > This is failing regularly, see eg. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96164/testReport/junit/org.apache.spark.sql.execution.python/PythonForeachWriterSuite/UnsafeRowBuffer__iterator_blocks_when_no_data_is_available/ > I will post a fix shortly -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25291) Flakiness of tests in terms of executor memory (SecretsTestSuite)
[ https://issues.apache.org/jira/browse/SPARK-25291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yinan Li resolved SPARK-25291. -- Resolution: Fixed Fix Version/s: 2.4.0 > Flakiness of tests in terms of executor memory (SecretsTestSuite) > - > > Key: SPARK-25291 > URL: https://issues.apache.org/jira/browse/SPARK-25291 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Ilan Filonenko >Priority: Major > Fix For: 2.4.0 > > > SecretsTestSuite shows flakiness in terms of correct setting of executor > memory: > Run SparkPi with env and mount secrets. *** FAILED *** > "[884]Mi" did not equal "[1408]Mi" (KubernetesSuite.scala:272) > When ran with default settings -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25458) Support FOR ALL COLUMNS in ANALYZE TABLE
[ https://issues.apache.org/jira/browse/SPARK-25458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619310#comment-16619310 ] Dilip Biswal commented on SPARK-25458: -- [~smilegator] I would like to work on this. > Support FOR ALL COLUMNS in ANALYZE TABLE > - > > Key: SPARK-25458 > URL: https://issues.apache.org/jira/browse/SPARK-25458 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.5.0 >Reporter: Xiao Li >Priority: Major > > Currently, to collect the statistics of all the columns, users need to > specify the names of all the columns when calling the command "ANALYZE TABLE > ... FOR COLUMNS...". This is not user friendly. Instead, we can introduce the > following SQL command to achieve it without specifying the column names. > {code:java} >ANALYZE TABLE [db_name.]tablename COMPUTE STATISTICS FOR ALL COLUMNS; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25458) Support FOR ALL COLUMNS in ANALYZE TABLE
Xiao Li created SPARK-25458: --- Summary: Support FOR ALL COLUMNS in ANALYZE TABLE Key: SPARK-25458 URL: https://issues.apache.org/jira/browse/SPARK-25458 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.5.0 Reporter: Xiao Li Currently, to collect the statistics of all the columns, users need to specify the names of all the columns when calling the command "ANALYZE TABLE ... FOR COLUMNS...". This is not user friendly. Instead, we can introduce the following SQL command to achieve it without specifying the column names. {code:java} ANALYZE TABLE [db_name.]tablename COMPUTE STATISTICS FOR ALL COLUMNS; {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
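Usage sketch of the current command versus the proposed one; the table and column names are placeholders:
{code:scala}
// Today: every column has to be listed explicitly.
spark.sql("ANALYZE TABLE db_name.tablename COMPUTE STATISTICS FOR COLUMNS col1, col2, col3")

// Proposed by this ticket (not available yet): collect stats for all columns at once.
spark.sql("ANALYZE TABLE db_name.tablename COMPUTE STATISTICS FOR ALL COLUMNS")
{code}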
[jira] [Commented] (SPARK-18185) Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
[ https://issues.apache.org/jira/browse/SPARK-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619243#comment-16619243 ] Deepanker commented on SPARK-18185: --- Tested this even further and found out that it works for managed tables as well. But not for tables created via saveAsTable API of spark. As a test i did the following: saveAsTable [Stored as ORC ] [Partitioned by arbitrary column] [Doesn't work for this] Create Table like x [Using Beeline CLI] [Same properties as above table] [Works for this] Create external Table like x [Using Beeline CLI] [Same properties as above table] [Works for this] Is this the expected behaviour? I am using Spark 2.2 and Hive 1.1 > Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions > -- > > Key: SPARK-18185 > URL: https://issues.apache.org/jira/browse/SPARK-18185 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Major > Fix For: 2.1.0 > > > As of current 2.1, INSERT OVERWRITE with dynamic partitions against a > Datasource table will overwrite the entire table instead of only the updated > partitions as in Hive. It also doesn't respect custom partition locations. > We should delete only the proper partitions, scan the metastore for affected > partitions with custom locations, and ensure that deletes/writes go to the > right locations for those as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20236) Overwrite a partitioned data source table should only overwrite related partitions
[ https://issues.apache.org/jira/browse/SPARK-20236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619236#comment-16619236 ] Deepanker edited comment on SPARK-20236 at 9/18/18 3:07 PM: Hi Wenchen Fan, Tested this even further and found out that it works for managed tables as well. But not for tables created via saveAsTable API of spark. As a test i did the following: saveAsTable [Stored as ORC ] [Partitioned by arbitrary column] [Doesn't work for this] Create Table like x [Using Beeline CLI] [Same properties as above table] [Works for this] Create external Table like x [Using Beeline CLI] [Same properties as above table] [Works for this] Is this the expected behaviour? I am using Spark 2.2 and Hive 1.1 If/When find time can you also confirm my hypothesis for the difference between two Jira in my previous post? was (Author: deepanker): Hi Wenchen Fan, Tested this even further and found out that it works for managed tables as well. But not for tables created via saveAsTable API of spark. As a test i did the following: saveAsTable (x) [Stored as ORC ]partitioned by arbitrary column] [Stored as ORC ] [Doesn't work for this] Create Table like x [Using Beeline CLI] [Same properties as above table] [Works for this] Create external Table like x [Using Beeline CLI] [Same properties as above table] [Works for this] > Overwrite a partitioned data source table should only overwrite related > partitions > -- > > Key: SPARK-20236 > URL: https://issues.apache.org/jira/browse/SPARK-20236 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: releasenotes > Fix For: 2.3.0 > > > When we overwrite a partitioned data source table, currently Spark will > truncate the entire table to write new data, or truncate a bunch of > partitions according to the given static partitions. > For example, {{INSERT OVERWRITE tbl ...}} will truncate the entire table, > {{INSERT OVERWRITE tbl PARTITION (a=1, b)}} will truncate all the partitions > that starts with {{a=1}}. > This behavior is kind of reasonable as we can know which partitions will be > overwritten before runtime. However, hive has a different behavior that it > only overwrites related partitions, e.g. {{INSERT OVERWRITE tbl SELECT > 1,2,3}} will only overwrite partition {{a=2, b=3}}, assuming {{tbl}} has only > one data column and is partitioned by {{a}} and {{b}}. > It seems better if we can follow hive's behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20236) Overwrite a partitioned data source table should only overwrite related partitions
[ https://issues.apache.org/jira/browse/SPARK-20236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619236#comment-16619236 ] Deepanker commented on SPARK-20236: --- Hi Wenchen Fan, Tested this even further and found out that it works for managed tables as well. But not for tables created via saveAsTable API of spark. As a test i did the following: saveAsTable (x) [Stored as ORC ]partitioned by arbitrary column] [Stored as ORC ] [Doesn't work for this] Create Table like x [Using Beeline CLI] [Same properties as above table] [Works for this] Create external Table like x [Using Beeline CLI] [Same properties as above table] [Works for this] > Overwrite a partitioned data source table should only overwrite related > partitions > -- > > Key: SPARK-20236 > URL: https://issues.apache.org/jira/browse/SPARK-20236 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: releasenotes > Fix For: 2.3.0 > > > When we overwrite a partitioned data source table, currently Spark will > truncate the entire table to write new data, or truncate a bunch of > partitions according to the given static partitions. > For example, {{INSERT OVERWRITE tbl ...}} will truncate the entire table, > {{INSERT OVERWRITE tbl PARTITION (a=1, b)}} will truncate all the partitions > that starts with {{a=1}}. > This behavior is kind of reasonable as we can know which partitions will be > overwritten before runtime. However, hive has a different behavior that it > only overwrites related partitions, e.g. {{INSERT OVERWRITE tbl SELECT > 1,2,3}} will only overwrite partition {{a=2, b=3}}, assuming {{tbl}} has only > one data column and is partitioned by {{a}} and {{b}}. > It seems better if we can follow hive's behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25445) publish a scala 2.12 build with Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-25445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25445. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22441 [https://github.com/apache/spark/pull/22441] > publish a scala 2.12 build with Spark 2.4 > - > > Key: SPARK-25445 > URL: https://issues.apache.org/jira/browse/SPARK-25445 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25457) IntegralDivide (div) should not always return long
Marco Gaido created SPARK-25457: --- Summary: IntegralDivide (div) should not always return long Key: SPARK-25457 URL: https://issues.apache.org/jira/browse/SPARK-25457 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.5.0 Reporter: Marco Gaido The operation {{div}} always returns long. This comes from Hive's behavior, which differs from most other DBMSs (e.g. MySQL, Postgres), which return the same datatype as the operands. This JIRA tracks changing our return type and allowing users to re-enable the old behavior using {{spark.sql.legacy.integralDiv.returnLong}}. I'll submit a PR for this soon. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
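A hedged sketch of the behavior in question; the commented result types reflect the description above (current Hive-compatible behavior vs. the proposed operand-typed result), not a captured run:
{code:scala}
spark.sql("SELECT 7 div 2").printSchema()
// Today: the result column is always LongType, even for two int operands.
// After the proposed change it would be IntegerType here, unless
// spark.sql.legacy.integralDiv.returnLong is set to restore the old result type.
{code}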
[jira] [Commented] (SPARK-20236) Overwrite a partitioned data source table should only overwrite related partitions
[ https://issues.apache.org/jira/browse/SPARK-20236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619180#comment-16619180 ] Wenchen Fan commented on SPARK-20236: - It should work for managed table as well. Can you open a JIRA and report the issues for managed table? > Overwrite a partitioned data source table should only overwrite related > partitions > -- > > Key: SPARK-20236 > URL: https://issues.apache.org/jira/browse/SPARK-20236 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: releasenotes > Fix For: 2.3.0 > > > When we overwrite a partitioned data source table, currently Spark will > truncate the entire table to write new data, or truncate a bunch of > partitions according to the given static partitions. > For example, {{INSERT OVERWRITE tbl ...}} will truncate the entire table, > {{INSERT OVERWRITE tbl PARTITION (a=1, b)}} will truncate all the partitions > that starts with {{a=1}}. > This behavior is kind of reasonable as we can know which partitions will be > overwritten before runtime. However, hive has a different behavior that it > only overwrites related partitions, e.g. {{INSERT OVERWRITE tbl SELECT > 1,2,3}} will only overwrite partition {{a=2, b=3}}, assuming {{tbl}} has only > one data column and is partitioned by {{a}} and {{b}}. > It seems better if we can follow hive's behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
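A test sketch for the managed-table question above; the table name and data are placeholders, and spark.sql.sources.partitionOverwriteMode is assumed to be the switch introduced by this ticket's fix in 2.3:
{code:scala}
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

// One data column, partitioned by a and b, as in the description.
spark.range(10).selectExpr("id as value", "id % 2 as a", "id % 3 as b")
  .write.partitionBy("a", "b").saveAsTable("tbl")

// With dynamic mode this should replace only partition a=2, b=3.
spark.sql("INSERT OVERWRITE TABLE tbl SELECT 1, 2, 3")
spark.sql("SHOW PARTITIONS tbl").show(false)
{code}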
[jira] [Updated] (SPARK-24777) Add write benchmark for AVRO
[ https://issues.apache.org/jira/browse/SPARK-24777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-24777: --- Summary: Add write benchmark for AVRO (was: Refactor AVRO read/write benchmark) > Add write benchmark for AVRO > > > Key: SPARK-24777 > URL: https://issues.apache.org/jira/browse/SPARK-24777 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25453) OracleIntegrationSuite IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
[ https://issues.apache.org/jira/browse/SPARK-25453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619165#comment-16619165 ] Chenxiao Mao commented on SPARK-25453: -- I'm working on this. cc [~maropu] [~yumwang] > OracleIntegrationSuite IllegalArgumentException: Timestamp format must be > -mm-dd hh:mm:ss[.f] > - > > Key: SPARK-25453 > URL: https://issues.apache.org/jira/browse/SPARK-25453 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > {noformat} > - SPARK-22814 support date/timestamp types in partitionColumn *** FAILED *** > java.lang.IllegalArgumentException: Timestamp format must be -mm-dd > hh:mm:ss[.f] > at java.sql.Timestamp.valueOf(Timestamp.java:204) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.toInternalBoundValue(JDBCRelation.scala:183) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.columnPartition(JDBCRelation.scala:88) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:36) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167) > at > org.apache.spark.sql.jdbc.OracleIntegrationSuite$$anonfun$18.apply(OracleIntegrationSuite.scala:445) > at > org.apache.spark.sql.jdbc.OracleIntegrationSuite$$anonfun$18.apply(OracleIntegrationSuite.scala:427) > ...{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25456) PythonForeachWriterSuite failing
Imran Rashid created SPARK-25456: Summary: PythonForeachWriterSuite failing Key: SPARK-25456 URL: https://issues.apache.org/jira/browse/SPARK-25456 Project: Spark Issue Type: Test Components: SQL Affects Versions: 2.4.0 Reporter: Imran Rashid This is failing regularly, see eg. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96164/testReport/junit/org.apache.spark.sql.execution.python/PythonForeachWriterSuite/UnsafeRowBuffer__iterator_blocks_when_no_data_is_available/ I will post a fix shortly -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25455) Spark bundles jackson library version, which is vulnerable
Madhusudan N created SPARK-25455: Summary: Spark bundles jackson library version, which is vulnerable Key: SPARK-25455 URL: https://issues.apache.org/jira/browse/SPARK-25455 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.1, 2.2.0 Reporter: Madhusudan N We have hosted one of our applications in Spark standalone mode, and the application has the jackson library dependencies below, all at version 2.9.6:
* jackson-core
* jackson-databind
* jackson-dataformat-cbor
* jackson-dataformat-xml
* jackson-dataformat-yaml
Due to a vulnerability in jackson 2.6.6, as indicated by Veracode, it has been upgraded to version 2.9.6. Please find the link which describes the vulnerability issue with jackson 2.6.6: [http://cwe.mitre.org/data/definitions/470.html] Spark (2.2.0 and 2.3.1) depends on jackson-core 2.6.5 and jackson-core 2.6.7, but our application needs jackson-core 2.9.6. Because of this, the application crashes. Please find the stack trace below:
{noformat}
Exception in thread "main" [Loaded java.lang.Throwable$WrappedPrintStream from /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/rt.jar]
java.lang.NoSuchFieldError: NO_INTS
	at com.fasterxml.jackson.dataformat.cbor.CBORParser.<init>(CBORParser.java:285)
	at com.fasterxml.jackson.dataformat.cbor.CBORParserBootstrapper.constructParser(CBORParserBootstrapper.java:91)
	at com.fasterxml.jackson.dataformat.cbor.CBORFactory._createParser(CBORFactory.java:377)
{noformat}
Spark needs to use jackson-core 2.9.6, which does not have the vulnerability. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
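One common mitigation for the conflict described above is to shade the application's newer Jackson so it no longer collides with the 2.6.x copy on Spark's classpath. A minimal sketch assuming the sbt-assembly plugin; the rename target is arbitrary:
{code:scala}
// build.sbt (sbt-assembly): relocate the app's Jackson classes out of Spark's way.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.fasterxml.jackson.**" -> "shadedjackson.@1").inAll
)
{code}
Alternatively, the spark.driver.userClassPathFirst / spark.executor.userClassPathFirst settings can be worth trying, though they are marked experimental.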
[jira] [Commented] (SPARK-22036) BigDecimal multiplication sometimes returns null
[ https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619129#comment-16619129 ] Marco Gaido commented on SPARK-22036: - [~bersprockets] I created SPARK-25454 for tracking since I have a path for this and it might be considered as a blocker for 2.4, so I wanted to expedite it. I am submitting a patch for this soon. Sorry for the problem again. Thanks. > BigDecimal multiplication sometimes returns null > > > Key: SPARK-22036 > URL: https://issues.apache.org/jira/browse/SPARK-22036 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Olivier Blanvillain >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.0 > > > The multiplication of two BigDecimal numbers sometimes returns null. Here is > a minimal reproduction: > {code:java} > object Main extends App { > import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.sql.SparkSession > import spark.implicits._ > val conf = new > SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", > "false") > val spark = > SparkSession.builder().config(conf).appName("REPL").getOrCreate() > implicit val sqlContext = spark.sqlContext > case class X2(a: BigDecimal, b: BigDecimal) > val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), > BigDecimal(-1000.1 > val result = ds.select(ds("a") * ds("b")).collect.head > println(result) // [null] > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25454) Division between operands with negative scale can cause precision loss
Marco Gaido created SPARK-25454: --- Summary: Division between operands with negative scale can cause precision loss Key: SPARK-25454 URL: https://issues.apache.org/jira/browse/SPARK-25454 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1, 2.3.0 Reporter: Marco Gaido The issue was originally reported by [~bersprockets] here: https://issues.apache.org/jira/browse/SPARK-22036?focusedCommentId=16618104&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16618104. The problem consist in a precision loss when the second operand of the division is a decimal with a negative scale. It was present also before 2.3 but it was harder to reproduce: you had to do something like {{lit(BigDecimal(100e6))}}, while now this can happen more frequently with SQL constants. The problem is that our logic is taken from Hive and SQLServer where decimals with negative scales are not allowed. We might also consider enforcing this too in 3.0 eventually. Meanwhile we can fix the logic for computing the result type for a division. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25453) OracleIntegrationSuite IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
[ https://issues.apache.org/jira/browse/SPARK-25453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619121#comment-16619121 ] Takeshi Yamamuro commented on SPARK-25453: -- oh, thanks. Can you fix this? > OracleIntegrationSuite IllegalArgumentException: Timestamp format must be > -mm-dd hh:mm:ss[.f] > - > > Key: SPARK-25453 > URL: https://issues.apache.org/jira/browse/SPARK-25453 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > {noformat} > - SPARK-22814 support date/timestamp types in partitionColumn *** FAILED *** > java.lang.IllegalArgumentException: Timestamp format must be -mm-dd > hh:mm:ss[.f] > at java.sql.Timestamp.valueOf(Timestamp.java:204) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.toInternalBoundValue(JDBCRelation.scala:183) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.columnPartition(JDBCRelation.scala:88) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:36) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167) > at > org.apache.spark.sql.jdbc.OracleIntegrationSuite$$anonfun$18.apply(OracleIntegrationSuite.scala:445) > at > org.apache.spark.sql.jdbc.OracleIntegrationSuite$$anonfun$18.apply(OracleIntegrationSuite.scala:427) > ...{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25453) OracleIntegrationSuite IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
[ https://issues.apache.org/jira/browse/SPARK-25453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619095#comment-16619095 ] Yuming Wang commented on SPARK-25453: - cc [~maropu] > OracleIntegrationSuite IllegalArgumentException: Timestamp format must be > -mm-dd hh:mm:ss[.f] > - > > Key: SPARK-25453 > URL: https://issues.apache.org/jira/browse/SPARK-25453 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > {noformat} > - SPARK-22814 support date/timestamp types in partitionColumn *** FAILED *** > java.lang.IllegalArgumentException: Timestamp format must be -mm-dd > hh:mm:ss[.f] > at java.sql.Timestamp.valueOf(Timestamp.java:204) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.toInternalBoundValue(JDBCRelation.scala:183) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.columnPartition(JDBCRelation.scala:88) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:36) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167) > at > org.apache.spark.sql.jdbc.OracleIntegrationSuite$$anonfun$18.apply(OracleIntegrationSuite.scala:445) > at > org.apache.spark.sql.jdbc.OracleIntegrationSuite$$anonfun$18.apply(OracleIntegrationSuite.scala:427) > ...{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25453) OracleIntegrationSuite IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
Yuming Wang created SPARK-25453: --- Summary: OracleIntegrationSuite IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff] Key: SPARK-25453 URL: https://issues.apache.org/jira/browse/SPARK-25453 Project: Spark Issue Type: Test Components: Tests Affects Versions: 2.4.0 Reporter: Yuming Wang
{noformat}
- SPARK-22814 support date/timestamp types in partitionColumn *** FAILED ***
  java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
  at java.sql.Timestamp.valueOf(Timestamp.java:204)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.toInternalBoundValue(JDBCRelation.scala:183)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.columnPartition(JDBCRelation.scala:88)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:36)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
  at org.apache.spark.sql.jdbc.OracleIntegrationSuite$$anonfun$18.apply(OracleIntegrationSuite.scala:445)
  at org.apache.spark.sql.jdbc.OracleIntegrationSuite$$anonfun$18.apply(OracleIntegrationSuite.scala:427)
  ...
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
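The message above comes from java.sql.Timestamp.valueOf, which only accepts the JDBC timestamp escape format; a quick plain-JDK illustration (the dates are arbitrary):
{code:scala}
java.sql.Timestamp.valueOf("2018-07-06 05:50:00")  // parses fine
java.sql.Timestamp.valueOf("2018-07-06")           // throws IllegalArgumentException:
                                                   // Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
{code}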
[jira] [Updated] (SPARK-25452) Query with where clause is giving unexpected result in case of float column
[ https://issues.apache.org/jira/browse/SPARK-25452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Anubhava updated SPARK-25452: --- Summary: Query with where clause is giving unexpected result in case of float column (was: Query with clause is giving unexpected result in case of float coloumn) > Query with where clause is giving unexpected result in case of float column > --- > > Key: SPARK-25452 > URL: https://issues.apache.org/jira/browse/SPARK-25452 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: *Spark 2.3.1* > *Hadoop 2.7.2* >Reporter: Ayush Anubhava >Priority: Major > > *Description* : Query with clause is giving unexpected result in case of > float column > > {color:#d04437}*Query with filter less than equal to is giving inappropriate > result{code}*{color} > {code} > 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values > (0,0.0); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values > (1,1.1); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >=0.0; > +++--+ > | a | b | > +++--+ > | 0 | 0.0 | > | 1 | 1.10023841858 | > +++--+ > Query with filter less than equal to is giving in appropriate result > 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <=1.1; > ++--+--+ > | a | b | > ++--+--+ > | 0 | 0.0 | > ++--+--+ > 1 row selected (0.299 seconds) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25452) Query with clause is giving unexpected result in case of float coloumn
[ https://issues.apache.org/jira/browse/SPARK-25452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619060#comment-16619060 ] Ayush Anubhava edited comment on SPARK-25452 at 9/18/18 12:40 PM: -- Hi HyukjiKwon , Thank you so much for the update. Will cherry pick the PR and check . My issue is related with filter where in filter datatype is float . Ideally if datatype is float then it should internally cast the value as well to the corresponding datatype. Oracle db also behaves the same .i.e we need not give cast explicitly. was (Author: ayush007): Hi HyukjiKwon , Thank you so much for the update. Will cherry pick the PR and check . My issue is related with filter where in filter datatype is float . Ideally if datatype is float then it should internally cast the value as well to the corresponding datatype. Oracle also behaves the same .i.e we need not give cast explicitly. > Query with clause is giving unexpected result in case of float coloumn > -- > > Key: SPARK-25452 > URL: https://issues.apache.org/jira/browse/SPARK-25452 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: *Spark 2.3.1* > *Hadoop 2.7.2* >Reporter: Ayush Anubhava >Priority: Major > > *Description* : Query with clause is giving unexpected result in case of > float column > > {color:#d04437}*Query with filter less than equal to is giving inappropriate > result{code}*{color} > {code} > 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values > (0,0.0); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values > (1,1.1); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >=0.0; > +++--+ > | a | b | > +++--+ > | 0 | 0.0 | > | 1 | 1.10023841858 | > +++--+ > Query with filter less than equal to is giving in appropriate result > 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <=1.1; > ++--+--+ > | a | b | > ++--+--+ > | 0 | 0.0 | > ++--+--+ > 1 row selected (0.299 seconds) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25452) Query with clause is giving unexpected result in case of float coloumn
[ https://issues.apache.org/jira/browse/SPARK-25452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619060#comment-16619060 ] Ayush Anubhava edited comment on SPARK-25452 at 9/18/18 12:39 PM: -- Hi HyukjiKwon , Thank you so much for the update. Will cherry pick the PR and check . My issue is related with filter where in filter datatype is float . Ideally if datatype is float then it should internally cast the value as well to the corresponding datatype. Oracle also behaves the same .i.e we need not give cast explicitly. was (Author: ayush007): Hi HyukjiKwon , Thank you so much for the update. Will cherry pick the PR and check . My issue is related with filter where in filter datatype is float . Ideally if datatype is float then it should internally cast the value as well to the corresponding datatype. > Query with clause is giving unexpected result in case of float coloumn > -- > > Key: SPARK-25452 > URL: https://issues.apache.org/jira/browse/SPARK-25452 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: *Spark 2.3.1* > *Hadoop 2.7.2* >Reporter: Ayush Anubhava >Priority: Major > > *Description* : Query with clause is giving unexpected result in case of > float column > > {color:#d04437}*Query with filter less than equal to is giving inappropriate > result{code}*{color} > {code} > 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values > (0,0.0); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values > (1,1.1); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >=0.0; > +++--+ > | a | b | > +++--+ > | 0 | 0.0 | > | 1 | 1.10023841858 | > +++--+ > Query with filter less than equal to is giving in appropriate result > 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <=1.1; > ++--+--+ > | a | b | > ++--+--+ > | 0 | 0.0 | > ++--+--+ > 1 row selected (0.299 seconds) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25452) Query with clause is giving unexpected result in case of float coloumn
[ https://issues.apache.org/jira/browse/SPARK-25452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619060#comment-16619060 ] Ayush Anubhava commented on SPARK-25452: Hi HyukjiKwon , Thank you so much for the update. Will cherry pick the PR and check . My issue is related with filter where in filter datatype is float . Ideally if datatype is float then it should internally cast the value as well to the corresponding datatype. > Query with clause is giving unexpected result in case of float coloumn > -- > > Key: SPARK-25452 > URL: https://issues.apache.org/jira/browse/SPARK-25452 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: *Spark 2.3.1* > *Hadoop 2.7.2* >Reporter: Ayush Anubhava >Priority: Major > > *Description* : Query with clause is giving unexpected result in case of > float column > > {color:#d04437}*Query with filter less than equal to is giving inappropriate > result{code}*{color} > {code} > 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values > (0,0.0); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values > (1,1.1); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >=0.0; > +++--+ > | a | b | > +++--+ > | 0 | 0.0 | > | 1 | 1.10023841858 | > +++--+ > Query with filter less than equal to is giving in appropriate result > 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <=1.1; > ++--+--+ > | a | b | > ++--+--+ > | 0 | 0.0 | > ++--+--+ > 1 row selected (0.299 seconds) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25452) Query with clause is giving unexpected result in case of float coloumn
[ https://issues.apache.org/jira/browse/SPARK-25452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Anubhava updated SPARK-25452: --- Description: *Description* : Query with clause is giving unexpected result in case of float column {color:#d04437}*Query with filter less than equal to is giving inappropriate result{code}*{color} {code} 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (0,0.0); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (1,1.1); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >=0.0; +++--+ | a | b | +++--+ | 0 | 0.0 | | 1 | 1.10023841858 | +++--+ Query with filter less than equal to is giving in appropriate result 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <=1.1; ++--+--+ | a | b | ++--+--+ | 0 | 0.0 | ++--+--+ 1 row selected (0.299 seconds) {code} was: *Description* : Query with clause is giving unexpected result in case of float column {color:#d04437}*Query with filter less than equal to is giving in appropriate result{code}*{color} {code} 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (0,0.0); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (1,1.1); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >=0.0; +++--+ | a | b | +++--+ | 0 | 0.0 | | 1 | 1.10023841858 | +++--+ Query with filter less than equal to is giving in appropriate result 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <=1.1; ++--+--+ | a | b | ++--+--+ | 0 | 0.0 | ++--+--+ 1 row selected (0.299 seconds) {code} > Query with clause is giving unexpected result in case of float coloumn > -- > > Key: SPARK-25452 > URL: https://issues.apache.org/jira/browse/SPARK-25452 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: *Spark 2.3.1* > *Hadoop 2.7.2* >Reporter: Ayush Anubhava >Priority: Major > > *Description* : Query with clause is giving unexpected result in case of > float column > > {color:#d04437}*Query with filter less than equal to is giving inappropriate > result{code}*{color} > {code} > 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values > (0,0.0); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values > (1,1.1); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >=0.0; > +++--+ > | a | b | > +++--+ > | 0 | 0.0 | > | 1 | 1.10023841858 | > +++--+ > Query with filter less than equal to is giving in appropriate result > 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <=1.1; > ++--+--+ > | a | b | > ++--+--+ > | 0 | 0.0 | > ++--+--+ > 1 row selected (0.299 seconds) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20236) Overwrite a partitioned data source table should only overwrite related partitions
[ https://issues.apache.org/jira/browse/SPARK-20236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618950#comment-16618950 ] Deepanker edited comment on SPARK-20236 at 9/18/18 11:40 AM: - What is the difference between this Jira and these ones: https://issues.apache.org/jira/browse/SPARK-18185, https://issues.apache.org/jira/browse/SPARK-18183 I tested this out with spark 2.2 (which confirms the fix was present before 2.3 as well) this only works for external tables not managed tables in hive? Any reason why is that? Now we can enable/disable this behaviour via this property: _spark.sql.sources.partitionOverwriteMode_ whereas previously it was default? *Update:* I got it. SPARK-20236 provides a feature flag to override this behaviour via the above mentioned property whereas the other Jira fixes the insert overwrite behaviour overall. Although this still doesn't work for Hive managed tables only for external tables. Is this behaviour intentional (as in an external table is considered as datasource table managed via Hive whereas a managed table doesn't)? was (Author: deepanker): What is the difference between this Jira and these ones: https://issues.apache.org/jira/browse/SPARK-18185, https://issues.apache.org/jira/browse/SPARK-18183 I tested this out with spark 2.2 (which confirms the fix was present before 2.3 as well) this only works for external tables not managed tables in hive? Any reason why is that? Now we can enable/disable this behaviour via this property: _spark.sql.sources.partitionOverwriteMode_ whereas previously it was default? *Update:* I got it. SPARK-20236 provides a feature flag to override this behaviour via the above mentioned property whereas the other Jira fixes the insert overwrite behaviour overall. > Overwrite a partitioned data source table should only overwrite related > partitions > -- > > Key: SPARK-20236 > URL: https://issues.apache.org/jira/browse/SPARK-20236 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: releasenotes > Fix For: 2.3.0 > > > When we overwrite a partitioned data source table, currently Spark will > truncate the entire table to write new data, or truncate a bunch of > partitions according to the given static partitions. > For example, {{INSERT OVERWRITE tbl ...}} will truncate the entire table, > {{INSERT OVERWRITE tbl PARTITION (a=1, b)}} will truncate all the partitions > that starts with {{a=1}}. > This behavior is kind of reasonable as we can know which partitions will be > overwritten before runtime. However, hive has a different behavior that it > only overwrites related partitions, e.g. {{INSERT OVERWRITE tbl SELECT > 1,2,3}} will only overwrite partition {{a=2, b=3}}, assuming {{tbl}} has only > one data column and is partitioned by {{a}} and {{b}}. > It seems better if we can follow hive's behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
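For anyone trying the flag mentioned in this thread, here is a small, self-contained sketch of spark.sql.sources.partitionOverwriteMode on a datasource (USING parquet) table. It assumes Spark 2.3+ and a local session; the table and column names are made up for the example, and it is not meant to settle the managed-versus-external question raised above.
{code}
import org.apache.spark.sql.SparkSession

object PartitionOverwriteSketch extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("partition-overwrite-sketch")
    // "static" (the default) truncates all matching partitions before writing;
    // "dynamic" only overwrites partitions that actually receive new rows.
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
  import spark.implicits._

  spark.sql("create table tbl (value int, a int, b int) using parquet partitioned by (a, b)")
  Seq((1, 1, 1), (2, 2, 3)).toDF("value", "a", "b").write.insertInto("tbl")

  // With dynamic mode, only partition (a=2, b=3) is rewritten; (a=1, b=1) is left alone.
  spark.sql("insert overwrite table tbl select 3, 2, 3")

  spark.sql("select * from tbl order by a, b").show()
  spark.stop()
}
{code}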
[jira] [Comment Edited] (SPARK-18185) Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
[ https://issues.apache.org/jira/browse/SPARK-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618949#comment-16618949 ] Deepanker edited comment on SPARK-18185 at 9/18/18 11:40 AM: - What is the difference between this Jira and this one: https://issues.apache.org/jira/browse/SPARK-20236 I tested this out with spark 2.2 this only works for external tables not managed tables in hive? Any reason why is that? With 2.3 we can enable/disable this behaviour via this property: _spark.sql.sources.partitionOverwriteMode_ whereas previously it was default? *Update:* I got it. SPARK-20236 provides a feature flag to override this behaviour via the above mentioned property whereas this Jira fixes the insert overwrite behaviour overall. Although this still doesn't work for Hive managed tables only for external tables. Is this behaviour intentional (as in an external table is considered as datasource table managed via Hive whereas a managed table doesn't) ? was (Author: deepanker): What is the difference between this Jira and this one: https://issues.apache.org/jira/browse/SPARK-20236 I tested this out with spark 2.2 this only works for external tables not managed tables in hive? Any reason why is that? With 2.3 we can enable/disable this behaviour via this property: _spark.sql.sources.partitionOverwriteMode_ whereas previously it was default? *Update:* I got it. SPARK-20236 provides a feature flag to override this behaviour via the above mentioned property whereas this Jira fixes the insert overwrite behaviour overall. Although this still doesn't work for Hive managed tables only for external tables. Is this behaviour intentional? > Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions > -- > > Key: SPARK-18185 > URL: https://issues.apache.org/jira/browse/SPARK-18185 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Major > Fix For: 2.1.0 > > > As of current 2.1, INSERT OVERWRITE with dynamic partitions against a > Datasource table will overwrite the entire table instead of only the updated > partitions as in Hive. It also doesn't respect custom partition locations. > We should delete only the proper partitions, scan the metastore for affected > partitions with custom locations, and ensure that deletes/writes go to the > right locations for those as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18185) Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
[ https://issues.apache.org/jira/browse/SPARK-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618949#comment-16618949 ] Deepanker edited comment on SPARK-18185 at 9/18/18 11:38 AM: - What is the difference between this Jira and this one: https://issues.apache.org/jira/browse/SPARK-20236 I tested this out with spark 2.2 this only works for external tables not managed tables in hive? Any reason why is that? With 2.3 we can enable/disable this behaviour via this property: _spark.sql.sources.partitionOverwriteMode_ whereas previously it was default? *Update:* I got it. SPARK-20236 provides a feature flag to override this behaviour via the above mentioned property whereas this Jira fixes the insert overwrite behaviour overall. Although this still doesn't work for Hive managed tables only for external tables. Is this behaviour intentional? was (Author: deepanker): What is the difference between this Jira and this one: https://issues.apache.org/jira/browse/SPARK-20236 I tested this out with spark 2.2 this only works for external tables not managed tables in hive? Any reason why is that? With 2.3 we can enable/disable this behaviour via this property: _spark.sql.sources.partitionOverwriteMode_ whereas previously it was default? *Update:* I got it. [SPARK-20236|https://issues.apache.org/jira/browse/SPARK-20236] provides a feature flag to override this behaviour via the above mentioned property whereas this Jira fixes the insert overwrite behaviour overall. > Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions > -- > > Key: SPARK-18185 > URL: https://issues.apache.org/jira/browse/SPARK-18185 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Major > Fix For: 2.1.0 > > > As of current 2.1, INSERT OVERWRITE with dynamic partitions against a > Datasource table will overwrite the entire table instead of only the updated > partitions as in Hive. It also doesn't respect custom partition locations. > We should delete only the proper partitions, scan the metastore for affected > partitions with custom locations, and ensure that deletes/writes go to the > right locations for those as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25452) Query with clause is giving unexpected result in case of float coloumn
[ https://issues.apache.org/jira/browse/SPARK-25452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618974#comment-16618974 ] Hyukjin Kwon commented on SPARK-25452: -- Is this a duplicate of SPARK-24829? > Query with clause is giving unexpected result in case of float coloumn > -- > > Key: SPARK-25452 > URL: https://issues.apache.org/jira/browse/SPARK-25452 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: *Spark 2.3.1* > *Hadoop 2.7.2* >Reporter: Ayush Anubhava >Priority: Major > > *Description* : Query with clause is giving unexpected result in case of > float column > > {color:#d04437}*Query with filter less than equal to is giving in appropriate > result{code}*{color} > {code} > 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values > (0,0.0); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values > (1,1.1); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >=0.0; > +++--+ > | a | b | > +++--+ > | 0 | 0.0 | > | 1 | 1.10023841858 | > +++--+ > Query with filter less than equal to is giving in appropriate result > 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <=1.1; > ++--+--+ > | a | b | > ++--+--+ > | 0 | 0.0 | > ++--+--+ > 1 row selected (0.299 seconds) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20236) Overwrite a partitioned data source table should only overwrite related partitions
[ https://issues.apache.org/jira/browse/SPARK-20236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618950#comment-16618950 ] Deepanker edited comment on SPARK-20236 at 9/18/18 11:35 AM: - What is the difference between this Jira and these ones: https://issues.apache.org/jira/browse/SPARK-18185, https://issues.apache.org/jira/browse/SPARK-18183 I tested this out with spark 2.2 (which confirms the fix was present before 2.3 as well) this only works for external tables not managed tables in hive? Any reason why is that? Now we can enable/disable this behaviour via this property: _spark.sql.sources.partitionOverwriteMode_ whereas previously it was default? *Update:* I got it. SPARK-20236 provides a feature flag to override this behaviour via the above mentioned property whereas the other Jira fixes the insert overwrite behaviour overall. was (Author: deepanker): What is the difference between this Jira and these ones: https://issues.apache.org/jira/browse/SPARK-18185, https://issues.apache.org/jira/browse/SPARK-18183 I tested this out with spark 2.2 (which confirms the fix was present before 2.3 as well) this only works for external tables not managed tables in hive? Any reason why is that? Now we can enable/disable this behaviour via this property: {{spark.sql.sources.partitionOverwriteMode }}whereas previously it was default? > Overwrite a partitioned data source table should only overwrite related > partitions > -- > > Key: SPARK-20236 > URL: https://issues.apache.org/jira/browse/SPARK-20236 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: releasenotes > Fix For: 2.3.0 > > > When we overwrite a partitioned data source table, currently Spark will > truncate the entire table to write new data, or truncate a bunch of > partitions according to the given static partitions. > For example, {{INSERT OVERWRITE tbl ...}} will truncate the entire table, > {{INSERT OVERWRITE tbl PARTITION (a=1, b)}} will truncate all the partitions > that starts with {{a=1}}. > This behavior is kind of reasonable as we can know which partitions will be > overwritten before runtime. However, hive has a different behavior that it > only overwrites related partitions, e.g. {{INSERT OVERWRITE tbl SELECT > 1,2,3}} will only overwrite partition {{a=2, b=3}}, assuming {{tbl}} has only > one data column and is partitioned by {{a}} and {{b}}. > It seems better if we can follow hive's behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18185) Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
[ https://issues.apache.org/jira/browse/SPARK-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618949#comment-16618949 ] Deepanker edited comment on SPARK-18185 at 9/18/18 11:34 AM: - What is the difference between this Jira and this one: https://issues.apache.org/jira/browse/SPARK-20236 I tested this out with spark 2.2 this only works for external tables not managed tables in hive? Any reason why is that? With 2.3 we can enable/disable this behaviour via this property: _spark.sql.sources.partitionOverwriteMode_ whereas previously it was default? *Update:* I got it. [SPARK-20236|https://issues.apache.org/jira/browse/SPARK-20236] provides a feature flag to override this behaviour via the above mentioned property whereas this Jira fixes the insert overwrite behaviour overall. was (Author: deepanker): What is the difference between this Jira and this one: https://issues.apache.org/jira/browse/SPARK-20236 I tested this out with spark 2.2 this only works for external tables not managed tables in hive? Any reason why is that? With 2.3 we can enable/disable this behaviour via this property: _spark.sql.sources.partitionOverwriteMode_ whereas previously it was default? > Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions > -- > > Key: SPARK-18185 > URL: https://issues.apache.org/jira/browse/SPARK-18185 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Major > Fix For: 2.1.0 > > > As of current 2.1, INSERT OVERWRITE with dynamic partitions against a > Datasource table will overwrite the entire table instead of only the updated > partitions as in Hive. It also doesn't respect custom partition locations. > We should delete only the proper partitions, scan the metastore for affected > partitions with custom locations, and ensure that deletes/writes go to the > right locations for those as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25411) Implement range partition in Spark
[ https://issues.apache.org/jira/browse/SPARK-25411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wang, Gang updated SPARK-25411: --- Description: In our product environment, there are some partitioned fact tables, which are all quite huge. To accelerate join execution, we need make them also bucketed. Than comes the problem, if the bucket number is large enough, there may be too many files(files count = bucket number * partition count), which may bring pressure to the HDFS. And if the bucket number is small, Spark will launch equal number of tasks to read/write it. So, can we implement a new partition support range values, just like range partition in Oracle/MySQL ([https://docs.oracle.com/cd/E17952_01/mysql-5.7-en/partitioning-range.html]). Say, we can partition by a date column, and make every two months as a partition, or partitioned by a integer column, make interval of 1 as a partition. Ideally, feature like range partition should be implemented in Hive. While, it's been always hard to update Hive version in a prod environment, and much lightweight and flexible if we implement it in Spark. was: In our PROD environment, there are some partitioned fact tables, which are all quite huge. To accelerate join execution, we need make them also bucketed. Than comes the problem, if the bucket number is large enough, there may be too many files(files count = bucket number * partition count), which may bring pressure to the HDFS. And if the bucket number is small, Spark will launch equal number of tasks to read/write it. So, can we implement a new partition support range values, just like range partition in Oracle/MySQL ([https://docs.oracle.com/cd/E17952_01/mysql-5.7-en/partitioning-range.html]). Say, we can partition by a date column, and make every two months as a partition, or partitioned by a integer column, make interval of 1 as a partition. Ideally, feature like range partition should be implemented in Hive. While, it's been always hard to update Hive version in a prod environment, and much lightweight and flexible if we implement it in Spark. > Implement range partition in Spark > -- > > Key: SPARK-25411 > URL: https://issues.apache.org/jira/browse/SPARK-25411 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wang, Gang >Priority: Major > > In our product environment, there are some partitioned fact tables, which are > all quite huge. To accelerate join execution, we need make them also > bucketed. Than comes the problem, if the bucket number is large enough, there > may be too many files(files count = bucket number * partition count), which > may bring pressure to the HDFS. And if the bucket number is small, Spark will > launch equal number of tasks to read/write it. > > So, can we implement a new partition support range values, just like range > partition in Oracle/MySQL > ([https://docs.oracle.com/cd/E17952_01/mysql-5.7-en/partitioning-range.html]). > Say, we can partition by a date column, and make every two months as a > partition, or partitioned by a integer column, make interval of 1 as a > partition. > > Ideally, feature like range partition should be implemented in Hive. While, > it's been always hard to update Hive version in a prod environment, and much > lightweight and flexible if we implement it in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
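Until something like the proposed range partition exists, a common workaround is to derive a coarse range-bucket column at write time and partition by that, which keeps the directory count bounded without bucketing. The sketch below only illustrates that workaround, not the feature requested here; the column names, the two-month interval, and the output path are all invented for the example.
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object RangeBucketSketch extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("range-bucket-sketch")
    .getOrCreate()
  import spark.implicits._

  val events = Seq(("2018-01-15", 1L), ("2018-02-03", 2L), ("2018-05-20", 3L))
    .toDF("event_date", "amount")
    .withColumn("event_date", to_date($"event_date"))
    // Two-month buckets: month count divided by 2, so 2018-01 and 2018-02
    // land in the same partition directory.
    .withColumn("range_bucket",
      floor((year($"event_date") * 12 + month($"event_date") - 1) / 2))

  events.write
    .mode("overwrite")
    .partitionBy("range_bucket") // one directory per two-month range
    .parquet("/tmp/events_range_partitioned")

  spark.stop()
}
{code}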
[jira] [Updated] (SPARK-25411) Implement range partition in Spark
[ https://issues.apache.org/jira/browse/SPARK-25411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wang, Gang updated SPARK-25411: --- Description: In our PROD environment, there are some partitioned fact tables, which are all quite huge. To accelerate join execution, we need make them also bucketed. Than comes the problem, if the bucket number is large enough, there may be too many files(files count = bucket number * partition count), which may bring pressure to the HDFS. And if the bucket number is small, Spark will launch equal number of tasks to read/write it. So, can we implement a new partition support range values, just like range partition in Oracle/MySQL ([https://docs.oracle.com/cd/E17952_01/mysql-5.7-en/partitioning-range.html]). Say, we can partition by a date column, and make every two months as a partition, or partitioned by a integer column, make interval of 1 as a partition. Ideally, feature like range partition should be implemented in Hive. While, it's been always hard to update Hive version in a prod environment, and much lightweight and flexible if we implement it in Spark. was: In our PROD environment, there are some partitioned fact tables, which are all quite huge. To accelerate join execution, we need make them also bucketed. Than comes the problem, if the bucket number is large enough, there may be two many files(files count = bucket number * partition count), which may bring pressure to the HDFS. And if the bucket number is small, Spark will launch equal number of tasks to read/write it. So, can we implement a new partition support range values, just like range partition in Oracle/MySQL ([https://docs.oracle.com/cd/E17952_01/mysql-5.7-en/partitioning-range.html]). Say, we can partition by a date column, and make every two months as a partition, or partitioned by a integer column, make interval of 1 as a partition. Ideally, feature like range partition should be implemented in Hive. While, it's been always hard to update Hive version in a prod environment, and much lightweight and flexible if we implement it in Spark. > Implement range partition in Spark > -- > > Key: SPARK-25411 > URL: https://issues.apache.org/jira/browse/SPARK-25411 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wang, Gang >Priority: Major > > In our PROD environment, there are some partitioned fact tables, which are > all quite huge. To accelerate join execution, we need make them also > bucketed. Than comes the problem, if the bucket number is large enough, there > may be too many files(files count = bucket number * partition count), which > may bring pressure to the HDFS. And if the bucket number is small, Spark will > launch equal number of tasks to read/write it. > > So, can we implement a new partition support range values, just like range > partition in Oracle/MySQL > ([https://docs.oracle.com/cd/E17952_01/mysql-5.7-en/partitioning-range.html]). > Say, we can partition by a date column, and make every two months as a > partition, or partitioned by a integer column, make interval of 1 as a > partition. > > Ideally, feature like range partition should be implemented in Hive. While, > it's been always hard to update Hive version in a prod environment, and much > lightweight and flexible if we implement it in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25452) Query with clause is giving unexpected result in case of float coloumn
[ https://issues.apache.org/jira/browse/SPARK-25452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Anubhava updated SPARK-25452: --- Description: *Description* : Query with clause is giving unexpected result in case of float column {color:#d04437}*Query with filter less than equal to is giving in appropriate result{code}*{color} {code} 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (0,0.0); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (1,1.1); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >=0.0; +++--+ | a | b | +++--+ | 0 | 0.0 | | 1 | 1.10023841858 | +++--+ Query with filter less than equal to is giving in appropriate result 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <=1.1; ++--+--+ | a | b | ++--+--+ | 0 | 0.0 | ++--+--+ 1 row selected (0.299 seconds) {code} was: *Description* : Query with clause is giving unexpected result in case of float column {color:#d04437}*Query with filter less than equal to is giving in appropriate result{code}*{color} 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (0,0.0); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (1,1.1); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >=0.0; +++--+ | a | b | +++--+ | 0 | 0.0 | | 1 | 1.10023841858 | +++--+ Query with filter less than equal to is giving in appropriate result 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <=1.1; ++--+--+ | a | b | ++--+--+ | 0 | 0.0 | ++--+--+ 1 row selected (0.299 seconds) {code} > Query with clause is giving unexpected result in case of float coloumn > -- > > Key: SPARK-25452 > URL: https://issues.apache.org/jira/browse/SPARK-25452 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: *Spark 2.3.1* > *Hadoop 2.7.2* >Reporter: Ayush Anubhava >Priority: Major > > *Description* : Query with clause is giving unexpected result in case of > float column > > {color:#d04437}*Query with filter less than equal to is giving in appropriate > result{code}*{color} > {code} > 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values > (0,0.0); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values > (1,1.1); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >=0.0; > +++--+ > | a | b | > +++--+ > | 0 | 0.0 | > | 1 | 1.10023841858 | > +++--+ > Query with filter less than equal to is giving in appropriate result > 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <=1.1; > ++--+--+ > | a | b | > ++--+--+ > | 0 | 0.0 | > ++--+--+ > 1 row selected (0.299 seconds) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25452) Query with clause is giving unexpected result in case of float coloumn
[ https://issues.apache.org/jira/browse/SPARK-25452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Anubhava updated SPARK-25452: --- Description: *Description* : Query with clause is giving unexpected result in case of float column {color:#d04437}*Query with filter less than equal to is giving in appropriate result{code}*{color} 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (0,0.0); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (1,1.1); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >=0.0; +++--+ | a | b | +++--+ | 0 | 0.0 | | 1 | 1.10023841858 | +++--+ Query with filter less than equal to is giving in appropriate result 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <=1.1; ++--+--+ | a | b | ++--+--+ | 0 | 0.0 | ++--+--+ 1 row selected (0.299 seconds) {code} was: *Description* : Query with clause is giving unexpected result in case of float column Query with filter less than equal to is giving in appropriate result{code} 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (0,0.0); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (1,1.1); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >=0.0; +++--+ | a | b | +++--+ | 0 | 0.0 | | 1 | 1.10023841858 | +++--+ Query with filter less than equal to is giving in appropriate result 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <=1.1; ++--+--+ | a | b | ++--+--+ | 0 | 0.0 | ++--+--+ 1 row selected (0.299 seconds) {code} > Query with clause is giving unexpected result in case of float coloumn > -- > > Key: SPARK-25452 > URL: https://issues.apache.org/jira/browse/SPARK-25452 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: *Spark 2.3.1* > *Hadoop 2.7.2* >Reporter: Ayush Anubhava >Priority: Major > > *Description* : Query with clause is giving unexpected result in case of > float column > > {color:#d04437}*Query with filter less than equal to is giving in appropriate > result{code}*{color} > 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values > (0,0.0); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values > (1,1.1); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >=0.0; > +++--+ > | a | b | > +++--+ > | 0 | 0.0 | > | 1 | 1.10023841858 | > +++--+ > Query with filter less than equal to is giving in appropriate result > 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <=1.1; > ++--+--+ > | a | b | > ++--+--+ > | 0 | 0.0 | > ++--+--+ > 1 row selected (0.299 seconds) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25452) Query with clause is giving unexpected result in case of float coloumn
[ https://issues.apache.org/jira/browse/SPARK-25452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Anubhava updated SPARK-25452: --- Description: *Description* : Query with clause is giving unexpected result in case of float column Query with filter less than equal to is giving in appropriate result{code} 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (0,0.0); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (1,1.1); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >=0.0; +++--+ | a | b | +++--+ | 0 | 0.0 | | 1 | 1.10023841858 | +++--+ Query with filter less than equal to is giving in appropriate result 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <=1.1; ++--+--+ | a | b | ++--+--+ | 0 | 0.0 | ++--+--+ 1 row selected (0.299 seconds) {code} was: *Description* : Query with clause is giving unexpected result in case of float column {code:java} Query with filter less than equal to is giving in appropriate result{code} {code:java} 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (0,0.0); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (1,1.1); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >=0.0; +++--+ | a | b | +++--+ | 0 | 0.0 | | 1 | 1.10023841858 | +++--+ Query with filter less than equal to is giving in appropriate result 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <=1.1; ++--+--+ | a | b | ++--+--+ | 0 | 0.0 | ++--+--+ 1 row selected (0.299 seconds) {code} > Query with clause is giving unexpected result in case of float coloumn > -- > > Key: SPARK-25452 > URL: https://issues.apache.org/jira/browse/SPARK-25452 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: *Spark 2.3.1* > *Hadoop 2.7.2* >Reporter: Ayush Anubhava >Priority: Major > > *Description* : Query with clause is giving unexpected result in case of > float column > > Query with filter less than equal to is giving in appropriate result{code} > 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values > (0,0.0); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values > (1,1.1); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >=0.0; > +++--+ > | a | b | > +++--+ > | 0 | 0.0 | > | 1 | 1.10023841858 | > +++--+ > Query with filter less than equal to is giving in appropriate result > 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <=1.1; > ++--+--+ > | a | b | > ++--+--+ > | 0 | 0.0 | > ++--+--+ > 1 row selected (0.299 seconds) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25452) Query with clause is giving unexpected result in case of float coloumn
Ayush Anubhava created SPARK-25452: -- Summary: Query with clause is giving unexpected result in case of float coloumn Key: SPARK-25452 URL: https://issues.apache.org/jira/browse/SPARK-25452 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Environment: *Spark 2.3.1* *Hadoop 2.7.2* Reporter: Ayush Anubhava *Description* : Query with clause is giving unexpected result in case of float column {code:java} 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (0,0.0); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (1,1.1); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >=0.0; +++--+ | a | b | +++--+ | 0 | 0.0 | | 1 | 1.10023841858 | +++--+ Query with filter less than equal to is giving in appropriate result 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <=1.1; ++--+--+ | a | b | ++--+--+ | 0 | 0.0 | ++--+--+ 1 row selected (0.299 seconds) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25452) Query with clause is giving unexpected result in case of float coloumn
[ https://issues.apache.org/jira/browse/SPARK-25452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Anubhava updated SPARK-25452: --- Description: *Description* : Query with clause is giving unexpected result in case of float column {code:java} Query with filter less than equal to is giving in appropriate result{code} {code:java} 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (0,0.0); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (1,1.1); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >=0.0; +++--+ | a | b | +++--+ | 0 | 0.0 | | 1 | 1.10023841858 | +++--+ Query with filter less than equal to is giving in appropriate result 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <=1.1; ++--+--+ | a | b | ++--+--+ | 0 | 0.0 | ++--+--+ 1 row selected (0.299 seconds) {code} was: *Description* : Query with clause is giving unexpected result in case of float column {code:java} 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (0,0.0); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values (1,1.1); +-+--+ | Result | +-+--+ +-+--+ 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >=0.0; +++--+ | a | b | +++--+ | 0 | 0.0 | | 1 | 1.10023841858 | +++--+ Query with filter less than equal to is giving in appropriate result 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <=1.1; ++--+--+ | a | b | ++--+--+ | 0 | 0.0 | ++--+--+ 1 row selected (0.299 seconds) {code} > Query with clause is giving unexpected result in case of float coloumn > -- > > Key: SPARK-25452 > URL: https://issues.apache.org/jira/browse/SPARK-25452 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: *Spark 2.3.1* > *Hadoop 2.7.2* >Reporter: Ayush Anubhava >Priority: Major > > *Description* : Query with clause is giving unexpected result in case of > float column > > {code:java} > Query with filter less than equal to is giving in appropriate result{code} > {code:java} > 0: jdbc:hive2://10.18.18.214:23040/default> create table k2 ( a int, b float); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values > (0,0.0); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> insert into table k2 values > (1,1.1); > +-+--+ > | Result | > +-+--+ > +-+--+ > 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b >=0.0; > +++--+ > | a | b | > +++--+ > | 0 | 0.0 | > | 1 | 1.10023841858 | > +++--+ > Query with filter less than equal to is giving in appropriate result > 0: jdbc:hive2://10.18.18.214:23040/default> select * from k2 where b <=1.1; > ++--+--+ > | a | b | > ++--+--+ > | 0 | 0.0 | > ++--+--+ > 1 row selected (0.299 seconds) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18185) Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
[ https://issues.apache.org/jira/browse/SPARK-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618949#comment-16618949 ] Deepanker edited comment on SPARK-18185 at 9/18/18 11:20 AM: - What is the difference between this Jira and this one: https://issues.apache.org/jira/browse/SPARK-20236 I tested this out with spark 2.2 this only works for external tables not managed tables in hive? Any reason why is that? With 2.3 we can enable/disable this behaviour via this property: _spark.sql.sources.partitionOverwriteMode_ whereas previously it was default? was (Author: deepanker): What is the difference between this Jira and this one: https://issues.apache.org/jira/browse/SPARK-20236 I tested this out with spark 2.2 this only works for external tables not managed tables in hive? Any reason why is that? With 2.3 we can enable/disable this behaviour via this property: \{{spark.sql.sources.partitionOverwriteMode }} whereas previously it was default? > Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions > -- > > Key: SPARK-18185 > URL: https://issues.apache.org/jira/browse/SPARK-18185 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Major > Fix For: 2.1.0 > > > As of current 2.1, INSERT OVERWRITE with dynamic partitions against a > Datasource table will overwrite the entire table instead of only the updated > partitions as in Hive. It also doesn't respect custom partition locations. > We should delete only the proper partitions, scan the metastore for affected > partitions with custom locations, and ensure that deletes/writes go to the > right locations for those as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23081) Add colRegex API to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618955#comment-16618955 ] Hyukjin Kwon commented on SPARK-23081: -- SPARK-12139 would be the proper place to discuss this. If it is a question, please ask it on the mailing list to discuss further. > Add colRegex API to PySpark > --- > > Key: SPARK-23081 > URL: https://issues.apache.org/jira/browse/SPARK-23081 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Huaxin Gao >Priority: Major > Fix For: 2.3.0 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18185) Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
[ https://issues.apache.org/jira/browse/SPARK-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618949#comment-16618949 ] Deepanker edited comment on SPARK-18185 at 9/18/18 11:18 AM: - What is the difference between this Jira and this one: https://issues.apache.org/jira/browse/SPARK-20236 I tested this out with spark 2.2 this only works for external tables not managed tables in hive? Any reason why is that? With 2.3 we can enable/disable this behaviour via this property: \{{spark.sql.sources.partitionOverwriteMode }} whereas previously it was default? was (Author: deepanker): What is the difference between this Jira and this one: https://issues.apache.org/jira/browse/SPARK-20236 I tested this out with spark 2.2 this only works for external tables not managed tables in hive? Any reason why is that? > Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions > -- > > Key: SPARK-18185 > URL: https://issues.apache.org/jira/browse/SPARK-18185 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Major > Fix For: 2.1.0 > > > As of current 2.1, INSERT OVERWRITE with dynamic partitions against a > Datasource table will overwrite the entire table instead of only the updated > partitions as in Hive. It also doesn't respect custom partition locations. > We should delete only the proper partitions, scan the metastore for affected > partitions with custom locations, and ensure that deletes/writes go to the > right locations for those as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20236) Overwrite a partitioned data source table should only overwrite related partitions
[ https://issues.apache.org/jira/browse/SPARK-20236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618950#comment-16618950 ] Deepanker edited comment on SPARK-20236 at 9/18/18 11:17 AM: - What is the difference between this Jira and these ones: https://issues.apache.org/jira/browse/SPARK-18185, https://issues.apache.org/jira/browse/SPARK-18183 I tested this out with spark 2.2 (which confirms the fix was present before 2.3 as well) this only works for external tables not managed tables in hive? Any reason why is that? Now we can enable/disable this behaviour via this property: {{spark.sql.sources.partitionOverwriteMode }}whereas previously it was default? was (Author: deepanker): What is the difference between this Jira and these ones: https://issues.apache.org/jira/browse/SPARK-18185, https://issues.apache.org/jira/browse/SPARK-18183 I tested this out with spark 2.2 (which confirms the fix was present before 2.3 as well) this only works for external tables not managed tables in hive? Any reason why is that? > Overwrite a partitioned data source table should only overwrite related > partitions > -- > > Key: SPARK-20236 > URL: https://issues.apache.org/jira/browse/SPARK-20236 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: releasenotes > Fix For: 2.3.0 > > > When we overwrite a partitioned data source table, currently Spark will > truncate the entire table to write new data, or truncate a bunch of > partitions according to the given static partitions. > For example, {{INSERT OVERWRITE tbl ...}} will truncate the entire table, > {{INSERT OVERWRITE tbl PARTITION (a=1, b)}} will truncate all the partitions > that starts with {{a=1}}. > This behavior is kind of reasonable as we can know which partitions will be > overwritten before runtime. However, hive has a different behavior that it > only overwrites related partitions, e.g. {{INSERT OVERWRITE tbl SELECT > 1,2,3}} will only overwrite partition {{a=2, b=3}}, assuming {{tbl}} has only > one data column and is partitioned by {{a}} and {{b}}. > It seems better if we can follow hive's behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18185) Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
[ https://issues.apache.org/jira/browse/SPARK-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618949#comment-16618949 ] Deepanker edited comment on SPARK-18185 at 9/18/18 11:15 AM: - What is the difference between this Jira and this one: https://issues.apache.org/jira/browse/SPARK-20236 I tested this out with spark 2.2 this only works for external tables not managed tables in hive? Any reason why is that? was (Author: deepanker): What is the difference between this and this one: https://issues.apache.org/jira/browse/SPARK-20236 I tested this out with spark 2.2 this only works for external tables not managed tables in hive? Any reason why is that? > Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions > -- > > Key: SPARK-18185 > URL: https://issues.apache.org/jira/browse/SPARK-18185 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Major > Fix For: 2.1.0 > > > As of current 2.1, INSERT OVERWRITE with dynamic partitions against a > Datasource table will overwrite the entire table instead of only the updated > partitions as in Hive. It also doesn't respect custom partition locations. > We should delete only the proper partitions, scan the metastore for affected > partitions with custom locations, and ensure that deletes/writes go to the > right locations for those as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20236) Overwrite a partitioned data source table should only overwrite related partitions
[ https://issues.apache.org/jira/browse/SPARK-20236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618950#comment-16618950 ] Deepanker commented on SPARK-20236: --- What is the difference between this Jira and these ones: https://issues.apache.org/jira/browse/SPARK-18185 and https://issues.apache.org/jira/browse/SPARK-18183 ? I tested this out with Spark 2.2 (which confirms the fix was present before 2.3 as well), and it only works for external tables, not managed tables in Hive. Is there a reason for that? > Overwrite a partitioned data source table should only overwrite related > partitions > -- > > Key: SPARK-20236 > URL: https://issues.apache.org/jira/browse/SPARK-20236 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: releasenotes > Fix For: 2.3.0 > > > When we overwrite a partitioned data source table, currently Spark will > truncate the entire table to write new data, or truncate a bunch of > partitions according to the given static partitions. > For example, {{INSERT OVERWRITE tbl ...}} will truncate the entire table, > {{INSERT OVERWRITE tbl PARTITION (a=1, b)}} will truncate all the partitions > that starts with {{a=1}}. > This behavior is kind of reasonable as we can know which partitions will be > overwritten before runtime. However, hive has a different behavior that it > only overwrites related partitions, e.g. {{INSERT OVERWRITE tbl SELECT > 1,2,3}} will only overwrite partition {{a=2, b=3}}, assuming {{tbl}} has only > one data column and is partitioned by {{a}} and {{b}}. > It seems better if we can follow hive's behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18185) Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
[ https://issues.apache.org/jira/browse/SPARK-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618949#comment-16618949 ] Deepanker commented on SPARK-18185: --- What is the difference between this Jira and SPARK-20236 (https://issues.apache.org/jira/browse/SPARK-20236)? I tested this out with Spark 2.2; it only works for external tables, not managed tables in Hive. Any reason why that is? > Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions > -- > > Key: SPARK-18185 > URL: https://issues.apache.org/jira/browse/SPARK-18185 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Major > Fix For: 2.1.0 > > > As of current 2.1, INSERT OVERWRITE with dynamic partitions against a > Datasource table will overwrite the entire table instead of only the updated > partitions as in Hive. It also doesn't respect custom partition locations. > We should delete only the proper partitions, scan the metastore for affected > partitions with custom locations, and ensure that deletes/writes go to the > right locations for those as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22036) BigDecimal multiplication sometimes returns null
[ https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618892#comment-16618892 ] Marco Gaido commented on SPARK-22036: - [~bersprockets] first of all, thank you for reporting this, and sorry for my mistake here. I don't think the solution you are suggesting is the right one; the result in the allowPrecisionLoss=true case should not be truncated at all. The real problem is the way we handle negative scale, so I think this issue is related to SPARK-24468: Hive and MSSQL, which we take our rules from, do not allow negative scale, while we do, so this has to be revisited. Could you please file a new JIRA for this? Meanwhile I am starting to work on it and will submit a fix ASAP. Sorry for the trouble. Thanks. > BigDecimal multiplication sometimes returns null > > > Key: SPARK-22036 > URL: https://issues.apache.org/jira/browse/SPARK-22036 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Olivier Blanvillain >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.0 > > > The multiplication of two BigDecimal numbers sometimes returns null. Here is > a minimal reproduction: > {code:java} > object Main extends App { > import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.sql.SparkSession > import spark.implicits._ > val conf = new > SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", > "false") > val spark = > SparkSession.builder().config(conf).appName("REPL").getOrCreate() > implicit val sqlContext = spark.sqlContext > case class X2(a: BigDecimal, b: BigDecimal) > val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), > BigDecimal(-1000.1)))) > val result = ds.select(ds("a") * ds("b")).collect.head > println(result) // [null] > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
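For readers following the precision discussion, a rough sketch (not Spark's actual implementation) of the Hive/SQL Server style rule referenced above: for d1 * d2 the result type is precision = p1 + p2 + 1 and scale = s1 + s2, capped at 38 digits. Scala BigDecimal fields are encoded with the system default decimal(38, 18), which is what makes the reproduction overflow.
{code:java}
// Sketch of the multiplication result-type rule, assuming a 38-digit cap.
val MaxPrecision = 38

def multiplyResultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) =
  (math.min(p1 + p2 + 1, MaxPrecision), s1 + s2)

// Both operands are decimal(38, 18), so the product is typed decimal(38, 36):
// only two integral digits remain, the actual value (~126.75) needs three,
// and the result overflows to null instead of being rounded.
println(multiplyResultType(38, 18, 38, 18)) // (38,36)
{code}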
[jira] [Commented] (SPARK-23081) Add colRegex API to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618874#comment-16618874 ] Darrell Taylor commented on SPARK-23081: I tend to agree; I'm unsure why this was added, as it's easily done in PySpark. But my main reason to comment is that the implementation feels incorrect: I'm unable to chain functions together and need to reference the DataFrame. e.g. ``` spark.table('xyz').colRegex('foobar').printSchema() ``` feels like the natural way to use it, but I have to do it in two parts... ``` df = spark.table('xyz') df.select(df.colRegex('foobar')).printSchema() ``` I don't think any of the other DataFrame functions work like this? > Add colRegex API to PySpark > --- > > Key: SPARK-23081 > URL: https://issues.apache.org/jira/browse/SPARK-23081 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Huaxin Gao >Priority: Major > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
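For reference, a minimal Scala sketch of the two-step pattern described above (the PySpark API mirrors it); the DataFrame and the {{foo.*}} pattern are made up for illustration.
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("colRegex-demo").getOrCreate()
val df = spark.range(3).selectExpr("id AS foo1", "id * 2 AS foo2", "id AS other")

// colRegex returns a Column bound to this specific DataFrame, so it has to be
// passed to select() on the same DataFrame rather than chained off spark.table(...).
// The regex goes between backticks; here it picks every column starting with "foo".
df.select(df.colRegex("`foo.*`")).printSchema()
{code}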
[jira] [Comment Edited] (SPARK-25451) Stages page doesn't show the right number of the total tasks
[ https://issues.apache.org/jira/browse/SPARK-25451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618837#comment-16618837 ] zuotingbing edited comment on SPARK-25451 at 9/18/18 10:11 AM: --- yes, thanks [~yumwang] was (Author: zuo.tingbing9): yes, thanks > Stages page doesn't show the right number of the total tasks > > > Key: SPARK-25451 > URL: https://issues.apache.org/jira/browse/SPARK-25451 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: zuotingbing >Priority: Major > Attachments: mshot.png > > > > See the attached pic. > !mshot.png! > The executor 1 has 7 tasks, but in the Stages Page the total tasks of > executor is 6. > > to reproduce this simply start a shell: > {code:java} > $SPARK_HOME/bin/spark-shell --executor-cores 1 --executor-memory 1g > --total-executor-cores 2 --master spark://localhost.localdomain:7077{code} > Run job as fellows: > {code:java} > sc.parallelize(1 to 1, 3).map{ x => throw new RuntimeException("Bad > executor")}.collect() {code} > > Go to the stages page and you will see the Total Tasks is not right in > {code:java} > Aggregated Metrics by Executor{code} > table. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25451) Stages page doesn't show the right number of the total tasks
[ https://issues.apache.org/jira/browse/SPARK-25451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-25451: Target Version/s: (was: 2.3.1) > Stages page doesn't show the right number of the total tasks > > > Key: SPARK-25451 > URL: https://issues.apache.org/jira/browse/SPARK-25451 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: zuotingbing >Priority: Major > Attachments: mshot.png > > > > See the attached pic. > !mshot.png! > The executor 1 has 7 tasks, but in the Stages Page the total tasks of > executor is 6. > > to reproduce this simply start a shell: > {code:java} > $SPARK_HOME/bin/spark-shell --executor-cores 1 --executor-memory 1g > --total-executor-cores 2 --master spark://localhost.localdomain:7077{code} > Run job as fellows: > {code:java} > sc.parallelize(1 to 1, 3).map{ x => throw new RuntimeException("Bad > executor")}.collect() {code} > > Go to the stages page and you will see the Total Tasks is not right in > {code:java} > Aggregated Metrics by Executor{code} > table. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25451) Stages page doesn't show the right number of the total tasks
[ https://issues.apache.org/jira/browse/SPARK-25451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618837#comment-16618837 ] zuotingbing commented on SPARK-25451: - yes, thanks > Stages page doesn't show the right number of the total tasks > > > Key: SPARK-25451 > URL: https://issues.apache.org/jira/browse/SPARK-25451 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: zuotingbing >Priority: Major > Attachments: mshot.png > > > > See the attached pic. > !mshot.png! > The executor 1 has 7 tasks, but in the Stages Page the total tasks of > executor is 6. > > to reproduce this simply start a shell: > {code:java} > $SPARK_HOME/bin/spark-shell --executor-cores 1 --executor-memory 1g > --total-executor-cores 2 --master spark://localhost.localdomain:7077{code} > Run job as fellows: > {code:java} > sc.parallelize(1 to 1, 3).map{ x => throw new RuntimeException("Bad > executor")}.collect() {code} > > Go to the stages page and you will see the Total Tasks is not right in > {code:java} > Aggregated Metrics by Executor{code} > table. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25451) Stages page doesn't show the right number of the total tasks
[ https://issues.apache.org/jira/browse/SPARK-25451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618761#comment-16618761 ] Yuming Wang edited comment on SPARK-25451 at 9/18/18 9:26 AM: -- Please avoid setting the {{Target Version/s}} field, which is usually reserved for committers. was (Author: q79969786): Please avoid to set the Target Version/s which is usually reserved for committers. > Stages page doesn't show the right number of the total tasks > > > Key: SPARK-25451 > URL: https://issues.apache.org/jira/browse/SPARK-25451 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: zuotingbing >Priority: Major > Attachments: mshot.png > > > > See the attached pic. > !mshot.png! > The executor 1 has 7 tasks, but in the Stages Page the total tasks of > executor is 6. > > to reproduce this simply start a shell: > {code:java} > $SPARK_HOME/bin/spark-shell --executor-cores 1 --executor-memory 1g > --total-executor-cores 2 --master spark://localhost.localdomain:7077{code} > Run job as fellows: > {code:java} > sc.parallelize(1 to 1, 3).map{ x => throw new RuntimeException("Bad > executor")}.collect() {code} > > Go to the stages page and you will see the Total Tasks is not right in > {code:java} > Aggregated Metrics by Executor{code} > table. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25451) Stages page doesn't show the right number of the total tasks
[ https://issues.apache.org/jira/browse/SPARK-25451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618761#comment-16618761 ] Yuming Wang commented on SPARK-25451: - Please avoid setting the Target Version/s field, which is usually reserved for committers. > Stages page doesn't show the right number of the total tasks > > > Key: SPARK-25451 > URL: https://issues.apache.org/jira/browse/SPARK-25451 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: zuotingbing >Priority: Major > Attachments: mshot.png > > > > See the attached pic. > !mshot.png! > The executor 1 has 7 tasks, but in the Stages Page the total tasks of > executor is 6. > > to reproduce this simply start a shell: > {code:java} > $SPARK_HOME/bin/spark-shell --executor-cores 1 --executor-memory 1g > --total-executor-cores 2 --master spark://localhost.localdomain:7077{code} > Run job as fellows: > {code:java} > sc.parallelize(1 to 1, 3).map{ x => throw new RuntimeException("Bad > executor")}.collect() {code} > > Go to the stages page and you will see the Total Tasks is not right in > {code:java} > Aggregated Metrics by Executor{code} > table. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25451) Stages page doesn't show the right number of the total tasks
[ https://issues.apache.org/jira/browse/SPARK-25451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-25451: Description: See the attached pic. !mshot.png! Executor 1 has 7 tasks, but on the Stages page the total tasks for the executor is 6. To reproduce this, simply start a shell: {code:java} $SPARK_HOME/bin/spark-shell --executor-cores 1 --executor-memory 1g --total-executor-cores 2 --master spark://localhost.localdomain:7077{code} Run a job as follows: {code:java} sc.parallelize(1 to 1, 3).map{ x => throw new RuntimeException("Bad executor")}.collect() {code} Go to the Stages page and you will see the Total Tasks count is not right in the {code:java} Aggregated Metrics by Executor{code} table. was: See the attached pic. !image-2018-09-18-16-35-09-548.png! The executor 1 has 7 tasks, but in the Stages Page the total tasks of executor is 6. to reproduce this simply start a shell: $SPARK_HOME/bin/spark-shell --executor-cores 1 --executor-memory 1g --total-executor-cores 2 --master spark://localhost.localdomain:7077 Run job as fellows: {code:java} sc.parallelize(1 to 1, 3).map{ x => throw new RuntimeException("Bad executor")}.collect() {code} Go to the stages page and you will see the Total Tasks is not right in {code:java} Aggregated Metrics by Executor{code} table. > Stages page doesn't show the right number of the total tasks > > > Key: SPARK-25451 > URL: https://issues.apache.org/jira/browse/SPARK-25451 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: zuotingbing >Priority: Major > Attachments: mshot.png > > > > See the attached pic. > !mshot.png! > The executor 1 has 7 tasks, but in the Stages Page the total tasks of > executor is 6. > > to reproduce this simply start a shell: > {code:java} > $SPARK_HOME/bin/spark-shell --executor-cores 1 --executor-memory 1g > --total-executor-cores 2 --master spark://localhost.localdomain:7077{code} > Run job as fellows: > {code:java} > sc.parallelize(1 to 1, 3).map{ x => throw new RuntimeException("Bad > executor")}.collect() {code} > > Go to the stages page and you will see the Total Tasks is not right in > {code:java} > Aggregated Metrics by Executor{code} > table. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25451) Stages page doesn't show the right number of the total tasks
[ https://issues.apache.org/jira/browse/SPARK-25451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-25451: Description: See the attached pic. !image-2018-09-18-16-35-09-548.png! The executor 1 has 7 tasks, but in the Stages Page the total tasks of executor is 6. to reproduce this simply start a shell: $SPARK_HOME/bin/spark-shell --executor-cores 1 --executor-memory 1g --total-executor-cores 2 --master spark://localhost.localdomain:7077 Run job as fellows: {code:java} sc.parallelize(1 to 1, 3).map{ x => throw new RuntimeException("Bad executor")}.collect() {code} Go to the stages page and you will see the Total Tasks is not right in {code:java} Aggregated Metrics by Executor{code} table. was: See the attached pic. !image-2018-09-18-16-35-09-548.png! The executor 1 has 7 tasks, but in the Stages Page the total tasks of executor is 6. to reproduce this simply start a shell: $SPARK_HOME/bin/spark-shell --executor-cores 1 --executor-memory 1g --total-executor-cores 2 --master spark://localhost.localdomain:7077 Run job as fellows: {code:java} sc.parallelize(1 to 1, 3).map{ x => throw new RuntimeException("Bad executor")}.collect() {code} Go to the stages page and you will see the Total Tasks is not right in {code:java} Aggregated Metrics by Executor{code} table. > Stages page doesn't show the right number of the total tasks > > > Key: SPARK-25451 > URL: https://issues.apache.org/jira/browse/SPARK-25451 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: zuotingbing >Priority: Major > Attachments: mshot.png > > > > See the attached pic. > > !image-2018-09-18-16-35-09-548.png! > The executor 1 has 7 tasks, but in the Stages Page the total tasks of > executor is 6. > > to reproduce this simply start a shell: > $SPARK_HOME/bin/spark-shell --executor-cores 1 --executor-memory 1g > --total-executor-cores 2 --master spark://localhost.localdomain:7077 > Run job as fellows: > > > {code:java} > sc.parallelize(1 to 1, 3).map{ x => throw new RuntimeException("Bad > executor")}.collect() {code} > > Go to the stages page and you will see the Total Tasks is not right in > {code:java} > Aggregated Metrics by Executor{code} > table. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25451) Stages page doesn't show the right number of the total tasks
[ https://issues.apache.org/jira/browse/SPARK-25451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-25451: Attachment: mshot.png > Stages page doesn't show the right number of the total tasks > > > Key: SPARK-25451 > URL: https://issues.apache.org/jira/browse/SPARK-25451 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: zuotingbing >Priority: Major > Attachments: mshot.png > > > > See the attached pic. > !image-2018-09-18-16-35-09-548.png! > The executor 1 has 7 tasks, but in the Stages Page the total tasks of > executor is 6. > > to reproduce this simply start a shell: > $SPARK_HOME/bin/spark-shell --executor-cores 1 --executor-memory 1g > --total-executor-cores 2 --master spark://localhost.localdomain:7077 > Run job as fellows: > > > {code:java} > sc.parallelize(1 to 1, 3).map{ x => throw new RuntimeException("Bad > executor")}.collect() {code} > > Go to the stages page and you will see the Total Tasks is not right in > {code:java} > Aggregated Metrics by Executor{code} > table. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25451) Stages page doesn't show the right number of the total tasks
zuotingbing created SPARK-25451: --- Summary: Stages page doesn't show the right number of the total tasks Key: SPARK-25451 URL: https://issues.apache.org/jira/browse/SPARK-25451 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.3.1 Reporter: zuotingbing See the attached pic. !image-2018-09-18-16-35-09-548.png! Executor 1 has 7 tasks, but on the Stages page the total tasks for the executor is 6. To reproduce this, simply start a shell: $SPARK_HOME/bin/spark-shell --executor-cores 1 --executor-memory 1g --total-executor-cores 2 --master spark://localhost.localdomain:7077 Run a job as follows: {code:java} sc.parallelize(1 to 1, 3).map{ x => throw new RuntimeException("Bad executor")}.collect() {code} Go to the Stages page and you will see the Total Tasks count is not right in the {code:java} Aggregated Metrics by Executor{code} table. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4
[ https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618588#comment-16618588 ] Liang-Chi Hsieh commented on SPARK-25378: - Hmm... have we decided to include a fix in 2.4? > ArrayData.toArray(StringType) assume UTF8String in 2.4 > -- > > Key: SPARK-25378 > URL: https://issues.apache.org/jira/browse/SPARK-25378 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Critical > > The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT: > {code} > import org.apache.spark.sql.catalyst.util._ > import org.apache.spark.sql.types.StringType > ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType) > res0: Array[String] = Array(a, b) > {code} > In 2.4.0-SNAPSHOT, the error is > {code}java.lang.ClassCastException: java.lang.String cannot be cast to > org.apache.spark.unsafe.types.UTF8String > at > org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75) > at > org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136) > at > org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136) > at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178) > ... 51 elided > {code} > cc: [~cloud_fan] [~yogeshg] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
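While that decision is pending, one possible workaround sketch (it relies on internal, non-public APIs, so treat it as an assumption rather than an endorsed fix) is to keep the elements in Spark's internal UTF8String representation, so the StringType accessor in 2.4 finds the type it expects.
{code:java}
import org.apache.spark.sql.catalyst.util._
import org.apache.spark.sql.types.StringType
import org.apache.spark.unsafe.types.UTF8String

// Store UTF8String values up front so GenericArrayData.getUTF8String succeeds,
// then convert back to java.lang.String at the edges.
val data = ArrayData.toArrayData(Array("a", "b").map(s => UTF8String.fromString(s)))
val strings: Array[String] = data.toArray[UTF8String](StringType).map(_.toString)
// strings: Array[String] = Array(a, b)
{code}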