[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

[ https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chenliang updated SPARK-26543:
------------------------------
Description:
In Spark SQL, when adaptive execution is enabled via 'set spark.sql.adaptive.enabled=true', the ExchangeCoordinator is introduced to determine the number of post-shuffle partitions. Under certain conditions, however, the coordinator does not perform well: some tasks are always retained and run with a Shuffle Read Size / Records of 0.0B/0. Increasing spark.sql.adaptive.shuffle.targetPostShuffleInputSize can work around this, but that is unreasonable, since targetPostShuffleInputSize should not be set too large. As shown below:

!image-2019-01-05-13-18-30-487.png!

The ExchangeCoordinator could filter out these useless (0B) partitions automatically.

was: the same description, with !screenshot-1.png! in place of !image-2019-01-05-13-18-30-487.png!

> Support the coordinator to determine post-shuffle partitions more reasonably
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-26543
>                 URL: https://issues.apache.org/jira/browse/SPARK-26543
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>            Reporter: chenliang
>            Priority: Major
>             Fix For: 2.3.0
>
>         Attachments: image-2019-01-05-13-18-30-487.png

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
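The behavior the report describes, and the proposed fix, can be sketched outside Spark. The following is an illustrative model only, not Spark's actual ExchangeCoordinator code: adjacent pre-shuffle partitions are merged until the running total would exceed the target input size (the role of targetPostShuffleInputSize), and the hypothetical `skip_empty` flag stands in for the proposed filtering of 0-byte partitions.

```python
# Illustrative model of post-shuffle partition coalescing (not Spark source).
# Each pre-shuffle partition has a byte size; adjacent partitions are merged
# until the running total would exceed the target input size.

def coalesce_partitions(sizes, target, skip_empty=False):
    """Return lists of pre-shuffle partition indices, one list per post-shuffle task."""
    if skip_empty:
        # Proposed behavior: drop 0-byte partitions before grouping, so no
        # post-shuffle task ends up reading 0.0B/0 records.
        indexed = [(i, s) for i, s in enumerate(sizes) if s > 0]
    else:
        indexed = list(enumerate(sizes))
    groups, current, current_bytes = [], [], 0
    for i, s in indexed:
        if current and current_bytes + s > target:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(i)
        current_bytes += s
    if current:
        groups.append(current)
    return groups

sizes = [64, 0, 0, 30, 40, 0, 100]  # bytes per pre-shuffle partition
print(coalesce_partitions(sizes, target=70))
print(coalesce_partitions(sizes, target=70, skip_empty=True))
```

With filtering enabled, the empty partitions never reach a post-shuffle task, so raising the target size just to absorb them becomes unnecessary, which is the point of the proposal.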
[jira] [Updated] (SPARK-26544) escape string when serialize map/array make it a valid json (keep alignment with hive)

[ https://issues.apache.org/jira/browse/SPARK-26544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

EdisonWang updated SPARK-26544:
-------------------------------
Summary: escape string when serialize map/array make it a valid json (keep alignment with hive)  (was: the string serialized from map/array type is not a valid json (while hive is))

> escape string when serialize map/array make it a valid json (keep alignment with hive)
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-26544
>                 URL: https://issues.apache.org/jira/browse/SPARK-26544
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: EdisonWang
>            Priority: Major
>
> When reading a hive table with a map/array-typed column, the string serialized by the Spark thrift server is not valid json, while Hive's is. For example, selecting a field whose type is map, the Spark thrift server returns
>
> {code:java}
> {"author_id":"123","log_pb":"{"impr_id":"20181231"}","request_id":"001"}
> {code}
>
> while the Hive thrift server returns
>
> {code:java}
> {"author_id":"123", "log_pb":"{\"impr_id\":\"20181231\"}","request_id":"001"}
> {code}
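The difference between the two outputs is whether the quotes inside the embedded string value are escaped. A minimal Python sketch (using only the stdlib `json` module, with the example values from the report) shows why the unescaped form is not parseable while the Hive-style escaped form round-trips:

```python
import json

# Inner value: a string field whose content is itself JSON text.
inner = '{"impr_id":"20181231"}'
row = {"author_id": "123", "log_pb": inner, "request_id": "001"}

# Naive concatenation without escaping the inner quotes -> invalid JSON,
# analogous to the Spark thrift server output in the report.
naive = '{"author_id":"123","log_pb":"%s","request_id":"001"}' % inner
try:
    json.loads(naive)
    valid = True
except json.JSONDecodeError:
    valid = False
print("naive output is valid JSON:", valid)  # False

# A proper serializer escapes the embedded quotes (\"), like Hive's output.
escaped = json.dumps(row)
print(escaped)
print("escaped output round-trips:", json.loads(escaped)["log_pb"] == inner)  # True
```

The fix the issue title asks for amounts to the second path: escape the serialized map/array string before embedding it, so consumers can feed the result to any JSON parser.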
[jira] [Updated] (SPARK-26544) escape string when serialize map/array to make it a valid json (keep alignment with hive)

EdisonWang updated SPARK-26544:
-------------------------------
Summary: escape string when serialize map/array to make it a valid json (keep alignment with hive)  (was: escape string when serialize map/array make it a valid json (keep alignment with hive))
[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang updated SPARK-26543:
------------------------------
Attachment: image-2019-01-05-13-18-30-487.png
[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang updated SPARK-26543:
------------------------------
Attachment: screenshot-1.png
[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang updated SPARK-26543:
------------------------------
Attachment: (was: screenshot-1.png)
[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang updated SPARK-26543:
------------------------------
Description: removed the {noformat} wrapping around the embedded !screenshot-1.png! reference.
[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang updated SPARK-26543:
------------------------------
Description: added !screenshot-1.png! (wrapped in {noformat} tags).
[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang updated SPARK-26543:
------------------------------
Description: removed the embedded screenshot references (!screenshot-1.png! and !15_24_38__12_27_2018.jpg!).
[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang updated SPARK-26543:
------------------------------
Description: added !screenshot-1.png! alongside the existing !15_24_38__12_27_2018.jpg! reference.
[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang updated SPARK-26543:
------------------------------
Attachment: (was: 15_24_38__12_27_2018.jpg)
[jira] [Created] (SPARK-26544) the string serialized from map/array type is not a valid json (while hive is)

EdisonWang created SPARK-26544:
-------------------------------
             Summary: the string serialized from map/array type is not a valid json (while hive is)
                 Key: SPARK-26544
                 URL: https://issues.apache.org/jira/browse/SPARK-26544
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: EdisonWang
[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang updated SPARK-26543:
------------------------------
Description: re-saved (no substantive change).
[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang updated SPARK-26543:
------------------------------
Description: added the embedded screenshot !15_24_38__12_27_2018.jpg!.
[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang updated SPARK-26543:
------------------------------
Attachment: 15_24_38__12_27_2018.jpg
[jira] [Created] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang created SPARK-26543:
------------------------------
             Summary: Support the coordinator to determine post-shuffle partitions more reasonably
                 Key: SPARK-26543
                 URL: https://issues.apache.org/jira/browse/SPARK-26543
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.2, 2.3.1, 2.3.0, 2.2.2, 2.2.1, 2.2.0
            Reporter: chenliang
             Fix For: 2.3.0
[jira] [Created] (SPARK-26542) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang created SPARK-26542:
------------------------------
             Summary: Support the coordinator to determine post-shuffle partitions more reasonably
                 Key: SPARK-26542
                 URL: https://issues.apache.org/jira/browse/SPARK-26542
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.2, 2.3.1, 2.3.0, 2.2.2, 2.2.1, 2.2.0
            Reporter: chenliang
             Fix For: 2.3.0
[jira] [Resolved] (SPARK-26536) Upgrade Mockito to 2.23.4
[ https://issues.apache.org/jira/browse/SPARK-26536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-26536. --- Resolution: Fixed Assignee: Dongjoon Hyun (was: Apache Spark) Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/23452 > Upgrade Mockito to 2.23.4 > - > > Key: SPARK-26536 > URL: https://issues.apache.org/jira/browse/SPARK-26536 > Project: Spark > Issue Type: Sub-task > Components: Build, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > This issue upgrades Mockito from 1.10.19 to 2.23.4. The following changes are > required. > - Replace `org.mockito.Matchers` with `org.mockito.ArgumentMatchers` > - Replace `anyObject` with `any` > - Replace `getArgumentAt` with `getArgument` and add a type annotation. > - Use the `isNull` matcher when `null` is passed. > {code} > saslHandler.channelInactive(null); > -verify(handler).channelInactive(any(TransportClient.class)); > +verify(handler).channelInactive(isNull()); > {code} > - Make and use a `doReturn` wrapper to avoid > [SI-4775|https://issues.scala-lang.org/browse/SI-4775] > {code} > private def doReturn(value: Any) = org.mockito.Mockito.doReturn(value, > Seq.empty: _*) > {code}
[jira] [Assigned] (SPARK-26541) Add `-Pdocker-integration-tests` to `dev/scalastyle`
[ https://issues.apache.org/jira/browse/SPARK-26541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26541: Assignee: Apache Spark > Add `-Pdocker-integration-tests` to `dev/scalastyle` > > > Key: SPARK-26541 > URL: https://issues.apache.org/jira/browse/SPARK-26541 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > This issue makes `scalastyle` check the `docker-integration-tests` module and > fixes one error.
[jira] [Assigned] (SPARK-26541) Add `-Pdocker-integration-tests` to `dev/scalastyle`
[ https://issues.apache.org/jira/browse/SPARK-26541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26541: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-26541) Add `-Pdocker-integration-tests` to `dev/scalastyle`
Dongjoon Hyun created SPARK-26541: - Summary: Add `-Pdocker-integration-tests` to `dev/scalastyle` Key: SPARK-26541 URL: https://issues.apache.org/jira/browse/SPARK-26541 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.0 Reporter: Dongjoon Hyun This issue makes `scalastyle` check the `docker-integration-tests` module and fixes one error.
[jira] [Updated] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26540: -- Affects Version/s: 2.0.2 2.1.3 > Requirement failed when reading numeric arrays from PostgreSQL > -- > > Key: SPARK-26540 > URL: https://issues.apache.org/jira/browse/SPARK-26540 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.2, 2.4.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > This bug was reported in spark-user: > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html > To reproduce this; > {code} > // Creates a table in a PostgreSQL shell > postgres=# CREATE TABLE t (v numeric[], d numeric); > CREATE TABLE > postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555); > INSERT 0 1 > postgres=# SELECT * FROM t; > v |d > -+-- > {.222,.332} | 222.4555 > (1 row) > postgres=# \d t > Table "public.t" > Column | Type| Modifiers > +---+--- > v | numeric[] | > d | numeric | > // Then, reads it in Spark > ./bin/spark-shell --jars=postgresql-42.2.4.jar -v > scala> import java.util.Properties > scala> val options = new Properties(); > scala> options.setProperty("driver", "org.postgresql.Driver") > scala> options.setProperty("user", "maropu") > scala> options.setProperty("password", "") > scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options) > scala> pgTable.printSchema > root > |-- v: array (nullable = true) > ||-- element: decimal(0,0) (containsNull = true) > |-- d: decimal(38,18) (nullable = true) > scala> pgTable.show > 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 > exceeds max precision 0 > at scala.Predef$.require(Predef.scala:281) > at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116) > at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465) > ... 
> {code} > I looked over the related code, and I think we need more logic to handle > numeric arrays: > https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
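The failure above happens because the unconstrained PostgreSQL `numeric[]` element type is mapped to `decimal(0,0)`, so any actual value then exceeds the declared precision. The following is a hypothetical sketch of the shape such a fix could take, with illustrative names (`DecimalBounds`, `boundedDecimal`), not the actual PostgresDialect change: when JDBC metadata reports precision 0, fall back to Spark's system-default `decimal(38,18)` rather than the unusable `decimal(0,0)`.

```java
// Hypothetical sketch (illustrative names, not the actual PostgresDialect fix):
// when JDBC metadata reports precision 0 for an unconstrained NUMERIC, fall
// back to a system-default decimal(38,18) instead of the decimal(0,0) that
// triggers "Decimal precision 4 exceeds max precision 0".
public class DecimalBounds {
    static final int MAX_PRECISION = 38; // Spark's maximum decimal precision
    static final int DEFAULT_SCALE = 18; // scale of the decimal(38,18) default

    // Returns {precision, scale} for a decimal element type.
    static int[] boundedDecimal(int precision, int scale) {
        if (precision <= 0) {
            // Unconstrained numeric: metadata gives no usable bound,
            // so use the system default.
            return new int[]{MAX_PRECISION, DEFAULT_SCALE};
        }
        return new int[]{Math.min(precision, MAX_PRECISION),
                         Math.min(scale, MAX_PRECISION)};
    }

    public static void main(String[] args) {
        int[] t = boundedDecimal(0, 0); // what the repro's numeric[] reports
        System.out.println(t[0] + "," + t[1]); // prints 38,18
    }
}
```

This also matches the v1.6 behavior reported below, where the same column was read as `decimal(38,18)`.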
[jira] [Assigned] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26540: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26540: Assignee: Apache Spark
[jira] [Commented] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734731#comment-16734731 ] Dongjoon Hyun commented on SPARK-26540: --- Thank you for checking, [~maropu]. I made a PR and will add a PostgreSQL integration test for this. - https://github.com/apache/spark/pull/23458
[jira] [Comment Edited] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734727#comment-16734727 ] Takeshi Yamamuro edited comment on SPARK-26540 at 1/5/19 1:42 AM: -- I checked: the query failed in v2.0 but passed in v1.6. {code} // spark-1.6.3 scala> val pgTable = sqlContext.read.jdbc("jdbc:postgresql:postgres", "t", options) scala> pgTable.printSchema root |-- v: array (nullable = true) ||-- element: decimal(38,18) (containsNull = true) |-- d: decimal(38,18) (nullable = true) scala> pgTable.show +++ | v| d| +++ |[.222...|222.45550...| +++ {code} was (Author: maropu): yea, the workaround is good. The query failed in v2.0, but passed in v1.6. {code} // spark-1.6.3 scala> val pgTable = sqlContext.read.jdbc("jdbc:postgresql:postgres", "t", options) scala> pgTable.printSchema root |-- v: array (nullable = true) ||-- element: decimal(38,18) (containsNull = true) |-- d: decimal(38,18) (nullable = true) scala> pgTable.show +++ | v| d| +++ |[.222...|222.45550...| +++ {code}
[jira] [Comment Edited] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734719#comment-16734719 ] Dongjoon Hyun edited comment on SPARK-26540 at 1/5/19 1:34 AM: --- Given that Apache Spark handles the decimal type with precision and scale, the reported case looks like a corner case. was (Author: dongjoon): Given that Apache Spark handles the decimal type with precision and scale, the reported case looks like a corner case. And, I believe this will be a minor improvement issue to cover that corner case.
[jira] [Commented] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734727#comment-16734727 ] Takeshi Yamamuro commented on SPARK-26540: -- yea, the workaround is good. The query failed in v2.0, but passed in v1.6. {code} // spark-1.6.3 scala> val pgTable = sqlContext.read.jdbc("jdbc:postgresql:postgres", "t", options) scala> pgTable.printSchema root |-- v: array (nullable = true) ||-- element: decimal(38,18) (containsNull = true) |-- d: decimal(38,18) (nullable = true) scala> pgTable.show +++ | v| d| +++ |[.222...|222.45550...| +++ {code}
[jira] [Comment Edited] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734719#comment-16734719 ] Dongjoon Hyun edited comment on SPARK-26540 at 1/5/19 1:41 AM: --- Given that Apache Spark handles the decimal type with precision and scale, the reported case looks like a corner case. Sorry, I found that the original email also described this situation. I missed that. Yep, I agree. I'll make a PR for this one. was (Author: dongjoon): Given that Apache Spark handles the decimal type with precision and scale, the reported case looks like a corner case.
[jira] [Updated] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26540: -- Issue Type: Bug (was: Improvement)
[jira] [Commented] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734719#comment-16734719 ] Dongjoon Hyun commented on SPARK-26540: --- Given that Apache Spark handles the decimal type with precision and scale, the reported case looks like a corner case. And, I believe this will be a minor improvement issue to cover that corner case.
[jira] [Updated] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26540:
----------------------------------
    Issue Type: Improvement  (was: Bug)

> Requirement failed when reading numeric arrays from PostgreSQL
> --------------------------------------------------------------
>
>                 Key: SPARK-26540
>                 URL: https://issues.apache.org/jira/browse/SPARK-26540
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.2, 2.3.2, 2.4.0
>            Reporter: Takeshi Yamamuro
>            Priority: Minor
>
> This bug was reported in spark-user:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this:
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>       v      |    d
> -------------+----------
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
>       Table "public.t"
>  Column |   Type    | Modifiers
> --------+-----------+-----------
>  v      | numeric[] |
>  d      | numeric   |
>
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  |    |-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 exceeds max precision 0
>     at scala.Predef$.require(Predef.scala:281)
>     at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>     at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
>     ...
> {code}
> I looked over the related code, and I think we need more logic to handle numeric arrays:
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
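The failure above can be modeled without a database: JDBC reports an unbounded PostgreSQL NUMERIC with precision 0 and scale 0, and mapping that straight to decimal(0,0) makes any non-zero value overflow. Below is a minimal, hedged sketch of the kind of fallback the linked PostgresDialect.scala line would need; `map_numeric` is a hypothetical helper for illustration, not Spark's actual dialect API.

```python
# Illustrative sketch only (not Spark code): an unbounded NUMERIC arrives
# from JDBC as precision=0, scale=0. Falling back to a wide default mirrors
# how the plain "d numeric" column above became decimal(38,18).
def map_numeric(precision: int, scale: int) -> tuple:
    if precision == 0:
        # Unbounded NUMERIC: substitute the system default instead of
        # producing an unusable decimal(0,0) element type.
        precision, scale = 38, 18
    return ("decimal", precision, scale)
```

Without such a fallback, the element type decimal(0,0) rejects every value, which is exactly the "Decimal precision 4 exceeds max precision 0" error in the trace.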
[jira] [Updated] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26540:
----------------------------------
    Priority: Minor  (was: Major)
[jira] [Commented] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734718#comment-16734718 ] Dongjoon Hyun commented on SPARK-26540:
---------------------------------------
[~maropu]. Please specify the precision and scale. Then, it will work.
{code}
CREATE TABLE t (v numeric(7,3)[], d numeric);
{code}
{code}
scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
pgTable: org.apache.spark.sql.DataFrame = [v: array<decimal(7,3)>, d: decimal(38,18)]

scala> pgTable.show
+------------+------------+
|           v|           d|
+------------+------------+
|[.222, .332]|222.45550...|
+------------+------------+
{code}
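The workaround above works because a declared `numeric(7,3)` gives the array elements a usable precision. A toy model of the check that throws in `org.apache.spark.sql.types.Decimal.set` (a deliberate simplification for illustration, not Spark's implementation):

```python
# Toy model (not Spark code) of the require() that fails: a value needing
# more digits than the declared precision is rejected outright.
def check_precision(num_digits: int, declared_precision: int) -> None:
    if num_digits > declared_precision:
        raise ValueError(
            "requirement failed: Decimal precision %d exceeds max precision %d"
            % (num_digits, declared_precision)
        )
```

With an unbounded `numeric[]` the declared precision is 0, so every non-zero element trips the check; with `numeric(7,3)[]` the same values pass.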
[jira] [Commented] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734715#comment-16734715 ] Dongjoon Hyun commented on SPARK-26540:
---------------------------------------
I'm checking the information PostgresJDBC ResultSet. I'll post the result here, [~maropu].
[jira] [Comment Edited] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734715#comment-16734715 ] Dongjoon Hyun edited comment on SPARK-26540 at 1/5/19 12:58 AM:
----------------------------------------------------------------
I'm checking the information from PostgresJDBC ResultSet. I'll post the result here, [~maropu].

was (Author: dongjoon):
I'm checking the information PostgresJDBC ResultSet. I'll post the result here, [~maropu].
[jira] [Commented] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734708#comment-16734708 ] Dongjoon Hyun commented on SPARK-26540:
---------------------------------------
For me, this is not a correctness issue and not a regression. Could you check the older Spark like 2.0/2.1?
[jira] [Comment Edited] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734708#comment-16734708 ] Dongjoon Hyun edited comment on SPARK-26540 at 1/5/19 12:41 AM:
----------------------------------------------------------------
For me, this is not a correctness issue and not a regression. Could you check the older Spark like 2.0/2.1? I guess this is also not a data loss issue.

was (Author: dongjoon):
For me, this is not a correctness issue and not a regression. Could you check the older Spark like 2.0/2.1?
[jira] [Commented] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734701#comment-16734701 ] Takeshi Yamamuro commented on SPARK-26540:
------------------------------------------
[~dongjoon] [~smilegator] Could you check if this issue is worth blocking the release processes?
[jira] [Updated] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-26540:
-------------------------------------
    Description: (reproduction steps as quoted in the issue body above)
[jira] [Created] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
Takeshi Yamamuro created SPARK-26540:
-------------------------------------

             Summary: Requirement failed when reading numeric arrays from PostgreSQL
                 Key: SPARK-26540
                 URL: https://issues.apache.org/jira/browse/SPARK-26540
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0, 2.3.2, 2.2.2
            Reporter: Takeshi Yamamuro
[jira] [Updated] (SPARK-26078) WHERE .. IN fails to filter rows when used in combination with UNION
[ https://issues.apache.org/jira/browse/SPARK-26078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26078: -- Fix Version/s: 2.4.1 > WHERE .. IN fails to filter rows when used in combination with UNION > > > Key: SPARK-26078 > URL: https://issues.apache.org/jira/browse/SPARK-26078 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0 >Reporter: Arttu Voutilainen >Assignee: Marco Gaido >Priority: Blocker > Labels: correctness > Fix For: 2.4.1, 3.0.0 > > > Hey, > We encountered a case where Spark SQL does not seem to handle WHERE .. IN > correctly, when used in combination with UNION, but instead returns also rows > that do not fulfill the condition. Swapping the order of the datasets in the > UNION makes the problem go away. Repro below: > > {code} > sql = SQLContext(sc) > a = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}]) > b = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}]) > a.registerTempTable('a') > b.registerTempTable('b') > bug = sql.sql(""" > SELECT id,num,source FROM > ( > SELECT id, num, 'a' as source FROM a > UNION ALL > SELECT id, num, 'b' as source FROM b > ) AS c > WHERE c.id IN (SELECT id FROM b WHERE num = 2) > """) > no_bug = sql.sql(""" > SELECT id,num,source FROM > ( > SELECT id, num, 'b' as source FROM b > UNION ALL > SELECT id, num, 'a' as source FROM a > ) AS c > WHERE c.id IN (SELECT id FROM b WHERE num = 2) > """) > bug.show() > no_bug.show() > bug.explain(True) > no_bug.explain(True) > {code} > This results in one extra row in the "bug" DF coming from DF "b", that should > not be there as it > {code:java} > >>> bug.show() > +---+---+--+ > | id|num|source| > +---+---+--+ > | a| 2| a| > | a| 2| b| > | b| 1| b| > +---+---+--+ > >>> no_bug.show() > +---+---+--+ > | id|num|source| > +---+---+--+ > | a| 2| b| > | a| 2| a| > +---+---+--+ > {code} > The reason can be seen in the query plans: > {code:java} > >>> bug.explain(True) > 
... > == Optimized Logical Plan == > Union > :- Project [id#0, num#1L, a AS source#136] > : +- Join LeftSemi, (id#0 = id#4) > : :- LogicalRDD [id#0, num#1L], false > : +- Project [id#4] > :+- Filter (isnotnull(num#5L) && (num#5L = 2)) > : +- LogicalRDD [id#4, num#5L], false > +- Join LeftSemi, (id#4#172 = id#4#172) >:- Project [id#4, num#5L, b AS source#137] >: +- LogicalRDD [id#4, num#5L], false >+- Project [id#4 AS id#4#172] > +- Filter (isnotnull(num#5L) && (num#5L = 2)) > +- LogicalRDD [id#4, num#5L], false > {code} > Note the line *+- Join LeftSemi, (id#4#172 = id#4#172)* - this condition > seems wrong, and I believe it causes the LeftSemi to return true for all rows > in the left-hand-side table, thus failing to filter as the WHERE .. IN > should. Compare with the non-buggy version, where both LeftSemi joins have > distinct #-things on both sides: > {code:java} > >>> no_bug.explain() > ... > == Optimized Logical Plan == > Union > :- Project [id#4, num#5L, b AS source#142] > : +- Join LeftSemi, (id#4 = id#4#173) > : :- LogicalRDD [id#4, num#5L], false > : +- Project [id#4 AS id#4#173] > :+- Filter (isnotnull(num#5L) && (num#5L = 2)) > : +- LogicalRDD [id#4, num#5L], false > +- Project [id#0, num#1L, a AS source#143] >+- Join LeftSemi, (id#0 = id#4#173) > :- LogicalRDD [id#0, num#1L], false > +- Project [id#4 AS id#4#173] > +- Filter (isnotnull(num#5L) && (num#5L = 2)) > +- LogicalRDD [id#4, num#5L], false > {code} > > Best, > -Arttu > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
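The effect of the bad rewrite can be modeled outside Spark: a LeftSemi join whose condition compares a left attribute to itself keeps every left row whenever the right side is non-empty, which is exactly the extra-row symptom above. A minimal sketch in plain Python (not Spark's operators; the row values follow the repro):

```python
# Minimal model of a LeftSemi join: keep a left row if ANY right row
# satisfies the condition. Illustrates why the degenerate predicate
# (id#4#172 = id#4#172) stops filtering anything.
def left_semi(left_rows, right_rows, cond):
    return [l for l in left_rows if any(cond(l, r) for r in right_rows)]

left = [{"id": "a", "num": 2}, {"id": "b", "num": 1}]
right = [{"id": "a"}]  # ids from the subquery WHERE num = 2

# Buggy plan: both sides of the equality resolved to the same attribute,
# so the predicate is trivially true and every left row survives.
buggy = left_semi(left, right, lambda l, r: l["id"] == l["id"])

# Intended plan: the left id must match a right id.
correct = left_semi(left, right, lambda l, r: l["id"] == r["id"])
```

With the buggy condition both rows survive (including `id = b`); with the intended condition only `id = a` remains, matching the `no_bug` output.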
[jira] [Resolved] (SPARK-26306) Flaky test: org.apache.spark.util.collection.SorterSuite
[ https://issues.apache.org/jira/browse/SPARK-26306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26306. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23425 [https://github.com/apache/spark/pull/23425] > Flaky test: org.apache.spark.util.collection.SorterSuite > > > Key: SPARK-26306 > URL: https://issues.apache.org/jira/browse/SPARK-26306 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Assignee: Sean Owen >Priority: Minor > Fix For: 3.0.0 > > > TestTimSort causes OOM frequently. > {code:java} > [info] org.apache.spark.util.collection.SorterSuite *** ABORTED *** (3 > seconds, 225 milliseconds) > [info] java.lang.OutOfMemoryError: Java heap space > [info] at > org.apache.spark.util.collection.TestTimSort.createArray(TestTimSort.java:56) > [info] at > org.apache.spark.util.collection.TestTimSort.getTimSortBugTestSet(TestTimSort.java:43) > [info] at > org.apache.spark.util.collection.SorterSuite.$anonfun$new$8(SorterSuite.scala:70) > [info] at > org.apache.spark.util.collection.SorterSuite$$Lambda$11365/360747485.apply$mcV$sp(Unknown > Source) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > [info] at > org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103) > [info] at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > [info] at > org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > [info] at 
org.scalatest.FunSuiteLike$$Lambda$132/1886906768.apply(Unknown > Source) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > [info] at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) > [info] at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) > [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1560) > [info] at > org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229) > [info] at org.scalatest.FunSuiteLike$$Lambda$128/398936629.apply(Unknown > Source) > [info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396) > [info] at org.scalatest.SuperEngine$$Lambda$129/1905082148.apply(Unknown > Source) > [info] at scala.collection.immutable.List.foreach(List.scala:388) > [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379) > [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > [info] at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229) > [info] at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228) > [info] at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > [info] at org.scalatest.Suite.run(Suite.scala:1147) > [info] at org.scalatest.Suite.run$(Suite.scala:1129) > [error] Uncaught exception when running > org.apache.spark.util.collection.SorterSuite: java.lang.OutOfMemoryError: > Java heap space > sbt.ForkMain$ForkError: java.lang.OutOfMemoryError: Java heap space > at > org.apache.spark.util.collection.TestTimSort.createArray(TestTimSort.java:56) > at > org.apache.spark.util.collection.TestTimSort.getTimSortBugTestSet(TestTimSort.java:43) > at > org.apache.spark.util.collection.SorterSuite.$anonfun$new$8(SorterSuite.scala:70) > at > org.apache.spark.util.collection.SorterSuite$$Lambda$11365/360747485.apply$mcV$sp(Unknown > Source) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) > at 
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103) > at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > at org.scalatest.FunSuiteLike$$Lambda$132/1886906768.apply(Unknown > Source) > at
[jira] [Resolved] (SPARK-26503) Get rid of spark.sql.legacy.timeParser.enabled
[ https://issues.apache.org/jira/browse/SPARK-26503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26503. --- Resolution: Won't Fix See PR discussion > Get rid of spark.sql.legacy.timeParser.enabled > -- > > Key: SPARK-26503 > URL: https://issues.apache.org/jira/browse/SPARK-26503 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > The flag is used in CSV/JSON datasources as well as in time-related functions to > control parsing/formatting of dates/timestamps. By default, DateTimeFormat is > used for the purpose, but the flag allows switching back to SimpleDateFormat and > some fallbacks. In the major release 3.0, the flag should be removed, and the new > formatters DateFormatter/TimestampFormatter should be used by default. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
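The behavior the flag controlled (try the new strict formatter first, then fall back to the legacy, more lenient one) can be sketched in a few lines of Python. This is illustrative only; Spark's actual parsers are Scala classes, and the patterns below are stand-ins, not Spark's real formats:

```python
from datetime import datetime

def parse_timestamp(value: str, legacy_fallback: bool = True) -> datetime:
    """Parse with a strict pattern first; optionally fall back to a
    looser legacy pattern, mirroring the switch that
    spark.sql.legacy.timeParser.enabled used to control."""
    try:
        return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        if not legacy_fallback:
            raise
        # Legacy-style fallback: also accept a date-only value.
        return datetime.strptime(value, "%Y-%m-%d")
```

With the fallback disabled, a date-only string raises, which is roughly the stricter behavior the new DateFormatter/TimestampFormatter path implies.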
[jira] [Assigned] (SPARK-26306) Flaky test: org.apache.spark.util.collection.SorterSuite
[ https://issues.apache.org/jira/browse/SPARK-26306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26306: - Assignee: Sean Owen > Flaky test: org.apache.spark.util.collection.SorterSuite > > > Key: SPARK-26306 > URL: https://issues.apache.org/jira/browse/SPARK-26306 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Assignee: Sean Owen >Priority: Minor > > TestTimSort causes OOM frequently. > {code:java} > [info] org.apache.spark.util.collection.SorterSuite *** ABORTED *** (3 > seconds, 225 milliseconds) > [info] java.lang.OutOfMemoryError: Java heap space > [info] at > org.apache.spark.util.collection.TestTimSort.createArray(TestTimSort.java:56) > [info] at > org.apache.spark.util.collection.TestTimSort.getTimSortBugTestSet(TestTimSort.java:43) > [info] at > org.apache.spark.util.collection.SorterSuite.$anonfun$new$8(SorterSuite.scala:70) > [info] at > org.apache.spark.util.collection.SorterSuite$$Lambda$11365/360747485.apply$mcV$sp(Unknown > Source) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > [info] at > org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103) > [info] at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > [info] at > org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > [info] at org.scalatest.FunSuiteLike$$Lambda$132/1886906768.apply(Unknown > Source) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > [info] at 
org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) > [info] at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) > [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1560) > [info] at > org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229) > [info] at org.scalatest.FunSuiteLike$$Lambda$128/398936629.apply(Unknown > Source) > [info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396) > [info] at org.scalatest.SuperEngine$$Lambda$129/1905082148.apply(Unknown > Source) > [info] at scala.collection.immutable.List.foreach(List.scala:388) > [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379) > [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > [info] at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229) > [info] at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228) > [info] at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > [info] at org.scalatest.Suite.run(Suite.scala:1147) > [info] at org.scalatest.Suite.run$(Suite.scala:1129) > [error] Uncaught exception when running > org.apache.spark.util.collection.SorterSuite: java.lang.OutOfMemoryError: > Java heap space > sbt.ForkMain$ForkError: java.lang.OutOfMemoryError: Java heap space > at > org.apache.spark.util.collection.TestTimSort.createArray(TestTimSort.java:56) > at > org.apache.spark.util.collection.TestTimSort.getTimSortBugTestSet(TestTimSort.java:43) > at > org.apache.spark.util.collection.SorterSuite.$anonfun$new$8(SorterSuite.scala:70) > at > org.apache.spark.util.collection.SorterSuite$$Lambda$11365/360747485.apply$mcV$sp(Unknown > Source) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at 
org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103) > at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > at org.scalatest.FunSuiteLike$$Lambda$132/1886906768.apply(Unknown > Source) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) > at
[jira] [Assigned] (SPARK-26539) Remove spark.memory.useLegacyMode memory settings + StaticMemoryManager
[ https://issues.apache.org/jira/browse/SPARK-26539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26539: Assignee: Sean Owen (was: Apache Spark) > Remove spark.memory.useLegacyMode memory settings + StaticMemoryManager > --- > > Key: SPARK-26539 > URL: https://issues.apache.org/jira/browse/SPARK-26539 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > > The old memory manager was superseded by UnifiedMemoryManager in Spark 1.6, > and has been the default since. I think we could remove it to simplify the > code a little, but more importantly to reduce the variety of memory settings > users are confronted with when using Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26539) Remove spark.memory.useLegacyMode memory settings + StaticMemoryManager
[ https://issues.apache.org/jira/browse/SPARK-26539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-26539: -- Docs Text: The legacy memory manager, enabled by spark.memory.useLegacyMode, has been removed in Spark 3. The setting now behaves as if it were always false, which has been its default since Spark 1.6. Associated settings spark.shuffle.memoryFraction, spark.storage.memoryFraction, and spark.storage.unrollFraction have been removed as well. Labels: release-notes (was: ) I should say I think this is still open to discussion, but per a quick dev@ thread about it, I didn't hear strong objections. > Remove spark.memory.useLegacyMode memory settings + StaticMemoryManager > --- > > Key: SPARK-26539 > URL: https://issues.apache.org/jira/browse/SPARK-26539 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > Labels: release-notes > > The old memory manager was superseded by UnifiedMemoryManager in Spark 1.6, > which has been the default since. I think we could remove it to simplify the > code a little, but more importantly to reduce the variety of memory settings > users are confronted with when using Spark.
[jira] [Assigned] (SPARK-26539) Remove spark.memory.useLegacyMode memory settings + StaticMemoryManager
[ https://issues.apache.org/jira/browse/SPARK-26539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26539: Assignee: Apache Spark (was: Sean Owen) > Remove spark.memory.useLegacyMode memory settings + StaticMemoryManager > --- > > Key: SPARK-26539 > URL: https://issues.apache.org/jira/browse/SPARK-26539 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Apache Spark >Priority: Major > > The old memory manager was superseded by UnifiedMemoryManager in Spark 1.6, > and has been the default since. I think we could remove it to simplify the > code a little, but more importantly to reduce the variety of memory settings > users are confronted with when using Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26539) Remove spark.memory.useLegacyMode memory settings + StaticMemoryManager
Sean Owen created SPARK-26539: - Summary: Remove spark.memory.useLegacyMode memory settings + StaticMemoryManager Key: SPARK-26539 URL: https://issues.apache.org/jira/browse/SPARK-26539 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 3.0.0 Reporter: Sean Owen Assignee: Sean Owen The old memory manager was superseded by UnifiedMemoryManager in Spark 1.6, and has been the default since. I think we could remove it to simplify the code a little, but more importantly to reduce the variety of memory settings users are confronted with when using Spark.
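For context on what remains after the removal: UnifiedMemoryManager sizes a single pool shared by execution and storage. A rough Python sketch of the initial sizing arithmetic, assuming the commonly documented defaults of 300 MB reserved memory, spark.memory.fraction=0.6, and spark.memory.storageFraction=0.5 (these numbers are assumptions for illustration, not read from a live Spark config):

```python
# Assumed defaults for the unified memory manager (illustrative only):
#   300 MB reserved, spark.memory.fraction=0.6, spark.memory.storageFraction=0.5
RESERVED_BYTES = 300 * 1024 * 1024

def unified_memory_pools(heap_bytes: int,
                         memory_fraction: float = 0.6,
                         storage_fraction: float = 0.5) -> tuple:
    """Return (unified pool size, initial storage region size) in bytes.
    At runtime execution and storage can borrow from each other; this
    computes only the initial split."""
    usable = heap_bytes - RESERVED_BYTES
    pool = int(usable * memory_fraction)
    storage = int(pool * storage_fraction)
    return pool, storage
```

The point of the removal is that users tune only these few knobs, instead of the old static shuffle/storage/unroll fractions.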
[jira] [Assigned] (SPARK-26538) Postgres numeric array support
[ https://issues.apache.org/jira/browse/SPARK-26538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26538: Assignee: (was: Apache Spark) > Postgres numeric array support > -- > > Key: SPARK-26538 > URL: https://issues.apache.org/jira/browse/SPARK-26538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.2, 2.4.1 > Environment: PostgreSQL 10.4, 9.6.9. >Reporter: Oleksii >Priority: Minor > > Consider the following table definition: > {code:sql} > create table test1 > ( > v numeric[], > d numeric > ); > insert into test1 values('{.222,.332}', 222.4555); > {code} > When reading the table into a Dataframe, I get the following schema: > {noformat} > root > |-- v: array (nullable = true) > | |-- element: decimal(0,0) (containsNull = true) > |-- d: decimal(38,18) (nullable = true){noformat} > Notice that for both columns precision and scale were not specified, but in > case of the array element I got both set to 0, while in the other case > defaults were set. > Later, when I try to read the Dataframe, I get the following error: > {noformat} > java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 > exceeds max precision 0 > at scala.Predef$.require(Predef.scala:224) > at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114) > at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:453) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$16$$anonfun$apply$6$$anonfun$apply$7.apply(JdbcUtils.scala:474) > ...{noformat} > I would expect to get array elements of type decimal(38,18) and no error when > reading in this case. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26538) Postgres numeric array support
[ https://issues.apache.org/jira/browse/SPARK-26538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26538: Assignee: Apache Spark > Postgres numeric array support > -- > > Key: SPARK-26538 > URL: https://issues.apache.org/jira/browse/SPARK-26538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.2, 2.4.1 > Environment: PostgreSQL 10.4, 9.6.9. >Reporter: Oleksii >Assignee: Apache Spark >Priority: Minor > > Consider the following table definition: > {code:sql} > create table test1 > ( > v numeric[], > d numeric > ); > insert into test1 values('{.222,.332}', 222.4555); > {code} > When reading the table into a Dataframe, I get the following schema: > {noformat} > root > |-- v: array (nullable = true) > | |-- element: decimal(0,0) (containsNull = true) > |-- d: decimal(38,18) (nullable = true){noformat} > Notice that for both columns precision and scale were not specified, but in > case of the array element I got both set to 0, while in the other case > defaults were set. > Later, when I try to read the Dataframe, I get the following error: > {noformat} > java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 > exceeds max precision 0 > at scala.Predef$.require(Predef.scala:224) > at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114) > at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:453) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$16$$anonfun$apply$6$$anonfun$apply$7.apply(JdbcUtils.scala:474) > ...{noformat} > I would expect to get array elements of type decimal(38,18) and no error when > reading in this case. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26538) Postgres numeric array support
Oleksii created SPARK-26538: --- Summary: Postgres numeric array support Key: SPARK-26538 URL: https://issues.apache.org/jira/browse/SPARK-26538 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.2, 2.2.2, 2.4.1 Environment: PostgreSQL 10.4, 9.6.9. Reporter: Oleksii Consider the following table definition: {code:sql} create table test1 ( v numeric[], d numeric ); insert into test1 values('{.222,.332}', 222.4555); {code} When reading the table into a Dataframe, I get the following schema: {noformat} root |-- v: array (nullable = true) | |-- element: decimal(0,0) (containsNull = true) |-- d: decimal(38,18) (nullable = true){noformat} Notice that for both columns precision and scale were not specified, but in case of the array element I got both set to 0, while in the other case defaults were set. Later, when I try to read the Dataframe, I get the following error: {noformat} java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 exceeds max precision 0 at scala.Predef$.require(Predef.scala:224) at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114) at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:453) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$16$$anonfun$apply$6$$anonfun$apply$7.apply(JdbcUtils.scala:474) ...{noformat} I would expect to get array elements of type decimal(38,18) and no error when reading in this case.
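The failure comes from Spark's Decimal.set rejecting a value with more digits than the precision inferred from JDBC metadata, which for the unconstrained numeric[] element came back as (0, 0). A pure-Python sketch of that check plus the (38, 18) fallback the reporter expects; the function names here are hypothetical illustrations, not Spark's actual API:

```python
from decimal import Decimal

SYSTEM_DEFAULT = (38, 18)  # the default precision/scale the reporter expects

def check_precision(value: Decimal, precision: int) -> Decimal:
    """Mimic the requirement Spark's Decimal.set enforces: the number
    of significant digits must not exceed the declared precision."""
    digits = len(value.as_tuple().digits)
    if digits > precision:
        raise ValueError(
            f"Decimal precision {digits} exceeds max precision {precision}")
    return value

def array_element_type(jdbc_precision: int, jdbc_scale: int) -> tuple:
    """Hypothetical fix: fall back to the system default when JDBC
    metadata reports (0, 0) for an unconstrained numeric[] element,
    as already happens for plain numeric columns."""
    if jdbc_precision == 0:
        return SYSTEM_DEFAULT
    return (jdbc_precision, jdbc_scale)
```

With the (0, 0) metadata passed through unchanged, any non-zero value fails the precision check, which matches the stack trace above.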
[jira] [Commented] (SPARK-26537) update the release scripts to point to gitbox
[ https://issues.apache.org/jira/browse/SPARK-26537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734540#comment-16734540 ] shane knapp commented on SPARK-26537: - [~srowen] i'm pretty sure that my change to the root pom is fine, but a second set of eyes would be great. :) > update the release scripts to point to gitbox > - > > Key: SPARK-26537 > URL: https://issues.apache.org/jira/browse/SPARK-26537 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.3.0, 2.4.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > we're seeing packaging build failures like this: > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2179/console] > i did a quick skim through the repo, and found the offending urls to the old > apache git repos: > > {code:java} > (py35) ➜ spark git:(update-apache-repo) grep -r git-wip * > dev/create-release/release-tag.sh:ASF_SPARK_REPO="git-wip-us.apache.org/repos/asf/spark.git" > dev/create-release/release-util.sh:ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git; > dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://git-wip-us.apache.org/repos/asf?p=spark.git; > pom.xml: > scm:git:https://git-wip-us.apache.org/repos/asf/spark.git > {code} > this affects all versions of spark, so it will need to be backported to all > released versions. > i'll put together a pull request later today. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26537) update the release scripts to point to gitbox
[ https://issues.apache.org/jira/browse/SPARK-26537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26537: Assignee: shane knapp (was: Apache Spark) > update the release scripts to point to gitbox > - > > Key: SPARK-26537 > URL: https://issues.apache.org/jira/browse/SPARK-26537 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.3.0, 2.4.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > we're seeing packaging build failures like this: > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2179/console] > i did a quick skim through the repo, and found the offending urls to the old > apache git repos: > > {code:java} > (py35) ➜ spark git:(update-apache-repo) grep -r git-wip * > dev/create-release/release-tag.sh:ASF_SPARK_REPO="git-wip-us.apache.org/repos/asf/spark.git" > dev/create-release/release-util.sh:ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git; > dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://git-wip-us.apache.org/repos/asf?p=spark.git; > pom.xml: > scm:git:https://git-wip-us.apache.org/repos/asf/spark.git > {code} > this affects all versions of spark, so it will need to be backported to all > released versions. > i'll put together a pull request later today. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26537) update the release scripts to point to gitbox
[ https://issues.apache.org/jira/browse/SPARK-26537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26537: Assignee: Apache Spark (was: shane knapp) > update the release scripts to point to gitbox > - > > Key: SPARK-26537 > URL: https://issues.apache.org/jira/browse/SPARK-26537 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.3.0, 2.4.0 >Reporter: shane knapp >Assignee: Apache Spark >Priority: Major > > we're seeing packaging build failures like this: > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2179/console] > i did a quick skim through the repo, and found the offending urls to the old > apache git repos: > > {code:java} > (py35) ➜ spark git:(update-apache-repo) grep -r git-wip * > dev/create-release/release-tag.sh:ASF_SPARK_REPO="git-wip-us.apache.org/repos/asf/spark.git" > dev/create-release/release-util.sh:ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git; > dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://git-wip-us.apache.org/repos/asf?p=spark.git; > pom.xml: > scm:git:https://git-wip-us.apache.org/repos/asf/spark.git > {code} > this affects all versions of spark, so it will need to be backported to all > released versions. > i'll put together a pull request later today. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26537) update the release scripts to point to gitbox
[ https://issues.apache.org/jira/browse/SPARK-26537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734531#comment-16734531 ] Dongjoon Hyun edited comment on SPARK-26537 at 1/4/19 8:20 PM: --- Thank you for investigating this, [~shaneknapp]. was (Author: dongjoon): Thank you for investing this, [~shaneknapp]. > update the release scripts to point to gitbox > - > > Key: SPARK-26537 > URL: https://issues.apache.org/jira/browse/SPARK-26537 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.3.0, 2.4.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > we're seeing packaging build failures like this: > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2179/console] > i did a quick skim through the repo, and found the offending urls to the old > apache git repos: > > {code:java} > (py35) ➜ spark git:(update-apache-repo) grep -r git-wip * > dev/create-release/release-tag.sh:ASF_SPARK_REPO="git-wip-us.apache.org/repos/asf/spark.git" > dev/create-release/release-util.sh:ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git; > dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://git-wip-us.apache.org/repos/asf?p=spark.git; > pom.xml: > scm:git:https://git-wip-us.apache.org/repos/asf/spark.git > {code} > this affects all versions of spark, so it will need to be backported to all > released versions. > i'll put together a pull request later today. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26537) update the release scripts to point to gitbox
[ https://issues.apache.org/jira/browse/SPARK-26537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734531#comment-16734531 ] Dongjoon Hyun commented on SPARK-26537: --- Thank you for investing this, [~shaneknapp]. > update the release scripts to point to gitbox > - > > Key: SPARK-26537 > URL: https://issues.apache.org/jira/browse/SPARK-26537 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.3.0, 2.4.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > we're seeing packaging build failures like this: > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2179/console] > i did a quick skim through the repo, and found the offending urls to the old > apache git repos: > > {code:java} > (py35) ➜ spark git:(update-apache-repo) grep -r git-wip * > dev/create-release/release-tag.sh:ASF_SPARK_REPO="git-wip-us.apache.org/repos/asf/spark.git" > dev/create-release/release-util.sh:ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git; > dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://git-wip-us.apache.org/repos/asf?p=spark.git; > pom.xml: > scm:git:https://git-wip-us.apache.org/repos/asf/spark.git > {code} > this affects all versions of spark, so it will need to be backported to all > released versions. > i'll put together a pull request later today. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26537) update the release scripts to point to gitbox
shane knapp created SPARK-26537: --- Summary: update the release scripts to point to gitbox Key: SPARK-26537 URL: https://issues.apache.org/jira/browse/SPARK-26537 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.4.0, 2.3.0, 2.1.0, 2.0.0, 1.6.0 Reporter: shane knapp Assignee: shane knapp we're seeing packaging build failures like this: [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2179/console] i did a quick skim through the repo, and found the offending urls to the old apache git repos: {code:java} (py35) ➜ spark git:(update-apache-repo) grep -r git-wip * dev/create-release/release-tag.sh:ASF_SPARK_REPO="git-wip-us.apache.org/repos/asf/spark.git" dev/create-release/release-util.sh:ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git; dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://git-wip-us.apache.org/repos/asf?p=spark.git; pom.xml: scm:git:https://git-wip-us.apache.org/repos/asf/spark.git {code} this affects all versions of spark, so it will need to be backported to all released versions. i'll put together a pull request later today.
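The fix itself amounts to a mechanical URL rewrite in the files the grep found. A hedged Python sketch; note that the gitbox.apache.org/repos/asf replacement target is an assumption on my part, so check the actual pull request for the exact URLs:

```python
import re
from pathlib import Path

OLD_HOST = "git-wip-us.apache.org/repos/asf"
NEW_HOST = "gitbox.apache.org/repos/asf"  # assumed gitbox target, verify in the PR

def migrate_repo_urls(path: Path) -> int:
    """Rewrite retired ASF git URLs in a file in place and return the
    number of replacements made."""
    text = path.read_text()
    new_text, count = re.subn(re.escape(OLD_HOST), NEW_HOST, text)
    if count:
        path.write_text(new_text)
    return count
```

Running this over dev/create-release/release-tag.sh, dev/create-release/release-util.sh, and pom.xml would cover the occurrences listed above.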
[jira] [Comment Edited] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734510#comment-16734510 ] Prashanth Sandela edited comment on SPARK-19809 at 1/4/19 7:40 PM: --- [~dongjoon] I'm encountering a similar issue with Spark version 2.3.1. I'm trying to read from a table which was ingested by Sqoop. There are a few 0-byte files for this table. The file sizes look like below: {noformat}
-rw-rw-r-- 3 cloud-user root 17.3 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-0
-rw-rw-r-- 3 cloud-user root 10.3 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-1
-rw-rw-r-- 3 cloud-user root 19.9 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-2
-rw-rw-r-- 3 cloud-user root 13.0 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-3
-rw-rw-r-- 3 cloud-user root 0 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-4
-rw-rw-r-- 3 cloud-user root 3.4 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-5
-rw-rw-r-- 3 cloud-user root 13.8 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-6
-rw-rw-r-- 3 cloud-user root 0 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-7
-rw-rw-r-- 3 cloud-user root 0 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-8
-rw-rw-r-- 3 cloud-user root 6.9 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-9
-rw-rw-r-- 3 cloud-user root 9.0 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-00010
-rw-rw-r-- 3 cloud-user root 11.4 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-00011
-rw-rw-r-- 3 cloud-user root 14.7 M 2019-01-03 22:20
/apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-00012 -rw-rw-r-- 3 cloud-user root 17.4 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-00013 -rw-rw-r-- 3 cloud-user root 17.1 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-00014{noformat} Spark throws exception while reading this table. {noformat} scala> spark.read.table("table_with_few_zero_byte_files").show() java.lang.RuntimeException: serious problem at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
[jira] [Commented] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734523#comment-16734523 ] Prashanth Sandela commented on SPARK-19809: --- Awesome! This config of `spark.sql.orc.impl=native` works. Thanks for quick response. > NullPointerException on zero-size ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1 >Reporter: Michał Dawid >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.0 > > Attachments: image-2018-02-26-20-29-49-410.png, > spark.sql.hive.convertMetastoreOrc.txt > > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > 
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at
[jira] [Commented] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734521#comment-16734521 ] Dongjoon Hyun commented on SPARK-19809: --- Hi, did you use `spark.sql.orc.impl=native`, too? New ORC is not default in Spark 2.3.x. > NullPointerException on zero-size ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1 >Reporter: Michał Dawid >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.0 > > Attachments: image-2018-02-26-20-29-49-410.png, > spark.sql.hive.convertMetastoreOrc.txt > > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > 
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at
[jira] [Commented] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734510#comment-16734510 ] Prashanth Sandela commented on SPARK-19809: --- [~dongjoon] I'm encountering same similar issue with spark version 2.3.1 I'm trying to read from a table which was ingested by sqoop. There are few 0 byte files for this table. The file sizes looks like below: {noformat} -rw-rw-r-- 3 cloud-user root 17.3 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-0 -rw-rw-r-- 3 cloud-user root 10.3 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-1 -rw-rw-r-- 3 cloud-user root 19.9 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-2 -rw-rw-r-- 3 cloud-user root 13.0 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-3 -rw-rw-r-- 3 cloud-user root 0 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-4 -rw-rw-r-- 3 cloud-user root 3.4 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-5 -rw-rw-r-- 3 cloud-user root 13.8 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-6 -rw-rw-r-- 3 cloud-user root 0 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-7 -rw-rw-r-- 3 cloud-user root 0 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-8 -rw-rw-r-- 3 cloud-user root 6.9 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-9 -rw-rw-r-- 3 cloud-user root 9.0 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-00010 -rw-rw-r-- 3 cloud-user root 11.4 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-00011 -rw-rw-r-- 3 cloud-user root 14.7 M 2019-01-03 22:20 
/apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-00012 -rw-rw-r-- 3 cloud-user root 17.4 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-00013 -rw-rw-r-- 3 cloud-user root 17.1 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-00014{noformat} Spark throws exception while doing count {noformat} scala> spark.read.table("table_with_few_zero_byte_files").show() java.lang.RuntimeException: serious problem at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at
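The fix that resolved this thread is a configuration change rather than a code change. A minimal spark-shell sketch, assuming Spark 2.3+ and reusing the reporter's table name:

```scala
// Switch to Spark's native ORC reader, which tolerates zero-byte files,
// instead of the Hive OrcInputFormat split computation that throws here.
// Both keys are real Spark SQL configs; whether the second is needed
// depends on how the Hive table is declared in the metastore.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

spark.read.table("table_with_few_zero_byte_files").show()
```

The settings can equally be passed at launch time via `--conf spark.sql.orc.impl=native`, so they apply before any table is first read.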
[jira] [Commented] (SPARK-26254) Move delegation token providers into a separate project
[ https://issues.apache.org/jira/browse/SPARK-26254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734393#comment-16734393 ] Gabor Somogyi commented on SPARK-26254: --- [~hyukjin.kwon] In my last comment right before the ping almost everything is clear, but I don't know what the suggestion related to Kafka is. > Move delegation token providers into a separate project > --- > > Key: SPARK-26254 > URL: https://issues.apache.org/jira/browse/SPARK-26254 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > There was a discussion in > [PR#22598|https://github.com/apache/spark/pull/22598] that there are several > provided dependencies inside the core project which shouldn't be there (for ex. > hive and kafka). This jira is to solve this problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26089) Handle large corrupt shuffle blocks
[ https://issues.apache.org/jira/browse/SPARK-26089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26089: Assignee: Apache Spark > Handle large corrupt shuffle blocks > --- > > Key: SPARK-26089 > URL: https://issues.apache.org/jira/browse/SPARK-26089 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Apache Spark >Priority: Major > > We've seen a bad disk lead to corruption in a shuffle block, which lead to > tasks repeatedly failing after fetching the data with an IOException. The > tasks get retried, but the same corrupt data gets fetched again, and the > tasks keep failing. As there isn't a fetch-failure, the jobs eventually > fail, spark never tries to regenerate the shuffle data. > This is the same as SPARK-4105, but that fix only covered small blocks. > There was some discussion during that change about this limitation > (https://github.com/apache/spark/pull/15923#discussion_r88756017) and > followups to cover larger blocks (which would involve spilling to disk to > avoid OOM), but it looks like that never happened. > I can think of a few approaches to this: > 1) wrap the shuffle block input stream with another input stream, that > converts all exceptions into FetchFailures. This is similar to the fix of > SPARK-4105, but that reads the entire input stream up-front, and instead I'm > proposing to do it within the InputStream itself so its streaming and does > not have a large memory overhead. > 2) Add checksums to shuffle blocks. This was proposed > [here|https://github.com/apache/spark/pull/15894] and abandoned as being too > complex. > 3) Try to tackle this with blacklisting instead: when there is any failure in > a task that is reading shuffle data, assign some "blame" to the source of the > shuffle data, and eventually blacklist the source. It seems really tricky to > get sensible heuristics for this, though. 
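Approach (1) above can be sketched as a thin wrapping stream. This is a hypothetical illustration, not Spark's implementation: `toFetchFailure` stands in for constructing a real `FetchFailedException`, which needs block and executor metadata not shown here.

```scala
import java.io.{FilterInputStream, IOException, InputStream}

// Wrap the shuffle block stream so that any IOException raised while the
// task consumes the data surfaces as a fetch failure (triggering shuffle
// regeneration) instead of a plain, endlessly retried task failure.
class FetchFailureWrappingStream(
    in: InputStream,
    toFetchFailure: IOException => Throwable) extends FilterInputStream(in) {

  // Run the delegated read, rethrowing IO errors as fetch failures.
  private def guard[T](body: => T): T =
    try body catch { case e: IOException => throw toFetchFailure(e) }

  override def read(): Int = guard(super.read())
  override def read(b: Array[Byte], off: Int, len: Int): Int =
    guard(super.read(b, off, len))
  override def skip(n: Long): Long = guard(super.skip(n))
}
```

Unlike the SPARK-4105 fix, nothing is buffered up-front: corruption is detected wherever the consumer happens to hit it, keeping memory overhead constant even for large blocks.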
[jira] [Assigned] (SPARK-26089) Handle large corrupt shuffle blocks
[ https://issues.apache.org/jira/browse/SPARK-26089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26089: Assignee: (was: Apache Spark) > Handle large corrupt shuffle blocks > --- > > Key: SPARK-26089 > URL: https://issues.apache.org/jira/browse/SPARK-26089 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > > We've seen a bad disk lead to corruption in a shuffle block, which lead to > tasks repeatedly failing after fetching the data with an IOException. The > tasks get retried, but the same corrupt data gets fetched again, and the > tasks keep failing. As there isn't a fetch-failure, the jobs eventually > fail, spark never tries to regenerate the shuffle data. > This is the same as SPARK-4105, but that fix only covered small blocks. > There was some discussion during that change about this limitation > (https://github.com/apache/spark/pull/15923#discussion_r88756017) and > followups to cover larger blocks (which would involve spilling to disk to > avoid OOM), but it looks like that never happened. > I can think of a few approaches to this: > 1) wrap the shuffle block input stream with another input stream, that > converts all exceptions into FetchFailures. This is similar to the fix of > SPARK-4105, but that reads the entire input stream up-front, and instead I'm > proposing to do it within the InputStream itself so its streaming and does > not have a large memory overhead. > 2) Add checksums to shuffle blocks. This was proposed > [here|https://github.com/apache/spark/pull/15894] and abandoned as being too > complex. > 3) Try to tackle this with blacklisting instead: when there is any failure in > a task that is reading shuffle data, assign some "blame" to the source of the > shuffle data, and eventually blacklist the source. It seems really tricky to > get sensible heuristics for this, though. 
[jira] [Commented] (SPARK-26362) Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark contexts
[ https://issues.apache.org/jira/browse/SPARK-26362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734371#comment-16734371 ] Hyukjin Kwon commented on SPARK-26362: -- Sure will do. > Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark > contexts > --- > > Key: SPARK-26362 > URL: https://issues.apache.org/jira/browse/SPARK-26362 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: releasenotes > Fix For: 3.0.0 > > > Multiple Spark contexts are discouraged and have triggered warnings for 4 years > (see SPARK-4180). > They could cause arbitrary and mysterious error cases. (Honestly, I didn't even > know Spark allows it).
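With the escape hatch removed in Spark 3, code that previously relied on `spark.driver.allowMultipleContexts` has to serialize context lifetimes explicitly. A minimal sketch, assuming a local-mode driver:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Only one SparkContext may be active per JVM: stop the current one
// before constructing another, or the second constructor throws.
val first = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("first"))
first.stop()  // required before a second context may be created

val second = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("second"))
second.stop()
```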
[jira] [Issue Comment Deleted] (SPARK-26366) Except with transform regression
[ https://issues.apache.org/jira/browse/SPARK-26366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-26366: Comment: was deleted (was: mgaido91 opened a new pull request #23372: [SPARK-26366][SQL][BACKPORT-2.3] ReplaceExceptWithFilter should consider NULL as False URL: https://github.com/apache/spark/pull/23372 ## What changes were proposed in this pull request? In `ReplaceExceptWithFilter` we do not consider properly the case in which the condition returns NULL. Indeed, in that case, since negating NULL still returns NULL, so it is not true the assumption that negating the condition returns all the rows which didn't satisfy it, rows returning NULL may not be returned. This happens when constraints inferred by `InferFiltersFromConstraints` are not enough, as it happens with `OR` conditions. The rule had also problems with non-deterministic conditions: in such a scenario, this rule would change the probability of the output. The PR fixes these problem by: - returning False for the condition when it is Null (in this way we do return all the rows which didn't satisfy it); - avoiding any transformation when the condition is non-deterministic. ## How was this patch tested? added UTs This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org ) > Except with transform regression > > > Key: SPARK-26366 > URL: https://issues.apache.org/jira/browse/SPARK-26366 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.2 >Reporter: Dan Osipov >Assignee: Marco Gaido >Priority: Major > Labels: correctness > Fix For: 2.3.3, 2.4.1, 3.0.0 > > > There appears to be a regression between Spark 2.2 and 2.3. 
Below is the code > to reproduce it: > > {code:java} > import org.apache.spark.sql.functions.col > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val inputDF = spark.sqlContext.createDataFrame( > spark.sparkContext.parallelize(Seq( > Row("0", "john", "smith", "j...@smith.com"), > Row("1", "jane", "doe", "j...@doe.com"), > Row("2", "apache", "spark", "sp...@apache.org"), > Row("3", "foo", "bar", null) > )), > StructType(List( > StructField("id", StringType, nullable=true), > StructField("first_name", StringType, nullable=true), > StructField("last_name", StringType, nullable=true), > StructField("email", StringType, nullable=true) > )) > ) > val exceptDF = inputDF.transform( toProcessDF => > toProcessDF.filter( > ( > col("first_name").isin(Seq("john", "jane"): _*) > and col("last_name").isin(Seq("smith", "doe"): _*) > ) > or col("email").isin(List(): _*) > ) > ) > inputDF.except(exceptDF).show() > {code} > Output with Spark 2.2: > {noformat} > +---+--+-++ > | id|first_name|last_name| email| > +---+--+-++ > | 2| apache| spark|sp...@apache.org| > | 3| foo| bar| null| > +---+--+-++{noformat} > Output with Spark 2.3: > {noformat} > +---+--+-++ > | id|first_name|last_name| email| > +---+--+-++ > | 2| apache| spark|sp...@apache.org| > +---+--+-++{noformat} > Note, changing the last line to > {code:java} > inputDF.except(exceptDF.cache()).show() > {code} > produces identical output for both Spark 2.3 and 2.2 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
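The root cause described in the deleted comment is SQL three-valued logic: for the row whose `email` is null, the filter condition evaluates to NULL, and negating NULL yields NULL again, so a rewrite that selects the complement via `NOT cond` silently drops that row. A spark-shell sketch, assuming a `SparkSession` named `spark`:

```scala
// A null operand makes IN evaluate to NULL rather than false...
spark.sql("SELECT CAST(NULL AS STRING) IN ('a') AS cond").show()

// ...and NOT(NULL) is still NULL, so the row satisfies neither
// WHERE cond nor WHERE NOT cond -- the complement is not the complement.
spark.sql("SELECT NOT (CAST(NULL AS STRING) IN ('a')) AS negated").show()
```

This is why the fix coerces a NULL condition to false before negation: only then does `NOT cond` return every row the original condition did not.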
[jira] [Updated] (SPARK-26362) Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark contexts
[ https://issues.apache.org/jira/browse/SPARK-26362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-26362: -- Docs Text: Support for multiple Spark contexts in the same driver has been removed in Spark 3. > Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark > contexts > --- > > Key: SPARK-26362 > URL: https://issues.apache.org/jira/browse/SPARK-26362 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: releasenotes > Fix For: 3.0.0 > > > Multiple Spark contexts are discouraged and have triggered warnings for 4 years > (see SPARK-4180). > They could cause arbitrary and mysterious error cases. (Honestly, I didn't even > know Spark allows it).
[jira] [Updated] (SPARK-26362) Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark contexts
[ https://issues.apache.org/jira/browse/SPARK-26362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-26362: -- Priority: Minor (was: Major) > Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark > contexts > --- > > Key: SPARK-26362 > URL: https://issues.apache.org/jira/browse/SPARK-26362 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: releasenotes > Fix For: 3.0.0 > > > Multiple Spark contexts are discouraged and have triggered warnings for 4 years > (see SPARK-4180). > They could cause arbitrary and mysterious error cases. (Honestly, I didn't even > know Spark allows it).
[jira] [Updated] (SPARK-26362) Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark contexts
[ https://issues.apache.org/jira/browse/SPARK-26362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-26362: Labels: releasenotes (was: ) > Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark > contexts > --- > > Key: SPARK-26362 > URL: https://issues.apache.org/jira/browse/SPARK-26362 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: releasenotes > Fix For: 3.0.0 > > > Multiple Spark contexts are discouraged and have triggered warnings for 4 years > (see SPARK-4180). > They could cause arbitrary and mysterious error cases. (Honestly, I didn't even > know Spark allows it).
[jira] [Issue Comment Deleted] (SPARK-26362) Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark contexts
[ https://issues.apache.org/jira/browse/SPARK-26362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-26362: Comment: was deleted (was: asfgit closed pull request #23311: [SPARK-26362][CORE] Remove 'spark.driver.allowMultipleContexts' to disallow multiple creation of SparkContexts URL: https://github.com/apache/spark/pull/23311 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala index 696dafda6d1ec..09cc346db0ed2 100644 --- a/core/src/main/scala/org/apache/spark/SparkContext.scala +++ b/core/src/main/scala/org/apache/spark/SparkContext.scala @@ -64,9 +64,8 @@ import org.apache.spark.util.logging.DriverLogger * Main entry point for Spark functionality. A SparkContext represents the connection to a Spark * cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster. * - * Only one SparkContext may be active per JVM. You must `stop()` the active SparkContext before - * creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details. - * + * @note Only one `SparkContext` should be active per JVM. You must `stop()` the + * active `SparkContext` before creating a new one. * @param config a Spark Config object describing the application configuration. Any settings in * this config overrides the default configs as well as system properties. */ @@ -75,14 +74,10 @@ class SparkContext(config: SparkConf) extends Logging { // The call site where this SparkContext was constructed. 
private val creationSite: CallSite = Utils.getCallSite() - // If true, log warnings instead of throwing exceptions when multiple SparkContexts are active - private val allowMultipleContexts: Boolean = -config.getBoolean("spark.driver.allowMultipleContexts", false) - // In order to prevent multiple SparkContexts from being active at the same time, mark this // context as having started construction. // NOTE: this must be placed at the beginning of the SparkContext constructor. - SparkContext.markPartiallyConstructed(this, allowMultipleContexts) + SparkContext.markPartiallyConstructed(this) val startTime = System.currentTimeMillis() @@ -2392,7 +2387,7 @@ class SparkContext(config: SparkConf) extends Logging { // In order to prevent multiple SparkContexts from being active at the same time, mark this // context as having finished construction. // NOTE: this must be placed at the end of the SparkContext constructor. - SparkContext.setActiveContext(this, allowMultipleContexts) + SparkContext.setActiveContext(this) } /** @@ -2409,18 +2404,18 @@ object SparkContext extends Logging { private val SPARK_CONTEXT_CONSTRUCTOR_LOCK = new Object() /** - * The active, fully-constructed SparkContext. If no SparkContext is active, then this is `null`. + * The active, fully-constructed SparkContext. If no SparkContext is active, then this is `null`. * - * Access to this field is guarded by SPARK_CONTEXT_CONSTRUCTOR_LOCK. + * Access to this field is guarded by `SPARK_CONTEXT_CONSTRUCTOR_LOCK`. */ private val activeContext: AtomicReference[SparkContext] = new AtomicReference[SparkContext](null) /** - * Points to a partially-constructed SparkContext if some thread is in the SparkContext + * Points to a partially-constructed SparkContext if another thread is in the SparkContext * constructor, or `None` if no SparkContext is being constructed. * - * Access to this field is guarded by SPARK_CONTEXT_CONSTRUCTOR_LOCK + * Access to this field is guarded by `SPARK_CONTEXT_CONSTRUCTOR_LOCK`. 
*/ private var contextBeingConstructed: Option[SparkContext] = None @@ -2428,24 +2423,16 @@ object SparkContext extends Logging { * Called to ensure that no other SparkContext is running in this JVM. * * Throws an exception if a running context is detected and logs a warning if another thread is - * constructing a SparkContext. This warning is necessary because the current locking scheme + * constructing a SparkContext. This warning is necessary because the current locking scheme * prevents us from reliably distinguishing between cases where another context is being * constructed and cases where another constructor threw an exception. */ - private def assertNoOtherContextIsRunning( - sc: SparkContext, - allowMultipleContexts: Boolean): Unit = { + private def assertNoOtherContextIsRunning(sc: SparkContext): Unit = { SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized {
[jira] [Issue Comment Deleted] (SPARK-26366) Except with transform regression
[ https://issues.apache.org/jira/browse/SPARK-26366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-26366: Comment: was deleted (was: mgaido91 opened a new pull request #23350: [SPARK-26366][SQL][BACKPORT-2.3] ReplaceExceptWithFilter should consider NULL as False URL: https://github.com/apache/spark/pull/23350 ## What changes were proposed in this pull request? In `ReplaceExceptWithFilter` we do not consider properly the case in which the condition returns NULL. Indeed, in that case, since negating NULL still returns NULL, so it is not true the assumption that negating the condition returns all the rows which didn't satisfy it, rows returning NULL may not be returned. This happens when constraints inferred by `InferFiltersFromConstraints` are not enough, as it happens with `OR` conditions. The rule had also problems with non-deterministic conditions: in such a scenario, this rule would change the probability of the output. The PR fixes these problem by: - returning False for the condition when it is Null (in this way we do return all the rows which didn't satisfy it); - avoiding any transformation when the condition is non-deterministic. ## How was this patch tested? added UTs This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org ) > Except with transform regression > > > Key: SPARK-26366 > URL: https://issues.apache.org/jira/browse/SPARK-26366 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.2 >Reporter: Dan Osipov >Assignee: Marco Gaido >Priority: Major > Labels: correctness > Fix For: 2.3.3, 2.4.1, 3.0.0 > > > There appears to be a regression between Spark 2.2 and 2.3. 
Below is the code > to reproduce it: > > {code:java} > import org.apache.spark.sql.functions.col > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val inputDF = spark.sqlContext.createDataFrame( > spark.sparkContext.parallelize(Seq( > Row("0", "john", "smith", "j...@smith.com"), > Row("1", "jane", "doe", "j...@doe.com"), > Row("2", "apache", "spark", "sp...@apache.org"), > Row("3", "foo", "bar", null) > )), > StructType(List( > StructField("id", StringType, nullable=true), > StructField("first_name", StringType, nullable=true), > StructField("last_name", StringType, nullable=true), > StructField("email", StringType, nullable=true) > )) > ) > val exceptDF = inputDF.transform( toProcessDF => > toProcessDF.filter( > ( > col("first_name").isin(Seq("john", "jane"): _*) > and col("last_name").isin(Seq("smith", "doe"): _*) > ) > or col("email").isin(List(): _*) > ) > ) > inputDF.except(exceptDF).show() > {code} > Output with Spark 2.2: > {noformat} > +---+----------+---------+----------------+ > | id|first_name|last_name|           email| > +---+----------+---------+----------------+ > |  2|    apache|    spark|sp...@apache.org| > |  3|       foo|      bar|            null| > +---+----------+---------+----------------+{noformat} > Output with Spark 2.3: > {noformat} > +---+----------+---------+----------------+ > | id|first_name|last_name|           email| > +---+----------+---------+----------------+ > |  2|    apache|    spark|sp...@apache.org| > +---+----------+---------+----------------+{noformat} > Note, changing the last line to > {code:java} > inputDF.except(exceptDF.cache()).show() > {code} > produces identical output for both Spark 2.3 and 2.2 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
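The NULL-trapping behavior described in the deleted PR comment above can be sketched without Spark. The following Python model of SQL three-valued logic (using `None` for NULL, with illustrative rows rather than the repro's actual data) shows why rewriting EXCEPT as `Filter(Not(condition))` silently drops rows whose condition evaluates to NULL, and why wrapping the condition in `Coalesce(condition, False)` restores them:

```python
# Minimal model of SQL three-valued logic: True, False, or None (= SQL NULL).

def sql_not(v):
    # NOT NULL is NULL, not True.
    return None if v is None else (not v)

def coalesce(v, default):
    # COALESCE(cond, FALSE): treat a NULL predicate as False.
    return default if v is None else v

def keep(pred):
    # SQL WHERE keeps a row only when the predicate is strictly TRUE.
    return pred is True

# Illustrative (name, email) rows; the last one has a NULL email.
rows = [("john", "john@example.com"), ("apache", "spark@example.org"), ("foo", None)]

def cond(row):
    # Models the repro's filter: a NULL email makes the whole predicate NULL.
    name, email = row
    if email is None:
        return None
    return name in ("john", "jane")

right = [r for r in rows if keep(cond(r))]                            # EXCEPT's right side
buggy = [r for r in rows if keep(sql_not(cond(r)))]                   # old rewrite: NULL row vanishes
fixed = [r for r in rows if keep(sql_not(coalesce(cond(r), False)))]  # patched rewrite

print(buggy)  # [('apache', 'spark@example.org')] -- ('foo', None) is wrongly dropped
print(fixed)  # [('apache', 'spark@example.org'), ('foo', None)]
```

This is exactly the `| 3| foo| bar| null|` row that goes missing in the Spark 2.3 output above.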
[jira] [Issue Comment Deleted] (SPARK-26366) Except with transform regression
[ https://issues.apache.org/jira/browse/SPARK-26366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-26366: Comment: was deleted (was: mgaido91 opened a new pull request #23315: [SPARK-26366][SQL] ReplaceExceptWithFilter should consider NULL as False URL: https://github.com/apache/spark/pull/23315 ## What changes were proposed in this pull request? In `ReplaceExceptWithFilter` we do not consider the case in which the condition returns NULL. Indeed, in that case, negating NULL still returns NULL, so the assumption that negating the condition returns all the rows which didn't satisfy it does not hold: rows returning NULL are not returned. The PR fixes this problem by returning False for the condition when it is Null. In this way we do return all the rows which didn't satisfy it. ## How was this patch tested? added UTs This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org ) > Except with transform regression > > > Key: SPARK-26366 > URL: https://issues.apache.org/jira/browse/SPARK-26366 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.2 >Reporter: Dan Osipov >Assignee: Marco Gaido >Priority: Major > Labels: correctness > Fix For: 2.3.3, 2.4.1, 3.0.0 > > > There appears to be a regression between Spark 2.2 and 2.3. 
Below is the code > to reproduce it: > > {code:java} > import org.apache.spark.sql.functions.col > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val inputDF = spark.sqlContext.createDataFrame( > spark.sparkContext.parallelize(Seq( > Row("0", "john", "smith", "j...@smith.com"), > Row("1", "jane", "doe", "j...@doe.com"), > Row("2", "apache", "spark", "sp...@apache.org"), > Row("3", "foo", "bar", null) > )), > StructType(List( > StructField("id", StringType, nullable=true), > StructField("first_name", StringType, nullable=true), > StructField("last_name", StringType, nullable=true), > StructField("email", StringType, nullable=true) > )) > ) > val exceptDF = inputDF.transform( toProcessDF => > toProcessDF.filter( > ( > col("first_name").isin(Seq("john", "jane"): _*) > and col("last_name").isin(Seq("smith", "doe"): _*) > ) > or col("email").isin(List(): _*) > ) > ) > inputDF.except(exceptDF).show() > {code} > Output with Spark 2.2: > {noformat} > +---+----------+---------+----------------+ > | id|first_name|last_name|           email| > +---+----------+---------+----------------+ > |  2|    apache|    spark|sp...@apache.org| > |  3|       foo|      bar|            null| > +---+----------+---------+----------------+{noformat} > Output with Spark 2.3: > {noformat} > +---+----------+---------+----------------+ > | id|first_name|last_name|           email| > +---+----------+---------+----------------+ > |  2|    apache|    spark|sp...@apache.org| > +---+----------+---------+----------------+{noformat} > Note, changing the last line to > {code:java} > inputDF.except(exceptDF.cache()).show() > {code} > produces identical output for both Spark 2.3 and 2.2 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-26366) Except with transform regression
[ https://issues.apache.org/jira/browse/SPARK-26366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-26366: Comment: was deleted (was: gatorsmile closed pull request #23350: [SPARK-26366][SQL][BACKPORT-2.3] ReplaceExceptWithFilter should consider NULL as False URL: https://github.com/apache/spark/pull/23350 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceExceptWithFilter.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceExceptWithFilter.scala index 45edf266bbce4..08cf16038a654 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceExceptWithFilter.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceExceptWithFilter.scala @@ -36,7 +36,8 @@ import org.apache.spark.sql.catalyst.rules.Rule * Note: * Before flipping the filter condition of the right node, we should: * 1. Combine all it's [[Filter]]. - * 2. Apply InferFiltersFromConstraints rule (to take into account of NULL values in the condition). + * 2. Update the attribute references to the left node; + * 3. Add a Coalesce(condition, False) (to take into account of NULL values in the condition). 
*/ object ReplaceExceptWithFilter extends Rule[LogicalPlan] { @@ -47,23 +48,28 @@ object ReplaceExceptWithFilter extends Rule[LogicalPlan] { plan.transform { case e @ Except(left, right) if isEligible(left, right) => -val newCondition = transformCondition(left, skipProject(right)) -newCondition.map { c => - Distinct(Filter(Not(c), left)) -}.getOrElse { +val filterCondition = combineFilters(skipProject(right)).asInstanceOf[Filter].condition +if (filterCondition.deterministic) { + transformCondition(left, filterCondition).map { c => +Distinct(Filter(Not(c), left)) + }.getOrElse { +e + } +} else { e } } } - private def transformCondition(left: LogicalPlan, right: LogicalPlan): Option[Expression] = { -val filterCondition = - InferFiltersFromConstraints(combineFilters(right)).asInstanceOf[Filter].condition - -val attributeNameMap: Map[String, Attribute] = left.output.map(x => (x.name, x)).toMap - -if (filterCondition.references.forall(r => attributeNameMap.contains(r.name))) { - Some(filterCondition.transform { case a: AttributeReference => attributeNameMap(a.name) }) + private def transformCondition(plan: LogicalPlan, condition: Expression): Option[Expression] = { +val attributeNameMap: Map[String, Attribute] = plan.output.map(x => (x.name, x)).toMap +if (condition.references.forall(r => attributeNameMap.contains(r.name))) { + val rewrittenCondition = condition.transform { +case a: AttributeReference => attributeNameMap(a.name) + } + // We need to consider as False when the condition is NULL, otherwise we do not return those + // rows containing NULL which are instead filtered in the Except right plan + Some(Coalesce(Seq(rewrittenCondition, Literal.FalseLiteral))) } else { None } diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala index 52dc2e9fb076c..78d3969906e99 100644 --- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala @@ -20,11 +20,12 @@ package org.apache.spark.sql.catalyst.optimizer import org.apache.spark.sql.Row import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.dsl.plans._ -import org.apache.spark.sql.catalyst.expressions.{Alias, Literal, Not} +import org.apache.spark.sql.catalyst.expressions.{Alias, Coalesce, If, Literal, Not} import org.apache.spark.sql.catalyst.expressions.aggregate.First import org.apache.spark.sql.catalyst.plans.{LeftAnti, LeftSemi, PlanTest} import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.catalyst.rules.RuleExecutor +import org.apache.spark.sql.types.BooleanType class ReplaceOperatorSuite extends PlanTest { @@ -65,8 +66,7 @@ class ReplaceOperatorSuite extends PlanTest { val correctAnswer = Aggregate(table1.output, table1.output, -Filter(Not((attributeA.isNotNull && attributeB.isNotNull) && - (attributeA >= 2 && attributeB < 1)), +
[jira] [Commented] (SPARK-26366) Except with transform regression
[ https://issues.apache.org/jira/browse/SPARK-26366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734363#comment-16734363 ] Reynold Xin commented on SPARK-26366: - Please make sure you guys tag these tickets with the correctness label! > Except with transform regression > > > Key: SPARK-26366 > URL: https://issues.apache.org/jira/browse/SPARK-26366 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.2 >Reporter: Dan Osipov >Assignee: Marco Gaido >Priority: Major > Labels: correctness > Fix For: 2.3.3, 2.4.1, 3.0.0 > > > There appears to be a regression between Spark 2.2 and 2.3. Below is the code > to reproduce it: > > {code:java} > import org.apache.spark.sql.functions.col > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val inputDF = spark.sqlContext.createDataFrame( > spark.sparkContext.parallelize(Seq( > Row("0", "john", "smith", "j...@smith.com"), > Row("1", "jane", "doe", "j...@doe.com"), > Row("2", "apache", "spark", "sp...@apache.org"), > Row("3", "foo", "bar", null) > )), > StructType(List( > StructField("id", StringType, nullable=true), > StructField("first_name", StringType, nullable=true), > StructField("last_name", StringType, nullable=true), > StructField("email", StringType, nullable=true) > )) > ) > val exceptDF = inputDF.transform( toProcessDF => > toProcessDF.filter( > ( > col("first_name").isin(Seq("john", "jane"): _*) > and col("last_name").isin(Seq("smith", "doe"): _*) > ) > or col("email").isin(List(): _*) > ) > ) > inputDF.except(exceptDF).show() > {code} > Output with Spark 2.2: > {noformat} > +---+----------+---------+----------------+ > | id|first_name|last_name|           email| > +---+----------+---------+----------------+ > |  2|    apache|    spark|sp...@apache.org| > |  3|       foo|      bar|            null| > +---+----------+---------+----------------+{noformat} > Output with Spark 2.3: > {noformat} > +---+----------+---------+----------------+ > | id|first_name|last_name|           email| > +---+----------+---------+----------------+ > |  2|    apache|    spark|sp...@apache.org| > +---+----------+---------+----------------+{noformat} > Note, changing the last line to > {code:java} > 
inputDF.except(exceptDF.cache()).show() > {code} > produces identical output for both Spark 2.3 and 2.2 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26366) Except with transform regression
[ https://issues.apache.org/jira/browse/SPARK-26366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-26366: Labels: correctness (was: ) > Except with transform regression > > > Key: SPARK-26366 > URL: https://issues.apache.org/jira/browse/SPARK-26366 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.2 >Reporter: Dan Osipov >Assignee: Marco Gaido >Priority: Major > Labels: correctness > Fix For: 2.3.3, 2.4.1, 3.0.0 > > > There appears to be a regression between Spark 2.2 and 2.3. Below is the code > to reproduce it: > > {code:java} > import org.apache.spark.sql.functions.col > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val inputDF = spark.sqlContext.createDataFrame( > spark.sparkContext.parallelize(Seq( > Row("0", "john", "smith", "j...@smith.com"), > Row("1", "jane", "doe", "j...@doe.com"), > Row("2", "apache", "spark", "sp...@apache.org"), > Row("3", "foo", "bar", null) > )), > StructType(List( > StructField("id", StringType, nullable=true), > StructField("first_name", StringType, nullable=true), > StructField("last_name", StringType, nullable=true), > StructField("email", StringType, nullable=true) > )) > ) > val exceptDF = inputDF.transform( toProcessDF => > toProcessDF.filter( > ( > col("first_name").isin(Seq("john", "jane"): _*) > and col("last_name").isin(Seq("smith", "doe"): _*) > ) > or col("email").isin(List(): _*) > ) > ) > inputDF.except(exceptDF).show() > {code} > Output with Spark 2.2: > {noformat} > +---+----------+---------+----------------+ > | id|first_name|last_name|           email| > +---+----------+---------+----------------+ > |  2|    apache|    spark|sp...@apache.org| > |  3|       foo|      bar|            null| > +---+----------+---------+----------------+{noformat} > Output with Spark 2.3: > {noformat} > +---+----------+---------+----------------+ > | id|first_name|last_name|           email| > +---+----------+---------+----------------+ > |  2|    apache|    spark|sp...@apache.org| > +---+----------+---------+----------------+{noformat} > Note, changing the last line to > {code:java} > inputDF.except(exceptDF.cache()).show() > {code} > produces identical output for both 
Spark 2.3 and 2.2 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-26366) Except with transform regression
[ https://issues.apache.org/jira/browse/SPARK-26366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-26366: Comment: was deleted (was: asfgit closed pull request #23315: [SPARK-26366][SQL] ReplaceExceptWithFilter should consider NULL as False URL: https://github.com/apache/spark/pull/23315 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceExceptWithFilter.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceExceptWithFilter.scala index efd3944eba7f5..4996d24dfd298 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceExceptWithFilter.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceExceptWithFilter.scala @@ -36,7 +36,8 @@ import org.apache.spark.sql.catalyst.rules.Rule * Note: * Before flipping the filter condition of the right node, we should: * 1. Combine all it's [[Filter]]. - * 2. Apply InferFiltersFromConstraints rule (to take into account of NULL values in the condition). + * 2. Update the attribute references to the left node; + * 3. Add a Coalesce(condition, False) (to take into account of NULL values in the condition). 
*/ object ReplaceExceptWithFilter extends Rule[LogicalPlan] { @@ -47,23 +48,28 @@ object ReplaceExceptWithFilter extends Rule[LogicalPlan] { plan.transform { case e @ Except(left, right, false) if isEligible(left, right) => -val newCondition = transformCondition(left, skipProject(right)) -newCondition.map { c => - Distinct(Filter(Not(c), left)) -}.getOrElse { +val filterCondition = combineFilters(skipProject(right)).asInstanceOf[Filter].condition +if (filterCondition.deterministic) { + transformCondition(left, filterCondition).map { c => +Distinct(Filter(Not(c), left)) + }.getOrElse { +e + } +} else { e } } } - private def transformCondition(left: LogicalPlan, right: LogicalPlan): Option[Expression] = { -val filterCondition = - InferFiltersFromConstraints(combineFilters(right)).asInstanceOf[Filter].condition - -val attributeNameMap: Map[String, Attribute] = left.output.map(x => (x.name, x)).toMap - -if (filterCondition.references.forall(r => attributeNameMap.contains(r.name))) { - Some(filterCondition.transform { case a: AttributeReference => attributeNameMap(a.name) }) + private def transformCondition(plan: LogicalPlan, condition: Expression): Option[Expression] = { +val attributeNameMap: Map[String, Attribute] = plan.output.map(x => (x.name, x)).toMap +if (condition.references.forall(r => attributeNameMap.contains(r.name))) { + val rewrittenCondition = condition.transform { +case a: AttributeReference => attributeNameMap(a.name) + } + // We need to consider as False when the condition is NULL, otherwise we do not return those + // rows containing NULL which are instead filtered in the Except right plan + Some(Coalesce(Seq(rewrittenCondition, Literal.FalseLiteral))) } else { None } diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala index 3b1b2d588ef67..c8e15c7da763e 100644 --- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala @@ -20,11 +20,12 @@ package org.apache.spark.sql.catalyst.optimizer import org.apache.spark.sql.Row import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.dsl.plans._ -import org.apache.spark.sql.catalyst.expressions.{Alias, Literal, Not} +import org.apache.spark.sql.catalyst.expressions.{Alias, Coalesce, If, Literal, Not} import org.apache.spark.sql.catalyst.expressions.aggregate.First import org.apache.spark.sql.catalyst.plans.{LeftAnti, LeftSemi, PlanTest} import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.catalyst.rules.RuleExecutor +import org.apache.spark.sql.types.BooleanType class ReplaceOperatorSuite extends PlanTest { @@ -65,8 +66,7 @@ class ReplaceOperatorSuite extends PlanTest { val correctAnswer = Aggregate(table1.output, table1.output, -Filter(Not((attributeA.isNotNull && attributeB.isNotNull) && - (attributeA >= 2 && attributeB < 1)), +
[jira] [Issue Comment Deleted] (SPARK-26246) Infer timestamp types from JSON
[ https://issues.apache.org/jira/browse/SPARK-26246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-26246: Comment: was deleted (was: asfgit closed pull request #23201: [SPARK-26246][SQL] Inferring TimestampType from JSON URL: https://github.com/apache/spark/pull/23201 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala index 263e05de32075..d1bc00c08c1c6 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala @@ -28,7 +28,7 @@ import org.apache.spark.rdd.RDD import org.apache.spark.sql.catalyst.analysis.TypeCoercion import org.apache.spark.sql.catalyst.expressions.ExprUtils import org.apache.spark.sql.catalyst.json.JacksonUtils.nextUntil -import org.apache.spark.sql.catalyst.util.{DropMalformedMode, FailFastMode, ParseMode, PermissiveMode} +import org.apache.spark.sql.catalyst.util._ import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types._ import org.apache.spark.util.Utils @@ -37,6 +37,12 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable { private val decimalParser = ExprUtils.getDecimalParser(options.locale) + @transient + private lazy val timestampFormatter = TimestampFormatter( +options.timestampFormat, +options.timeZone, +options.locale) + /** * Infer the type of a collection of json records in three stages: * 1. 
Infer the type of each record @@ -115,13 +121,19 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable { // record fields' types have been combined. NullType - case VALUE_STRING if options.prefersDecimal => + case VALUE_STRING => +val field = parser.getText val decimalTry = allCatch opt { - val bigDecimal = decimalParser(parser.getText) + val bigDecimal = decimalParser(field) DecimalType(bigDecimal.precision, bigDecimal.scale) } -decimalTry.getOrElse(StringType) - case VALUE_STRING => StringType +if (options.prefersDecimal && decimalTry.isDefined) { + decimalTry.get +} else if ((allCatch opt timestampFormatter.parse(field)).isDefined) { + TimestampType +} else { + StringType +} case START_OBJECT => val builder = Array.newBuilder[StructField] diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JsonInferSchemaSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JsonInferSchemaSuite.scala new file mode 100644 index 0..9307f9b47b807 --- /dev/null +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JsonInferSchemaSuite.scala @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.catalyst.json + +import com.fasterxml.jackson.core.JsonFactory + +import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.catalyst.plans.SQLHelper +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types._ + +class JsonInferSchemaSuite extends SparkFunSuite with SQLHelper { + + def checkType(options: Map[String, String], json: String, dt: DataType): Unit = { +val jsonOptions = new JSONOptions(options, "UTC", "") +val inferSchema = new JsonInferSchema(jsonOptions) +val factory = new JsonFactory() +jsonOptions.setJacksonOptions(factory) +val parser = CreateJacksonParser.string(factory, json) +parser.nextToken() +val expectedType = StructType(Seq(StructField("a", dt, true))) + +assert(inferSchema.inferField(parser) === expectedType) + } + + def
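The inference order introduced by the patch above — try decimal only under `prefersDecimal`, then try the timestamp formatter, else fall back to string — can be sketched in Python. Here `infer_string_field` is a hypothetical helper, not Spark's API, and the timestamp format is an assumed ISO-like pattern:

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

def infer_string_field(field, prefers_decimal=False, ts_format="%Y-%m-%dT%H:%M:%S"):
    """Mirror the patched inference order: decimal, then timestamp, then string."""
    if prefers_decimal:
        try:
            d = Decimal(field)
            # Precision = number of digits, scale = negated exponent.
            return f"decimal({len(d.as_tuple().digits)},{-d.as_tuple().exponent})"
        except InvalidOperation:
            pass  # not a decimal; fall through to the timestamp check
    try:
        datetime.strptime(field, ts_format)
        return "timestamp"
    except ValueError:
        return "string"

print(infer_string_field("2019-01-04T17:22:00"))          # timestamp
print(infer_string_field("38.05", prefers_decimal=True))  # decimal(4,2)
print(infer_string_field("hello"))                        # string
```

The key change relative to the old code is that the `VALUE_STRING` branch no longer returns `StringType` unconditionally when `prefersDecimal` is off: the timestamp attempt now sits between the decimal attempt and the string fallback.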
[jira] [Comment Edited] (SPARK-26246) Infer timestamp types from JSON
[ https://issues.apache.org/jira/browse/SPARK-26246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734361#comment-16734361 ] Reynold Xin edited comment on SPARK-26246 at 1/4/19 5:22 PM: - Is there an option flag for this? This is a breaking change for people, and we need a way to fallback. was (Author: rxin): |Is there an option flag for this? This is a breaking change for people, and we need a way to fallback.| > Infer timestamp types from JSON > --- > > Key: SPARK-26246 > URL: https://issues.apache.org/jira/browse/SPARK-26246 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > Currently, TimestampType cannot be inferred from JSON. To parse JSON strings, > you have to specify the schema explicitly if the JSON input contains timestamps. This > ticket aims to extend JsonInferSchema to support such inference. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26246) Infer timestamp types from JSON
[ https://issues.apache.org/jira/browse/SPARK-26246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734361#comment-16734361 ] Reynold Xin commented on SPARK-26246: - Is there an option flag for this? This is a breaking change for people, and we need a way to fallback. > Infer timestamp types from JSON > --- > > Key: SPARK-26246 > URL: https://issues.apache.org/jira/browse/SPARK-26246 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > Currently, TimestampType cannot be inferred from JSON. To parse JSON strings, > you have to specify the schema explicitly if the JSON input contains timestamps. This > ticket aims to extend JsonInferSchema to support such inference. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26362) Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark contexts
[ https://issues.apache.org/jira/browse/SPARK-26362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734355#comment-16734355 ] Reynold Xin commented on SPARK-26362: - [~hyukjin.kwon] please make sure we add the releasenotes label to tickets like this. > Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark > contexts > --- > > Key: SPARK-26362 > URL: https://issues.apache.org/jira/browse/SPARK-26362 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: releasenotes > Fix For: 3.0.0 > > > Multiple Spark contexts are discouraged, and a warning has been issued for 4 years > (see SPARK-4180). > It could cause arbitrary and mysterious error cases. (Honestly, I didn't even > know Spark allowed it). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
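The single-context rule referenced here (and in the `assertNoOtherContextIsRunning` diff earlier in this digest) boils down to a lock-protected check at construction time. A toy Python model of that check — not Spark's actual implementation:

```python
import threading

class Context:
    """Toy model of a one-per-process context; illustrative only."""
    _lock = threading.Lock()
    _active = None  # the single running context, if any

    def __init__(self, name):
        self.name = name
        cls = type(self)
        with cls._lock:
            # With allowMultipleContexts removed, a second running context
            # is always an error rather than an optionally-ignored warning.
            if cls._active is not None:
                raise RuntimeError(
                    f"Only one Context may be running; '{cls._active.name}' is active")
            cls._active = self

    def stop(self):
        cls = type(self)
        with cls._lock:
            if cls._active is self:
                cls._active = None

first = Context("first")
try:
    Context("second")      # fails while "first" is running
except RuntimeError as e:
    print(e)
first.stop()
third = Context("third")   # allowed again once the old context is stopped
```

Before this change, a config flag let the second construction degrade to a warning; removing the flag makes the error path unconditional.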
[jira] [Updated] (SPARK-26536) Upgrade Mockito to 2.23.4
[ https://issues.apache.org/jira/browse/SPARK-26536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26536: -- Description: This issue upgrades Mockito from 1.10.19 to 2.23.4. The following changes are required. - Replace `org.mockito.Matchers` with `org.mockito.ArgumentMatchers` - Replace `anyObject` with `any` - Replace `getArgumentAt` with `getArgument` and add a type annotation. - Use the `isNull` matcher when `null` is passed. {code} saslHandler.channelInactive(null); -verify(handler).channelInactive(any(TransportClient.class)); +verify(handler).channelInactive(isNull()); {code} - Make and use a `doReturn` wrapper to avoid [SI-4775|https://issues.scala-lang.org/browse/SI-4775] {code} private def doReturn(value: Any) = org.mockito.Mockito.doReturn(value, Seq.empty: _*) {code} > Upgrade Mockito to 2.23.4 > - > > Key: SPARK-26536 > URL: https://issues.apache.org/jira/browse/SPARK-26536 > Project: Spark > Issue Type: Sub-task > Components: Build, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > This issue upgrades Mockito from 1.10.19 to 2.23.4. The following changes are > required. > - Replace `org.mockito.Matchers` with `org.mockito.ArgumentMatchers` > - Replace `anyObject` with `any` > - Replace `getArgumentAt` with `getArgument` and add a type annotation. > - Use the `isNull` matcher when `null` is passed. > {code} > saslHandler.channelInactive(null); > -verify(handler).channelInactive(any(TransportClient.class)); > +verify(handler).channelInactive(isNull()); > {code} > - Make and use a `doReturn` wrapper to avoid > [SI-4775|https://issues.scala-lang.org/browse/SI-4775] > {code} > private def doReturn(value: Any) = org.mockito.Mockito.doReturn(value, > Seq.empty: _*) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26536) Upgrade Mockito to 2.23.4
[ https://issues.apache.org/jira/browse/SPARK-26536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26536: Assignee: (was: Apache Spark) > Upgrade Mockito to 2.23.4 > - > > Key: SPARK-26536 > URL: https://issues.apache.org/jira/browse/SPARK-26536 > Project: Spark > Issue Type: Sub-task > Components: Build, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26536) Upgrade Mockito to 2.23.4
[ https://issues.apache.org/jira/browse/SPARK-26536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26536: Assignee: Apache Spark > Upgrade Mockito to 2.23.4 > - > > Key: SPARK-26536 > URL: https://issues.apache.org/jira/browse/SPARK-26536 > Project: Spark > Issue Type: Sub-task > Components: Build, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26536) Upgrade Mockito to 2.23.4
[ https://issues.apache.org/jira/browse/SPARK-26536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26536: -- Component/s: Tests > Upgrade Mockito to 2.23.4 > - > > Key: SPARK-26536 > URL: https://issues.apache.org/jira/browse/SPARK-26536 > Project: Spark > Issue Type: Sub-task > Components: Build, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26536) Upgrade Mockito to 2.23.4
Dongjoon Hyun created SPARK-26536: - Summary: Upgrade Mockito to 2.23.4 Key: SPARK-26536 URL: https://issues.apache.org/jira/browse/SPARK-26536 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26535) Parsing literals as DOUBLE instead of DECIMAL
Marco Gaido created SPARK-26535: --- Summary: Parsing literals as DOUBLE instead of DECIMAL Key: SPARK-26535 URL: https://issues.apache.org/jira/browse/SPARK-26535 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Marco Gaido As pointed out by [~dkbiswal]'s comment https://github.com/apache/spark/pull/22450#issuecomment-423082389, most other RDBMSs (DB2, Presto, Hive, MSSQL) treat literals as DOUBLE by default, whereas Spark currently treats them as DECIMAL. This is quite problematic, especially for operations on decimals, where we base our implementation on Hive/MSSQL. This ticket therefore proposes resolving literals as DOUBLE by default, with a config that allows reverting to the previous behavior.
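A minimal standalone sketch (not from the ticket, and not Spark code; the object name is invented) of why the default matters: exact DECIMAL arithmetic and binary DOUBLE arithmetic can disagree on the very same literal text.

```scala
object LiteralSemantics {
  def main(args: Array[String]): Unit = {
    // Parsed as exact decimals: 0.1 + 0.2 is exactly 0.3.
    val dec = BigDecimal("0.1") + BigDecimal("0.2")
    // Parsed as binary doubles: 0.1 + 0.2 is 0.30000000000000004.
    val dbl = 0.1 + 0.2
    println(dec == BigDecimal("0.3")) // true
    println(dbl == 0.3)               // false
  }
}
```

Which default is "right" depends on whether a query expects exact decimal semantics (Hive/MSSQL style) or floating-point semantics (DB2/Presto style), which is why the ticket keeps the old behavior behind a config.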
[jira] [Commented] (SPARK-26436) Dataframe resulting from a GroupByKey and flatMapGroups operation throws java.lang.UnsupportedException when groupByKey is applied on it.
[ https://issues.apache.org/jira/browse/SPARK-26436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734267#comment-16734267 ] Manish commented on SPARK-26436: [~viirya]: Thanks for the explanation. But should groupByKey on rows with GenericSchema not work, when a repartition on the same dataframe and same column succeeds. > Dataframe resulting from a GroupByKey and flatMapGroups operation throws > java.lang.UnsupportedException when groupByKey is applied on it. > - > > Key: SPARK-26436 > URL: https://issues.apache.org/jira/browse/SPARK-26436 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Manish >Priority: Major > > There seems to be a bug on groupByKey api for cases when it (groupByKey) is > applied on a DataSet resulting from a former groupByKey and flatMapGroups > invocation. > In such cases groupByKey throws the following exception: > java.lang.UnsupportedException: fieldIndex on a Row without schema is > undefined. > > Although the dataframe has a valid schema and a groupBy("key") or > repartition($"key") api calls on the same Dataframe and key succeed. 
> >
> > Following is the code that reproduces the scenario:
> >
> {code:scala}
> import org.apache.spark.sql.catalyst.encoders.RowEncoder
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
> import scala.collection.mutable.ListBuffer
>
> object Test {
>   def main(args: Array[String]): Unit = {
>     val values = List(List("1", "One"), List("1", "Two"), List("2", "Three"), List("2", "4"))
>       .map(x => (x(0), x(1)))
>     val session = SparkSession.builder.config("spark.master", "local").getOrCreate
>     import session.implicits._
>     val dataFrame = values.toDF
>     dataFrame.show()
>     dataFrame.printSchema()
>     val newSchema = StructType(dataFrame.schema.fields ++ Array(StructField("Count", IntegerType, false)))
>     val expr = RowEncoder.apply(newSchema)
>     val tranform = dataFrame.groupByKey(row => row.getAs[String]("_1")).flatMapGroups((key, inputItr) => {
>       val inputSeq = inputItr.toSeq
>       val length = inputSeq.size
>       var listBuff = new ListBuffer[Row]()
>       var counter: Int = 0
>       for (i <- 0 until length) {
>         counter += 1
>       }
>       for (i <- 0 until length) {
>         var x = inputSeq(i)
>         listBuff += Row.fromSeq(x.toSeq ++ Array[Int](counter))
>       }
>       listBuff.iterator
>     })(expr)
>     tranform.show
>     val newSchema1 = StructType(tranform.schema.fields ++ Array(StructField("Count1", IntegerType, false)))
>     val expr1 = RowEncoder.apply(newSchema1)
>     val tranform2 = tranform.groupByKey(row => row.getAs[String]("_1")).flatMapGroups((key, inputItr) => {
>       val inputSeq = inputItr.toSeq
>       val length = inputSeq.size
>       var listBuff = new ListBuffer[Row]()
>       var counter: Int = 0
>       for (i <- 0 until length) {
>         counter += 1
>       }
>       for (i <- 0 until length) {
>         var x = inputSeq(i)
>         listBuff += Row.fromSeq(x.toSeq ++ Array[Int](counter))
>       }
>       listBuff.iterator
>     })(expr1)
>     tranform2.show
>   }
> }
>
> Test.main(null)
> {code}
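A minimal standalone model of the failure mode described above (not Spark's actual Row implementation; all names here are invented for illustration): field lookup by name is only defined when a row carries a schema, and rows rebuilt generically carry none.

```scala
// Hypothetical stand-in for a row type whose schema is optional.
final case class SimpleRow(values: Seq[Any], fieldNames: Option[Seq[String]]) {
  def fieldIndex(name: String): Int = fieldNames match {
    case Some(names) if names.contains(name) => names.indexOf(name)
    case Some(_) => throw new IllegalArgumentException(s"no field $name")
    case None => throw new UnsupportedOperationException(
      "fieldIndex on a row without schema is undefined")
  }
  def getAs[T](name: String): T = values(fieldIndex(name)).asInstanceOf[T]
}

object SchemaLessDemo {
  def main(args: Array[String]): Unit = {
    val withSchema = SimpleRow(Seq("1", "One"), Some(Seq("_1", "_2")))
    println(withSchema.getAs[String]("_1")) // 1
    // A row built without a schema, analogous to Row.fromSeq(...):
    val withoutSchema = SimpleRow(Seq("1", "One", 2), None)
    // withoutSchema.getAs[String]("_1") would throw UnsupportedOperationException
  }
}
```

This mirrors the report: lookups by position (or a plain repartition on a column) can still work, while lookups by field name fail once the schema is gone.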
[jira] [Commented] (SPARK-26436) Dataframe resulting from a GroupByKey and flatMapGroups operation throws java.lang.UnsupportedException when groupByKey is applied on it.
[ https://issues.apache.org/jira/browse/SPARK-26436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734282#comment-16734282 ] Liang-Chi Hsieh commented on SPARK-26436: - Sorry I don't know what you mean "should groupByKey on rows with GenericSchema not work, when a repartition on the same dataframe and same column succeeds.". Can you elaborate it? > Dataframe resulting from a GroupByKey and flatMapGroups operation throws > java.lang.UnsupportedException when groupByKey is applied on it. > - > > Key: SPARK-26436 > URL: https://issues.apache.org/jira/browse/SPARK-26436 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Manish >Priority: Major >
[jira] [Created] (SPARK-26534) Closure Cleaner Bug
sam created SPARK-26534: --- Summary: Closure Cleaner Bug Key: SPARK-26534 URL: https://issues.apache.org/jira/browse/SPARK-26534 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.1 Reporter: sam I've found a strange combination of closures where the closure cleaner doesn't seem to be smart enough to remove a reference that is not used. I.e. we get an `org.apache.spark.SparkException: Task not serializable` for a task that is perfectly serializable. In the example below, the only `val` actually needed by the closure of the `map` is `foo`, but it tries to serialise `thingy`. What is odd is that changing this code in a number of subtle ways eliminates the error, which I've tried to highlight using comments inline.
{code:java}
import org.apache.spark.sql._

object Test {
  val sparkSession: SparkSession =
    SparkSession.builder.master("local").appName("app").getOrCreate()

  def apply(): Unit = {
    import sparkSession.implicits._
    val landedData: Dataset[String] =
      sparkSession.sparkContext.makeRDD(Seq("foo", "bar")).toDS()
    // thingy has to be in this outer scope to reproduce; if in someFunc, cannot reproduce
    val thingy: Thingy = new Thingy
    // If not wrapped in someFunc, cannot reproduce
    val someFunc = () => {
      // If we don't reference foo inside the closure (e.g. just use the identity function), cannot reproduce
      val foo: String = "foo"
      thingy.run(block = () => {
        landedData.map(r => {
          r + foo
        })
        .count()
      })
    }
    someFunc()
  }
}

class Thingy {
  def run[R](block: () => R): R = {
    block()
  }
}
{code}
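A common way to debug "Task not serializable" is to attempt plain Java serialization of the closure yourself, which is essentially what Spark does before shipping a task. Below is a standalone sketch of that technique with no Spark dependency; `Heavy` and `SerializableCheck` are invented names, and the lambda behavior assumes Scala 2.12+ (where function literals are serializable).

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Deliberately not Serializable, standing in for `thingy` above.
class Heavy

object SerializableCheck {
  // Returns true if `obj` survives plain Java serialization, the same
  // mechanism Spark uses to ship task closures to executors.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val foo = "foo"
    // Captures only a String; should serialize fine on Scala 2.12+.
    val clean: String => String = r => r + foo
    val heavy = new Heavy
    // Captures the non-serializable Heavy; serialization should fail.
    val dirty: String => String = r => r + heavy.toString
    println(isSerializable(clean))
    println(isSerializable(dirty))
  }
}
```

Running such a check on the inner `map` closure versus the outer `someFunc` is one way to pinpoint which enclosing scope is dragging `thingy` into the serialized graph.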
[jira] [Commented] (SPARK-8602) Shared cached DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734216#comment-16734216 ] Hyukjin Kwon commented on SPARK-8602: - I think we now have one Spark context and multiple Spark sessions. If we use multiple Spark sessions, then we're able to do this. Let me leave this resolved. Please reopen this if I am mistaken. > Shared cached DataFrames > > > Key: SPARK-8602 > URL: https://issues.apache.org/jira/browse/SPARK-8602 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.4.0 >Reporter: John Muller >Priority: Major > > Currently, the only way I can think of to share HiveContexts, SparkContexts, > or cached DataFrames is to use spark-jobserver and spark-jobserver-extras: > https://gist.github.com/anonymous/578385766261d6fa7196#file-exampleshareddf-scala > But HiveServer2 users over plain JDBC cannot access the shared dataframe. > The request is to add this directly to Spark SQL and treat it like a shared temp > table, e.g. > SELECT a, b, c > FROM TableA > CACHE DATAFRAME > This would be very useful for Rollups and Cubes, though I'm not sure what > this may mean for the HiveMetaStore.
[jira] [Resolved] (SPARK-8427) Incorrect ACL checking for partitioned table in Spark SQL-1.4
[ https://issues.apache.org/jira/browse/SPARK-8427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-8427. - Resolution: Not A Problem We're heading for Spark 3.0 and the issue was against Spark 1.4. The code has changed drastically and the information here is obsolete. Please reopen if this problem persists in later Spark versions as well. > Incorrect ACL checking for partitioned table in Spark SQL-1.4 > - > > Key: SPARK-8427 > URL: https://issues.apache.org/jira/browse/SPARK-8427 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 > Environment: CentOS 6 & OS X 10.9.5, Hive-0.13.1, Spark-1.4, Hadoop > 2.6.0 >Reporter: Karthik Subramanian >Priority: Critical > Labels: security > > Problem statement: > While querying a partitioned table using Spark SQL (version 1.4.0), an > access denied exception is observed on the partition the user doesn't belong > to (user permissions are controlled using HDFS ACLs). The same works > correctly in Hive. > Use case: to address multitenancy. > Consider a table containing multiple customers, each customer with > multiple facilities. The table is partitioned by customer and facility. A > user belonging to one facility will not have access to other facilities. This is > enforced using HDFS ACLs on the corresponding directories. When querying the > table as 'user1' belonging to 'facility1' and 'customer1' on a particular > partition (using a 'where' clause), only access to the corresponding directory > should be verified, not the entire table. > The above use case works as expected when using the HIVE client, versions 0.13.1 &
> The query used: select count(*) from customertable where customer=‘customer1’ > and facility=‘facility1’ > Below is the exception received in Spark-shell: > org.apache.hadoop.security.AccessControlException: Permission denied: > user=user1, access=READ_EXECUTE, > inode="/data/customertable/customer=customer2/facility=facility2”:root:supergroup:drwxrwx---:group::r-x,group:facility2:rwx > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkAccessAcl(FSPermissionChecker.java:351) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:253) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:185) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6512) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6494) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPathAccess(FSNamesystem.java:6419) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListingInt(FSNamesystem.java:4954) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListing(FSNamesystem.java:4915) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getListing(NameNodeRpcServer.java:826) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getListing(ClientNamenodeProtocolServerSideTranslatorPB.java:612) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at 
javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) > at > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) > at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1971) > at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1952) > at >
[jira] [Commented] (SPARK-9218) Falls back to getAllPartitions when getPartitionsByFilter fails
[ https://issues.apache.org/jira/browse/SPARK-9218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734218#comment-16734218 ] Hyukjin Kwon commented on SPARK-9218: - ping [~lian cheng] > Falls back to getAllPartitions when getPartitionsByFilter fails > --- > > Key: SPARK-9218 > URL: https://issues.apache.org/jira/browse/SPARK-9218 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Major > > [PR #7492|https://github.com/apache/spark/pull/7492] enables Hive partition > predicate push-down by leveraging {{Hive.getPartitionsByFilter}}. Although > this optimization is pretty effective, we did observe some failures like this: > {noformat} > java.sql.SQLDataException: Invalid character string format for type DECIMAL. > at > org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown > Source) > at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(Unknown > Source) > at > org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(Unknown > Source) > at > org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeQuery(Unknown Source) > at > com.jolbox.bonecp.PreparedStatementHandle.executeQuery(PreparedStatementHandle.java:174) > at > org.datanucleus.store.rdbms.ParamLoggingPreparedStatement.executeQuery(ParamLoggingPreparedStatement.java:381) > at > org.datanucleus.store.rdbms.SQLController.executeStatementQuery(SQLController.java:504) > at > 
org.datanucleus.store.rdbms.query.SQLQuery.performExecute(SQLQuery.java:280) > at org.datanucleus.store.query.Query.executeQuery(Query.java:1786) > at > org.datanucleus.store.query.AbstractSQLQuery.executeWithArray(AbstractSQLQuery.java:339) > at org.datanucleus.api.jdo.JDOQuery.executeWithArray(JDOQuery.java:312) > at > org.apache.hadoop.hive.metastore.MetaStoreDirectSql.getPartitionsViaSqlFilterInternal(MetaStoreDirectSql.java:300) > at > org.apache.hadoop.hive.metastore.MetaStoreDirectSql.getPartitionsViaSqlFilter(MetaStoreDirectSql.java:211) > at > org.apache.hadoop.hive.metastore.ObjectStore$4.getSqlResult(ObjectStore.java:2320) > at > org.apache.hadoop.hive.metastore.ObjectStore$4.getSqlResult(ObjectStore.java:2317) > at > org.apache.hadoop.hive.metastore.ObjectStore$GetHelper.run(ObjectStore.java:2208) > at > org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilterInternal(ObjectStore.java:2317) > at > org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilter(ObjectStore.java:2165) > at sun.reflect.GeneratedMethodAccessor126.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:108) > at com.sun.proxy.$Proxy21.getPartitionsByFilter(Unknown Source) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_partitions_by_filter(HiveMetaStore.java:3760) > at sun.reflect.GeneratedMethodAccessor125.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105) > at com.sun.proxy.$Proxy23.get_partitions_by_filter(Unknown Source) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsByFilter(HiveMetaStoreClient.java:903) > at 
sun.reflect.GeneratedMethodAccessor124.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) > at com.sun.proxy.$Proxy24.listPartitionsByFilter(Unknown Source) > at > org.apache.hadoop.hive.ql.metadata.Hive.getPartitionsByFilter(Hive.java:1944) > at sun.reflect.GeneratedMethodAccessor123.invoke(Unknown Source) > at >
[jira] [Resolved] (SPARK-8370) Add API for data sources to register databases
[ https://issues.apache.org/jira/browse/SPARK-8370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-8370. - Resolution: Not A Problem > Add API for data sources to register databases > -- > > Key: SPARK-8370 > URL: https://issues.apache.org/jira/browse/SPARK-8370 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Santiago M. Mola >Priority: Major > > This API would allow registering a database with a data source instead of > just a table. Registering a data source database would register all its tables > and keep the catalog updated. The catalog could delegate to the data > source lookups of tables for a database registered with this API.
[jira] [Resolved] (SPARK-8602) Shared cached DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-8602. - Resolution: Not A Problem > Shared cached DataFrames > > > Key: SPARK-8602 > URL: https://issues.apache.org/jira/browse/SPARK-8602 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.4.0 >Reporter: John Muller >Priority: Major >
[jira] [Resolved] (SPARK-8577) ScalaReflectionLock.synchronized can cause deadlock
[ https://issues.apache.org/jira/browse/SPARK-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-8577. - Resolution: Not A Problem I'm leaving this resolved since we removed Scala 2.10 support. > ScalaReflectionLock.synchronized can cause deadlock > --- > > Key: SPARK-8577 > URL: https://issues.apache.org/jira/browse/SPARK-8577 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: koert kuipers >Priority: Minor > > Just a heads up: I was doing some basic coding using DataFrame, Row, > StructType, etc. in my own project and I ended up with deadlocks in my sbt > tests due to the usage of ScalaReflectionLock.synchronized in the Spark SQL > code. > The issue went away when I changed my build to have: > parallelExecution in Test := false > so that the tests run consecutively.
[jira] [Resolved] (SPARK-8403) Pruner partition won't effective when partition field and fieldSchema exist in sql predicate
[ https://issues.apache.org/jira/browse/SPARK-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-8403. - Resolution: Incomplete Leaving this resolved due to the reporter's inactivity. > Pruner partition won't effective when partition field and fieldSchema exist > in sql predicate > > > Key: SPARK-8403 > URL: https://issues.apache.org/jira/browse/SPARK-8403 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Hong Shen >Priority: Major > > When a partition field and a regular fieldSchema both appear in the SQL > predicate, partition pruning is not effective. > Here is the sql, > {code} > select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r > where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)>16) > {code} > Table t_dw_qqlive_209026 is partitioned by imp_date; itimestamp is a > fieldSchema in t_dw_qqlive_209026. > When run on Hive, it will only scan data in partition 20150615, but when run on > Spark SQL, it will scan the whole table t_dw_qqlive_209026.
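A standalone sketch of the intended behavior (made-up types, not Spark/Hive internals): only predicates over partition columns should drive directory pruning, while predicates over regular columns are evaluated later per row rather than disabling pruning.

```scala
object PruningSketch {
  // Hypothetical model: a partition is a directory keyed by its imp_date value.
  case class Partition(impDate: Int, files: Seq[String])

  // Prune using the partition-column predicate only; non-partition
  // predicates (like hour(itimestamp) > 16) must not force a full scan.
  def prunePartitions(parts: Seq[Partition], datePred: Int => Boolean): Seq[Partition] =
    parts.filter(p => datePred(p.impDate))

  def main(args: Array[String]): Unit = {
    val table = Seq(Partition(20150614, Seq("a")), Partition(20150615, Seq("b")))
    val scanned = prunePartitions(table, _ == 20150615)
    println(scanned.map(_.impDate)) // List(20150615)
  }
}
```

The bug report amounts to Spark SQL behaving as if `prunePartitions` were skipped whenever the predicate also mentions a non-partition column.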
[jira] [Commented] (SPARK-8370) Add API for data sources to register databases
[ https://issues.apache.org/jira/browse/SPARK-8370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734205#comment-16734205 ] Hyukjin Kwon commented on SPARK-8370: - I think we're working on multiple catalog support. I am resolving this. Please reopen this if I am mistaken. Otherwise, let's track this in DataSourceV2. > Add API for data sources to register databases > -- > > Key: SPARK-8370 > URL: https://issues.apache.org/jira/browse/SPARK-8370 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Santiago M. Mola >Priority: Major > > This API would allow registering a database with a data source instead of > just a table. Registering a data source database would register all its tables > and keep the catalog updated. The catalog could delegate to the data > source lookups of tables for a database registered with this API.
[jira] [Commented] (SPARK-7754) [SQL] Use PartialFunction literals instead of objects in Catalyst
[ https://issues.apache.org/jira/browse/SPARK-7754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734197#comment-16734197 ] Hyukjin Kwon commented on SPARK-7754: - I don't think we're going to change this if it mainly targets readability. It's been a few years already and the code looks okay. I'm resolving this since there doesn't seem to be much interest in it. Please reopen this if I am mistaken. > [SQL] Use PartialFunction literals instead of objects in Catalyst > - > > Key: SPARK-7754 > URL: https://issues.apache.org/jira/browse/SPARK-7754 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Edoardo Vacchi >Priority: Minor > > Catalyst rules extend two distinct "rule" types: {{Rule[LogicalPlan]}} and > {{Strategy}} (which is an alias for {{GenericStrategy[SparkPlan]}}). > The distinction is fairly subtle: in the end, both rule types are supposed to > define a method {{apply(plan: LogicalPlan)}} > (where LogicalPlan is either Logical- or Spark-) which returns a transformed > plan (or a sequence thereof, in the case > of Strategy). > Ceremonies aside, the body of such a method is always of the kind: > {code:java} > def apply(plan: PlanType) = plan match pf > {code} > where `pf` would be some `PartialFunction` of the PlanType: > {code:java} > val pf = { > case ... => ... > } > {code} > This JIRA is a proposal to introduce utility methods to > a) reduce the boilerplate needed to define rewrite rules and > b) turn them back into what they essentially represent: function types. > These changes would be backwards compatible, and would greatly help in > understanding what the code does. The current use of objects is redundant and > possibly confusing.
> *{{Rule[LogicalPlan]}}*
> a) Introduce the utility object
> {code:java}
> object rule
>   def rule(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] =
>     new Rule[LogicalPlan] {
>       def apply (plan: LogicalPlan): LogicalPlan = plan transform pf
>     }
>   def named(name: String)(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] =
>     new Rule[LogicalPlan] {
>       override val ruleName = name
>       def apply (plan: LogicalPlan): LogicalPlan = plan transform pf
>     }
> {code}
> b) progressively replace the boilerplate-y object definitions; e.g.
> {code:java}
> object MyRewriteRule extends Rule[LogicalPlan] {
>   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
>     case ... => ...
>   }
> {code}
> with
> {code:java}
> // define a Rule[LogicalPlan]
> val MyRewriteRule = rule {
>   case ... => ...
> }
> {code}
> and/or:
> {code:java}
> // define a named Rule[LogicalPlan]
> val MyRewriteRule = rule.named("My rewrite rule") {
>   case ... => ...
> }
> {code}
> *Strategies*
> A similar solution could be applied to shorten the code for Strategies, which
> are total functions only because they are all supposed to manage the default
> case, possibly returning `Nil`. In this case we might introduce the following
> utility:
> {code:java}
> object strategy {
>   /**
>    * Generate a Strategy from a PartialFunction[LogicalPlan, SparkPlan].
>    * The partial function must therefore return *one single* SparkPlan for each case.
>    * The method will automatically wrap them in a [[Seq]].
>    * Unhandled cases will automatically return Seq.empty
>    */
>   def apply(pf: PartialFunction[LogicalPlan, SparkPlan]): Strategy =
>     new Strategy {
>       def apply(plan: LogicalPlan): Seq[SparkPlan] =
>         if (pf.isDefinedAt(plan)) Seq(pf.apply(plan)) else Seq.empty
>     }
>   /**
>    * Generate a Strategy from a PartialFunction[ LogicalPlan, Seq[SparkPlan] ].
>    * The partial function must therefore return a Seq[SparkPlan] for each case.
>    * Unhandled cases will automatically return Seq.empty
>    */
>   def seq(pf: PartialFunction[LogicalPlan, Seq[SparkPlan]]): Strategy =
>     new Strategy {
>       def apply(plan: LogicalPlan): Seq[SparkPlan] =
>         if (pf.isDefinedAt(plan)) pf.apply(plan) else Seq.empty[SparkPlan]
>     }
> }
> {code}
> Usage:
> {code:java}
> val mystrategy = strategy { case ... => ... }
> val seqstrategy = strategy.seq { case ... => ... }
> {code}
> *Further possible improvements:*
> Making the utility methods `implicit`, thereby further reducing the rewrite
> rules to:
> {code:java}
> // define a PartialFunction[LogicalPlan, LogicalPlan]
> // the implicit would convert it into a Rule[LogicalPlan] at the use sites
> val MyRewriteRule = {
>   case ... => ...
> }
> {code}
> *Caveats*
> Because of the way objects are initialized vs. vals, it might be necessary to
> reorder instructions so that vals are actually initialized before they
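The pattern proposed above can be exercised without any Spark dependency. The following standalone sketch applies the same idea to a toy plan type (all names here are invented for illustration, not Catalyst's actual API): a `PartialFunction` is wrapped into a total rewrite that is the identity where undefined, and into a strategy that yields `Seq.empty` for unhandled cases.

```scala
object RulePattern {
  // Toy stand-in for LogicalPlan.
  sealed trait Plan
  case class Add(l: Int, r: Int) extends Plan
  case class Lit(v: Int) extends Plan

  // Analogue of the proposal's `rule` utility: identity where the
  // PartialFunction is not defined.
  def rule(pf: PartialFunction[Plan, Plan]): Plan => Plan =
    plan => if (pf.isDefinedAt(plan)) pf(plan) else plan

  // Analogue of the proposal's `strategy` utility: unhandled cases
  // yield Seq.empty instead of a default match arm.
  def strategy(pf: PartialFunction[Plan, Plan]): Plan => Seq[Plan] =
    plan => if (pf.isDefinedAt(plan)) Seq(pf(plan)) else Seq.empty

  // A rewrite rule is now just a PartialFunction literal.
  val constantFold = rule { case Add(l, r) => Lit(l + r) }
}
```

For example, `RulePattern.constantFold(RulePattern.Add(1, 2))` rewrites to `Lit(3)`, while a `Lit` node passes through unchanged, which is exactly the boilerplate the ticket wants to eliminate from each `object ... extends Rule[LogicalPlan]` definition.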