[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

[ https://issues.apache.org/jira/browse/SPARK-26543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chenliang updated SPARK-26543:
------------------------------
Description:
In Spark SQL, when adaptive execution is enabled via 'set spark.sql.adaptive.enabled=true', the ExchangeCoordinator is introduced to determine the number of post-shuffle partitions. Under certain conditions, however, the coordinator does not perform well: some tasks are always retained and run with a Shuffle Read Size / Records of 0.0B/0. Increasing spark.sql.adaptive.shuffle.targetPostShuffleInputSize can work around this, but that is unreasonable, since targetPostShuffleInputSize should not be set too large. As shown below:

!image-2019-01-05-13-18-30-487.png!

The ExchangeCoordinator could filter out these useless (0B) partitions automatically.

was: the same description, with !screenshot-1.png! in place of !image-2019-01-05-13-18-30-487.png!

> Support the coordinator to determine post-shuffle partitions more reasonably
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-26543
>                 URL: https://issues.apache.org/jira/browse/SPARK-26543
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>            Reporter: chenliang
>            Priority: Major
>             Fix For: 2.3.0
>
>         Attachments: image-2019-01-05-13-18-30-487.png

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
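The behavior the report describes, and the proposed fix, can be sketched outside Spark. The following is an illustrative model only, not Spark's actual ExchangeCoordinator code: adjacent pre-shuffle partitions are merged until the running total would exceed the target input size (the role of targetPostShuffleInputSize), and the hypothetical `skip_empty` flag stands in for the proposed filtering of 0-byte partitions.

```python
# Illustrative model of post-shuffle partition coalescing (not Spark source).
# Each pre-shuffle partition has a byte size; adjacent partitions are merged
# until the running total would exceed the target input size.

def coalesce_partitions(sizes, target, skip_empty=False):
    """Return lists of pre-shuffle partition indices, one list per post-shuffle task."""
    if skip_empty:
        # Proposed behavior: drop 0-byte partitions before grouping, so no
        # post-shuffle task ends up reading 0.0B/0 records.
        indexed = [(i, s) for i, s in enumerate(sizes) if s > 0]
    else:
        indexed = list(enumerate(sizes))
    groups, current, current_bytes = [], [], 0
    for i, s in indexed:
        if current and current_bytes + s > target:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(i)
        current_bytes += s
    if current:
        groups.append(current)
    return groups

sizes = [64, 0, 0, 30, 40, 0, 100]  # bytes per pre-shuffle partition
print(coalesce_partitions(sizes, target=70))
print(coalesce_partitions(sizes, target=70, skip_empty=True))
```

With filtering enabled, the empty partitions never reach a post-shuffle task, so raising the target size just to absorb them becomes unnecessary, which is the point of the proposal.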
[jira] [Updated] (SPARK-26544) escape string when serialize map/array make it a valid json (keep alignment with hive)

[ https://issues.apache.org/jira/browse/SPARK-26544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

EdisonWang updated SPARK-26544:
-------------------------------
Summary: escape string when serialize map/array make it a valid json (keep alignment with hive)  (was: the string serialized from map/array type is not a valid json (while hive is))

> escape string when serialize map/array make it a valid json (keep alignment with hive)
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-26544
>                 URL: https://issues.apache.org/jira/browse/SPARK-26544
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: EdisonWang
>            Priority: Major
>
> When reading a hive table with a map/array-typed column, the string serialized by the Spark thrift server is not valid json, while Hive's is. For example, selecting a field whose type is map, the Spark thrift server returns
>
> {code:java}
> {"author_id":"123","log_pb":"{"impr_id":"20181231"}","request_id":"001"}
> {code}
>
> while the Hive thrift server returns
>
> {code:java}
> {"author_id":"123", "log_pb":"{\"impr_id\":\"20181231\"}","request_id":"001"}
> {code}
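The difference between the two outputs is whether the quotes inside the embedded string value are escaped. A minimal Python sketch (using only the stdlib `json` module, with the example values from the report) shows why the unescaped form is not parseable while the Hive-style escaped form round-trips:

```python
import json

# Inner value: a string field whose content is itself JSON text.
inner = '{"impr_id":"20181231"}'
row = {"author_id": "123", "log_pb": inner, "request_id": "001"}

# Naive concatenation without escaping the inner quotes -> invalid JSON,
# analogous to the Spark thrift server output in the report.
naive = '{"author_id":"123","log_pb":"%s","request_id":"001"}' % inner
try:
    json.loads(naive)
    valid = True
except json.JSONDecodeError:
    valid = False
print("naive output is valid JSON:", valid)  # False

# A proper serializer escapes the embedded quotes (\"), like Hive's output.
escaped = json.dumps(row)
print(escaped)
print("escaped output round-trips:", json.loads(escaped)["log_pb"] == inner)  # True
```

The fix the issue title asks for amounts to the second path: escape the serialized map/array string before embedding it, so consumers can feed the result to any JSON parser.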
[jira] [Updated] (SPARK-26544) escape string when serialize map/array to make it a valid json (keep alignment with hive)

EdisonWang updated SPARK-26544:
-------------------------------
Summary: escape string when serialize map/array to make it a valid json (keep alignment with hive)  (was: escape string when serialize map/array make it a valid json (keep alignment with hive))
[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang updated SPARK-26543:
------------------------------
Attachment: image-2019-01-05-13-18-30-487.png
[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang updated SPARK-26543:
------------------------------
Attachment: screenshot-1.png
[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang updated SPARK-26543:
------------------------------
Attachment: (was: screenshot-1.png)
[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang updated SPARK-26543:
------------------------------
Description: removed the {noformat} wrapping around the embedded !screenshot-1.png! reference.
[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang updated SPARK-26543:
------------------------------
Description: added !screenshot-1.png! (wrapped in {noformat} tags).
[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang updated SPARK-26543:
------------------------------
Description: removed the embedded screenshot references (!screenshot-1.png! and !15_24_38__12_27_2018.jpg!).
[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang updated SPARK-26543:
------------------------------
Description: added !screenshot-1.png! alongside the existing !15_24_38__12_27_2018.jpg! reference.
[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang updated SPARK-26543:
------------------------------
Attachment: (was: 15_24_38__12_27_2018.jpg)
[jira] [Created] (SPARK-26544) the string serialized from map/array type is not a valid json (while hive is)

EdisonWang created SPARK-26544:
-------------------------------
             Summary: the string serialized from map/array type is not a valid json (while hive is)
                 Key: SPARK-26544
                 URL: https://issues.apache.org/jira/browse/SPARK-26544
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: EdisonWang
[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang updated SPARK-26543:
------------------------------
Description: re-saved (no substantive change).
[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang updated SPARK-26543:
------------------------------
Description: added the embedded screenshot !15_24_38__12_27_2018.jpg!.
[jira] [Updated] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang updated SPARK-26543:
------------------------------
Attachment: 15_24_38__12_27_2018.jpg
[jira] [Created] (SPARK-26543) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang created SPARK-26543:
------------------------------
             Summary: Support the coordinator to determine post-shuffle partitions more reasonably
                 Key: SPARK-26543
                 URL: https://issues.apache.org/jira/browse/SPARK-26543
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.2, 2.3.1, 2.3.0, 2.2.2, 2.2.1, 2.2.0
            Reporter: chenliang
             Fix For: 2.3.0
[jira] [Created] (SPARK-26542) Support the coordinator to determine post-shuffle partitions more reasonably

chenliang created SPARK-26542:
------------------------------
             Summary: Support the coordinator to determine post-shuffle partitions more reasonably
                 Key: SPARK-26542
                 URL: https://issues.apache.org/jira/browse/SPARK-26542
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.2, 2.3.1, 2.3.0, 2.2.2, 2.2.1, 2.2.0
            Reporter: chenliang
             Fix For: 2.3.0
[jira] [Resolved] (SPARK-26536) Upgrade Mockito to 2.23.4
[ https://issues.apache.org/jira/browse/SPARK-26536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-26536. --- Resolution: Fixed Assignee: Dongjoon Hyun (was: Apache Spark) Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/23452 > Upgrade Mockito to 2.23.4 > - > > Key: SPARK-26536 > URL: https://issues.apache.org/jira/browse/SPARK-26536 > Project: Spark > Issue Type: Sub-task > Components: Build, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > This issue upgrades Mockito from 1.10.19 to 2.23.4. The following changes are > required. > - Replace `org.mockito.Matchers` with `org.mockito.ArgumentMatchers` > - Replace `anyObject` with `any` > - Replace `getArgumentAt` with `getArgument` and add a type annotation. > - Use the `isNull` matcher when `null` is passed. > {code} > saslHandler.channelInactive(null); > -verify(handler).channelInactive(any(TransportClient.class)); > +verify(handler).channelInactive(isNull()); > {code} > - Make and use a `doReturn` wrapper to avoid > [SI-4775|https://issues.scala-lang.org/browse/SI-4775] > {code} > private def doReturn(value: Any) = org.mockito.Mockito.doReturn(value, > Seq.empty: _*) > {code}
[jira] [Assigned] (SPARK-26541) Add `-Pdocker-integration-tests` to `dev/scalastyle`
[ https://issues.apache.org/jira/browse/SPARK-26541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26541: Assignee: Apache Spark > Add `-Pdocker-integration-tests` to `dev/scalastyle` > > > Key: SPARK-26541 > URL: https://issues.apache.org/jira/browse/SPARK-26541 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > This issue makes `scalastyle` check the `docker-integration-tests` module and > fixes one error.
[jira] [Assigned] (SPARK-26541) Add `-Pdocker-integration-tests` to `dev/scalastyle`
[ https://issues.apache.org/jira/browse/SPARK-26541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26541: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-26541) Add `-Pdocker-integration-tests` to `dev/scalastyle`
Dongjoon Hyun created SPARK-26541: - Summary: Add `-Pdocker-integration-tests` to `dev/scalastyle` Key: SPARK-26541 URL: https://issues.apache.org/jira/browse/SPARK-26541 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.0 Reporter: Dongjoon Hyun This issue makes `scalastyle` check the `docker-integration-tests` module and fixes one error.
[jira] [Updated] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26540: -- Affects Version/s: 2.0.2 2.1.3 > Requirement failed when reading numeric arrays from PostgreSQL > -- > > Key: SPARK-26540 > URL: https://issues.apache.org/jira/browse/SPARK-26540 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.2, 2.4.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > This bug was reported in spark-user: > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html > To reproduce this; > {code} > // Creates a table in a PostgreSQL shell > postgres=# CREATE TABLE t (v numeric[], d numeric); > CREATE TABLE > postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555); > INSERT 0 1 > postgres=# SELECT * FROM t; > v |d > -+-- > {.222,.332} | 222.4555 > (1 row) > postgres=# \d t > Table "public.t" > Column | Type| Modifiers > +---+--- > v | numeric[] | > d | numeric | > // Then, reads it in Spark > ./bin/spark-shell --jars=postgresql-42.2.4.jar -v > scala> import java.util.Properties > scala> val options = new Properties(); > scala> options.setProperty("driver", "org.postgresql.Driver") > scala> options.setProperty("user", "maropu") > scala> options.setProperty("password", "") > scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options) > scala> pgTable.printSchema > root > |-- v: array (nullable = true) > ||-- element: decimal(0,0) (containsNull = true) > |-- d: decimal(38,18) (nullable = true) > scala> pgTable.show > 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 > exceeds max precision 0 > at scala.Predef$.require(Predef.scala:281) > at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116) > at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465) > ... 
> {code} > I looked over the related code, and I think we need more logic to handle > numeric arrays: > https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41
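The failure above happens because the unconstrained PostgreSQL `numeric[]` element type is mapped to `decimal(0,0)`, so any actual value then exceeds the declared precision. The following is a hypothetical sketch of the shape such a fix could take, with illustrative names (`DecimalBounds`, `boundedDecimal`), not the actual PostgresDialect change: when JDBC metadata reports precision 0, fall back to Spark's system-default `decimal(38,18)` rather than the unusable `decimal(0,0)`.

```java
// Hypothetical sketch (illustrative names, not the actual PostgresDialect fix):
// when JDBC metadata reports precision 0 for an unconstrained NUMERIC, fall
// back to a system-default decimal(38,18) instead of the decimal(0,0) that
// triggers "Decimal precision 4 exceeds max precision 0".
public class DecimalBounds {
    static final int MAX_PRECISION = 38; // Spark's maximum decimal precision
    static final int DEFAULT_SCALE = 18; // scale of the decimal(38,18) default

    // Returns {precision, scale} for a decimal element type.
    static int[] boundedDecimal(int precision, int scale) {
        if (precision <= 0) {
            // Unconstrained numeric: metadata gives no usable bound,
            // so use the system default.
            return new int[]{MAX_PRECISION, DEFAULT_SCALE};
        }
        return new int[]{Math.min(precision, MAX_PRECISION),
                         Math.min(scale, MAX_PRECISION)};
    }

    public static void main(String[] args) {
        int[] t = boundedDecimal(0, 0); // what the repro's numeric[] reports
        System.out.println(t[0] + "," + t[1]); // prints 38,18
    }
}
```

This also matches the v1.6 behavior reported below, where the same column was read as `decimal(38,18)`.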
[jira] [Assigned] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26540: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26540: Assignee: Apache Spark
[jira] [Commented] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734731#comment-16734731 ] Dongjoon Hyun commented on SPARK-26540: --- Thank you for checking, [~maropu]. I made a PR and will add a PostgreSQL integration test for this. - https://github.com/apache/spark/pull/23458
[jira] [Comment Edited] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734727#comment-16734727 ] Takeshi Yamamuro edited comment on SPARK-26540 at 1/5/19 1:42 AM: -- I checked: the query failed in v2.0 but passed in v1.6. {code} // spark-1.6.3 scala> val pgTable = sqlContext.read.jdbc("jdbc:postgresql:postgres", "t", options) scala> pgTable.printSchema root |-- v: array (nullable = true) ||-- element: decimal(38,18) (containsNull = true) |-- d: decimal(38,18) (nullable = true) scala> pgTable.show +++ | v| d| +++ |[.222...|222.45550...| +++ {code} was (Author: maropu): yea, the workaround is good. The query failed in v2.0, but passed in v1.6. {code} // spark-1.6.3 scala> val pgTable = sqlContext.read.jdbc("jdbc:postgresql:postgres", "t", options) scala> pgTable.printSchema root |-- v: array (nullable = true) ||-- element: decimal(38,18) (containsNull = true) |-- d: decimal(38,18) (nullable = true) scala> pgTable.show +++ | v| d| +++ |[.222...|222.45550...| +++ {code}
[jira] [Comment Edited] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734719#comment-16734719 ] Dongjoon Hyun edited comment on SPARK-26540 at 1/5/19 1:34 AM: --- Given that Apache Spark handles the decimal type with precision and scale, the reported case looks like a corner case. was (Author: dongjoon): Given that Apache Spark handles the decimal type with precision and scale, the reported case looks like a corner case. And, I believe this will be a minor improvement issue to cover that corner case.
[jira] [Commented] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734727#comment-16734727 ] Takeshi Yamamuro commented on SPARK-26540: -- yea, the workaround is good. The query failed in v2.0, but passed in v1.6. {code} // spark-1.6.3 scala> val pgTable = sqlContext.read.jdbc("jdbc:postgresql:postgres", "t", options) scala> pgTable.printSchema root |-- v: array (nullable = true) ||-- element: decimal(38,18) (containsNull = true) |-- d: decimal(38,18) (nullable = true) scala> pgTable.show +++ | v| d| +++ |[.222...|222.45550...| +++ {code}
[jira] [Comment Edited] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734719#comment-16734719 ] Dongjoon Hyun edited comment on SPARK-26540 at 1/5/19 1:41 AM: --- Given that Apache Spark handles the decimal type with precision and scale, the reported case looks like a corner case. Sorry, I found that the original email also described this situation. I missed that. Yep, I agree. I'll make a PR for this one. was (Author: dongjoon): Given that Apache Spark handles the decimal type with precision and scale, the reported case looks like a corner case.
[jira] [Updated] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26540: -- Issue Type: Bug (was: Improvement)
[jira] [Commented] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734719#comment-16734719 ] Dongjoon Hyun commented on SPARK-26540: --- Given that Apache Spark handles the decimal type with precision and scale, the reported case looks like a corner case. And, I believe this will be a minor improvement issue to cover that corner case.
[jira] [Updated] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26540:
----------------------------------
    Issue Type: Improvement  (was: Bug)

> Requirement failed when reading numeric arrays from PostgreSQL
> --------------------------------------------------------------
>
>                 Key: SPARK-26540
>                 URL: https://issues.apache.org/jira/browse/SPARK-26540
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.2, 2.3.2, 2.4.0
>            Reporter: Takeshi Yamamuro
>            Priority: Minor
>
> This bug was reported in spark-user:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jdbc-postgres-numeric-array-td34280.html
> To reproduce this:
> {code}
> // Creates a table in a PostgreSQL shell
> postgres=# CREATE TABLE t (v numeric[], d numeric);
> CREATE TABLE
> postgres=# INSERT INTO t VALUES('{.222,.332}', 222.4555);
> INSERT 0 1
> postgres=# SELECT * FROM t;
>       v      |    d
> -------------+----------
>  {.222,.332} | 222.4555
> (1 row)
> postgres=# \d t
>       Table "public.t"
>  Column |   Type    | Modifiers
> --------+-----------+-----------
>  v      | numeric[] |
>  d      | numeric   |
>
> // Then, reads it in Spark
> ./bin/spark-shell --jars=postgresql-42.2.4.jar -v
> scala> import java.util.Properties
> scala> val options = new Properties();
> scala> options.setProperty("driver", "org.postgresql.Driver")
> scala> options.setProperty("user", "maropu")
> scala> options.setProperty("password", "")
> scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
> scala> pgTable.printSchema
> root
>  |-- v: array (nullable = true)
>  |    |-- element: decimal(0,0) (containsNull = true)
>  |-- d: decimal(38,18) (nullable = true)
> scala> pgTable.show
> 9/01/05 09:16:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 exceeds max precision 0
>     at scala.Predef$.require(Predef.scala:281)
>     at org.apache.spark.sql.types.Decimal.set(Decimal.scala:116)
>     at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465)
>     ...
> {code}
> I looked over the related code, and I think we need more logic to handle numeric arrays:
> https://github.com/apache/spark/blob/2a30deb85ae4e42c5cbc936383dd5c3970f4a74f/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L41

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
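The failure above can be modeled without a database: JDBC reports an unbounded PostgreSQL NUMERIC with precision 0 and scale 0, and mapping that straight to decimal(0,0) makes any non-zero value overflow. Below is a minimal, hedged sketch of the kind of fallback the linked PostgresDialect.scala line would need; `map_numeric` is a hypothetical helper for illustration, not Spark's actual dialect API.

```python
# Illustrative sketch only (not Spark code): an unbounded NUMERIC arrives
# from JDBC as precision=0, scale=0. Falling back to a wide default mirrors
# how the plain "d numeric" column above became decimal(38,18).
def map_numeric(precision: int, scale: int) -> tuple:
    if precision == 0:
        # Unbounded NUMERIC: substitute the system default instead of
        # producing an unusable decimal(0,0) element type.
        precision, scale = 38, 18
    return ("decimal", precision, scale)
```

Without such a fallback, the element type decimal(0,0) rejects every value, which is exactly the "Decimal precision 4 exceeds max precision 0" error in the trace.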
[jira] [Updated] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26540:
----------------------------------
    Priority: Minor  (was: Major)
[jira] [Commented] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734718#comment-16734718 ] Dongjoon Hyun commented on SPARK-26540:
---------------------------------------
[~maropu]. Please specify the precision and scale. Then, it will work.
{code}
CREATE TABLE t (v numeric(7,3)[], d numeric);
{code}
{code}
scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
pgTable: org.apache.spark.sql.DataFrame = [v: array<decimal(7,3)>, d: decimal(38,18)]

scala> pgTable.show
+------------+------------+
|           v|           d|
+------------+------------+
|[.222, .332]|222.45550...|
+------------+------------+
{code}
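The workaround above works because a declared `numeric(7,3)` gives the array elements a usable precision. A toy model of the check that throws in `org.apache.spark.sql.types.Decimal.set` (a deliberate simplification for illustration, not Spark's implementation):

```python
# Toy model (not Spark code) of the require() that fails: a value needing
# more digits than the declared precision is rejected outright.
def check_precision(num_digits: int, declared_precision: int) -> None:
    if num_digits > declared_precision:
        raise ValueError(
            "requirement failed: Decimal precision %d exceeds max precision %d"
            % (num_digits, declared_precision)
        )
```

With an unbounded `numeric[]` the declared precision is 0, so every non-zero element trips the check; with `numeric(7,3)[]` the same values pass.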
[jira] [Commented] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734715#comment-16734715 ] Dongjoon Hyun commented on SPARK-26540:
---------------------------------------
I'm checking the information PostgresJDBC ResultSet. I'll post the result here, [~maropu].
[jira] [Comment Edited] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734715#comment-16734715 ] Dongjoon Hyun edited comment on SPARK-26540 at 1/5/19 12:58 AM:
----------------------------------------------------------------
I'm checking the information from PostgresJDBC ResultSet. I'll post the result here, [~maropu].

was (Author: dongjoon):
I'm checking the information PostgresJDBC ResultSet. I'll post the result here, [~maropu].
[jira] [Commented] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734708#comment-16734708 ] Dongjoon Hyun commented on SPARK-26540:
---------------------------------------
For me, this is not a correctness issue and not a regression. Could you check the older Spark like 2.0/2.1?
[jira] [Comment Edited] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734708#comment-16734708 ] Dongjoon Hyun edited comment on SPARK-26540 at 1/5/19 12:41 AM:
----------------------------------------------------------------
For me, this is not a correctness issue and not a regression. Could you check the older Spark like 2.0/2.1? I guess this is also not a data loss issue.

was (Author: dongjoon):
For me, this is not a correctness issue and not a regression. Could you check the older Spark like 2.0/2.1?
[jira] [Commented] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734701#comment-16734701 ] Takeshi Yamamuro commented on SPARK-26540:
------------------------------------------
[~dongjoon] [~smilegator] Could you check if this issue is worth blocking the release processes?
[jira] [Updated] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-26540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-26540:
-------------------------------------
    Description: (reproduction steps as quoted in the issue body above)
[jira] [Created] (SPARK-26540) Requirement failed when reading numeric arrays from PostgreSQL
Takeshi Yamamuro created SPARK-26540:
-------------------------------------

             Summary: Requirement failed when reading numeric arrays from PostgreSQL
                 Key: SPARK-26540
                 URL: https://issues.apache.org/jira/browse/SPARK-26540
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0, 2.3.2, 2.2.2
            Reporter: Takeshi Yamamuro
[jira] [Updated] (SPARK-26078) WHERE .. IN fails to filter rows when used in combination with UNION
[ https://issues.apache.org/jira/browse/SPARK-26078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26078: -- Fix Version/s: 2.4.1 > WHERE .. IN fails to filter rows when used in combination with UNION > > > Key: SPARK-26078 > URL: https://issues.apache.org/jira/browse/SPARK-26078 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0 >Reporter: Arttu Voutilainen >Assignee: Marco Gaido >Priority: Blocker > Labels: correctness > Fix For: 2.4.1, 3.0.0 > > > Hey, > We encountered a case where Spark SQL does not seem to handle WHERE .. IN > correctly, when used in combination with UNION, but instead returns also rows > that do not fulfill the condition. Swapping the order of the datasets in the > UNION makes the problem go away. Repro below: > > {code} > sql = SQLContext(sc) > a = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}]) > b = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}]) > a.registerTempTable('a') > b.registerTempTable('b') > bug = sql.sql(""" > SELECT id,num,source FROM > ( > SELECT id, num, 'a' as source FROM a > UNION ALL > SELECT id, num, 'b' as source FROM b > ) AS c > WHERE c.id IN (SELECT id FROM b WHERE num = 2) > """) > no_bug = sql.sql(""" > SELECT id,num,source FROM > ( > SELECT id, num, 'b' as source FROM b > UNION ALL > SELECT id, num, 'a' as source FROM a > ) AS c > WHERE c.id IN (SELECT id FROM b WHERE num = 2) > """) > bug.show() > no_bug.show() > bug.explain(True) > no_bug.explain(True) > {code} > This results in one extra row in the "bug" DF coming from DF "b", that should > not be there as it > {code:java} > >>> bug.show() > +---+---+--+ > | id|num|source| > +---+---+--+ > | a| 2| a| > | a| 2| b| > | b| 1| b| > +---+---+--+ > >>> no_bug.show() > +---+---+--+ > | id|num|source| > +---+---+--+ > | a| 2| b| > | a| 2| a| > +---+---+--+ > {code} > The reason can be seen in the query plans: > {code:java} > >>> bug.explain(True) > 
... > == Optimized Logical Plan == > Union > :- Project [id#0, num#1L, a AS source#136] > : +- Join LeftSemi, (id#0 = id#4) > : :- LogicalRDD [id#0, num#1L], false > : +- Project [id#4] > :+- Filter (isnotnull(num#5L) && (num#5L = 2)) > : +- LogicalRDD [id#4, num#5L], false > +- Join LeftSemi, (id#4#172 = id#4#172) >:- Project [id#4, num#5L, b AS source#137] >: +- LogicalRDD [id#4, num#5L], false >+- Project [id#4 AS id#4#172] > +- Filter (isnotnull(num#5L) && (num#5L = 2)) > +- LogicalRDD [id#4, num#5L], false > {code} > Note the line *+- Join LeftSemi, (id#4#172 = id#4#172)* - this condition > seems wrong, and I believe it causes the LeftSemi to return true for all rows > in the left-hand-side table, thus failing to filter as the WHERE .. IN > should. Compare with the non-buggy version, where both LeftSemi joins have > distinct #-things on both sides: > {code:java} > >>> no_bug.explain() > ... > == Optimized Logical Plan == > Union > :- Project [id#4, num#5L, b AS source#142] > : +- Join LeftSemi, (id#4 = id#4#173) > : :- LogicalRDD [id#4, num#5L], false > : +- Project [id#4 AS id#4#173] > :+- Filter (isnotnull(num#5L) && (num#5L = 2)) > : +- LogicalRDD [id#4, num#5L], false > +- Project [id#0, num#1L, a AS source#143] >+- Join LeftSemi, (id#0 = id#4#173) > :- LogicalRDD [id#0, num#1L], false > +- Project [id#4 AS id#4#173] > +- Filter (isnotnull(num#5L) && (num#5L = 2)) > +- LogicalRDD [id#4, num#5L], false > {code} > > Best, > -Arttu > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
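The effect of the bad rewrite can be modeled outside Spark: a LeftSemi join whose condition compares a left attribute to itself keeps every left row whenever the right side is non-empty, which is exactly the extra-row symptom above. A minimal sketch in plain Python (not Spark's operators; the row values follow the repro):

```python
# Minimal model of a LeftSemi join: keep a left row if ANY right row
# satisfies the condition. Illustrates why the degenerate predicate
# (id#4#172 = id#4#172) stops filtering anything.
def left_semi(left_rows, right_rows, cond):
    return [l for l in left_rows if any(cond(l, r) for r in right_rows)]

left = [{"id": "a", "num": 2}, {"id": "b", "num": 1}]
right = [{"id": "a"}]  # ids from the subquery WHERE num = 2

# Buggy plan: both sides of the equality resolved to the same attribute,
# so the predicate is trivially true and every left row survives.
buggy = left_semi(left, right, lambda l, r: l["id"] == l["id"])

# Intended plan: the left id must match a right id.
correct = left_semi(left, right, lambda l, r: l["id"] == r["id"])
```

With the buggy condition both rows survive (including `id = b`); with the intended condition only `id = a` remains, matching the `no_bug` output.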
[jira] [Resolved] (SPARK-26306) Flaky test: org.apache.spark.util.collection.SorterSuite
[ https://issues.apache.org/jira/browse/SPARK-26306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26306. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23425 [https://github.com/apache/spark/pull/23425] > Flaky test: org.apache.spark.util.collection.SorterSuite > > > Key: SPARK-26306 > URL: https://issues.apache.org/jira/browse/SPARK-26306 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Assignee: Sean Owen >Priority: Minor > Fix For: 3.0.0 > > > TestTimSort causes OOM frequently. > {code:java} > [info] org.apache.spark.util.collection.SorterSuite *** ABORTED *** (3 > seconds, 225 milliseconds) > [info] java.lang.OutOfMemoryError: Java heap space > [info] at > org.apache.spark.util.collection.TestTimSort.createArray(TestTimSort.java:56) > [info] at > org.apache.spark.util.collection.TestTimSort.getTimSortBugTestSet(TestTimSort.java:43) > [info] at > org.apache.spark.util.collection.SorterSuite.$anonfun$new$8(SorterSuite.scala:70) > [info] at > org.apache.spark.util.collection.SorterSuite$$Lambda$11365/360747485.apply$mcV$sp(Unknown > Source) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > [info] at > org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103) > [info] at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > [info] at > org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > [info] at 
org.scalatest.FunSuiteLike$$Lambda$132/1886906768.apply(Unknown > Source) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > [info] at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) > [info] at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) > [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1560) > [info] at > org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229) > [info] at org.scalatest.FunSuiteLike$$Lambda$128/398936629.apply(Unknown > Source) > [info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396) > [info] at org.scalatest.SuperEngine$$Lambda$129/1905082148.apply(Unknown > Source) > [info] at scala.collection.immutable.List.foreach(List.scala:388) > [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379) > [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > [info] at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229) > [info] at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228) > [info] at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > [info] at org.scalatest.Suite.run(Suite.scala:1147) > [info] at org.scalatest.Suite.run$(Suite.scala:1129) > [error] Uncaught exception when running > org.apache.spark.util.collection.SorterSuite: java.lang.OutOfMemoryError: > Java heap space > sbt.ForkMain$ForkError: java.lang.OutOfMemoryError: Java heap space > at > org.apache.spark.util.collection.TestTimSort.createArray(TestTimSort.java:56) > at > org.apache.spark.util.collection.TestTimSort.getTimSortBugTestSet(TestTimSort.java:43) > at > org.apache.spark.util.collection.SorterSuite.$anonfun$new$8(SorterSuite.scala:70) > at > org.apache.spark.util.collection.SorterSuite$$Lambda$11365/360747485.apply$mcV$sp(Unknown > Source) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) > at 
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103) > at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > at org.scalatest.FunSuiteLike$$Lambda$132/1886906768.apply(Unknown > Source) > at
[jira] [Resolved] (SPARK-26503) Get rid of spark.sql.legacy.timeParser.enabled
[ https://issues.apache.org/jira/browse/SPARK-26503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26503. --- Resolution: Won't Fix See PR discussion > Get rid of spark.sql.legacy.timeParser.enabled > -- > > Key: SPARK-26503 > URL: https://issues.apache.org/jira/browse/SPARK-26503 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > The flag is used in CSV/JSON datasources as well as in time-related functions to > control parsing/formatting of dates/timestamps. By default, DateTimeFormat is > used for the purpose, but the flag allows switching back to SimpleDateFormat and > some fallbacks. In the major release 3.0, the flag should be removed, and the new > formatters DateFormatter/TimestampFormatter should be used by default. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
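The behavior the flag controlled (try the new strict formatter first, then fall back to the legacy, more lenient one) can be sketched in a few lines of Python. This is illustrative only; Spark's actual parsers are Scala classes, and the patterns below are stand-ins, not Spark's real formats:

```python
from datetime import datetime

def parse_timestamp(value: str, legacy_fallback: bool = True) -> datetime:
    """Parse with a strict pattern first; optionally fall back to a
    looser legacy pattern, mirroring the switch that
    spark.sql.legacy.timeParser.enabled used to control."""
    try:
        return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        if not legacy_fallback:
            raise
        # Legacy-style fallback: also accept a date-only value.
        return datetime.strptime(value, "%Y-%m-%d")
```

With the fallback disabled, a date-only string raises, which is roughly the stricter behavior the new DateFormatter/TimestampFormatter path implies.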
[jira] [Assigned] (SPARK-26306) Flaky test: org.apache.spark.util.collection.SorterSuite
[ https://issues.apache.org/jira/browse/SPARK-26306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26306: - Assignee: Sean Owen > Flaky test: org.apache.spark.util.collection.SorterSuite > > > Key: SPARK-26306 > URL: https://issues.apache.org/jira/browse/SPARK-26306 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Assignee: Sean Owen >Priority: Minor > > TestTimSort causes OOM frequently. > {code:java} > [info] org.apache.spark.util.collection.SorterSuite *** ABORTED *** (3 > seconds, 225 milliseconds) > [info] java.lang.OutOfMemoryError: Java heap space > [info] at > org.apache.spark.util.collection.TestTimSort.createArray(TestTimSort.java:56) > [info] at > org.apache.spark.util.collection.TestTimSort.getTimSortBugTestSet(TestTimSort.java:43) > [info] at > org.apache.spark.util.collection.SorterSuite.$anonfun$new$8(SorterSuite.scala:70) > [info] at > org.apache.spark.util.collection.SorterSuite$$Lambda$11365/360747485.apply$mcV$sp(Unknown > Source) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > [info] at > org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103) > [info] at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > [info] at > org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > [info] at org.scalatest.FunSuiteLike$$Lambda$132/1886906768.apply(Unknown > Source) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > [info] at 
org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) > [info] at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) > [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1560) > [info] at > org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229) > [info] at org.scalatest.FunSuiteLike$$Lambda$128/398936629.apply(Unknown > Source) > [info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396) > [info] at org.scalatest.SuperEngine$$Lambda$129/1905082148.apply(Unknown > Source) > [info] at scala.collection.immutable.List.foreach(List.scala:388) > [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379) > [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > [info] at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229) > [info] at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228) > [info] at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > [info] at org.scalatest.Suite.run(Suite.scala:1147) > [info] at org.scalatest.Suite.run$(Suite.scala:1129) > [error] Uncaught exception when running > org.apache.spark.util.collection.SorterSuite: java.lang.OutOfMemoryError: > Java heap space > sbt.ForkMain$ForkError: java.lang.OutOfMemoryError: Java heap space > at > org.apache.spark.util.collection.TestTimSort.createArray(TestTimSort.java:56) > at > org.apache.spark.util.collection.TestTimSort.getTimSortBugTestSet(TestTimSort.java:43) > at > org.apache.spark.util.collection.SorterSuite.$anonfun$new$8(SorterSuite.scala:70) > at > org.apache.spark.util.collection.SorterSuite$$Lambda$11365/360747485.apply$mcV$sp(Unknown > Source) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at 
org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103) > at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > at org.scalatest.FunSuiteLike$$Lambda$132/1886906768.apply(Unknown > Source) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) > at
[jira] [Assigned] (SPARK-26539) Remove spark.memory.useLegacyMode memory settings + StaticMemoryManager
[ https://issues.apache.org/jira/browse/SPARK-26539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26539: Assignee: Sean Owen (was: Apache Spark) > Remove spark.memory.useLegacyMode memory settings + StaticMemoryManager > --- > > Key: SPARK-26539 > URL: https://issues.apache.org/jira/browse/SPARK-26539 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > > The old memory manager was superseded by UnifiedMemoryManager in Spark 1.6, > and has been the default since. I think we could remove it to simplify the > code a little, but more importantly to reduce the variety of memory settings > users are confronted with when using Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26539) Remove spark.memory.useLegacyMode memory settings + StaticMemoryManager
[ https://issues.apache.org/jira/browse/SPARK-26539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-26539: -- Docs Text: The legacy memory manager, enabled by spark.memory.useLegacyMode, has been removed in Spark 3. The setting now behaves as if it were always false, which has been its default since Spark 1.6. Associated settings spark.shuffle.memoryFraction, spark.storage.memoryFraction, and spark.storage.unrollFraction have been removed as well. Labels: release-notes (was: ) I should say I think this is still open to discussion, but per a quick dev@ thread about it, I didn't hear strong objections. > Remove spark.memory.useLegacyMode memory settings + StaticMemoryManager > --- > > Key: SPARK-26539 > URL: https://issues.apache.org/jira/browse/SPARK-26539 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > Labels: release-notes > > The old memory manager was superseded by UnifiedMemoryManager in Spark 1.6, > which has been the default since. I think we could remove it to simplify the > code a little, but more importantly to reduce the variety of memory settings > users are confronted with when using Spark.
[jira] [Assigned] (SPARK-26539) Remove spark.memory.useLegacyMode memory settings + StaticMemoryManager
[ https://issues.apache.org/jira/browse/SPARK-26539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26539: Assignee: Apache Spark (was: Sean Owen) > Remove spark.memory.useLegacyMode memory settings + StaticMemoryManager > --- > > Key: SPARK-26539 > URL: https://issues.apache.org/jira/browse/SPARK-26539 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Apache Spark >Priority: Major > > The old memory manager was superseded by UnifiedMemoryManager in Spark 1.6, > and has been the default since. I think we could remove it to simplify the > code a little, but more importantly to reduce the variety of memory settings > users are confronted with when using Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26539) Remove spark.memory.useLegacyMode memory settings + StaticMemoryManager
Sean Owen created SPARK-26539: - Summary: Remove spark.memory.useLegacyMode memory settings + StaticMemoryManager Key: SPARK-26539 URL: https://issues.apache.org/jira/browse/SPARK-26539 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 3.0.0 Reporter: Sean Owen Assignee: Sean Owen The old memory manager was superseded by UnifiedMemoryManager in Spark 1.6, and has been the default since. I think we could remove it to simplify the code a little, but more importantly to reduce the variety of memory settings users are confronted with when using Spark.
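For context on what remains after the removal: UnifiedMemoryManager sizes a single pool shared by execution and storage. A rough Python sketch of the initial sizing arithmetic, assuming the commonly documented defaults of 300 MB reserved memory, spark.memory.fraction=0.6, and spark.memory.storageFraction=0.5 (these numbers are assumptions for illustration, not read from a live Spark config):

```python
# Assumed defaults for the unified memory manager (illustrative only):
#   300 MB reserved, spark.memory.fraction=0.6, spark.memory.storageFraction=0.5
RESERVED_BYTES = 300 * 1024 * 1024

def unified_memory_pools(heap_bytes: int,
                         memory_fraction: float = 0.6,
                         storage_fraction: float = 0.5) -> tuple:
    """Return (unified pool size, initial storage region size) in bytes.
    At runtime execution and storage can borrow from each other; this
    computes only the initial split."""
    usable = heap_bytes - RESERVED_BYTES
    pool = int(usable * memory_fraction)
    storage = int(pool * storage_fraction)
    return pool, storage
```

The point of the removal is that users tune only these few knobs, instead of the old static shuffle/storage/unroll fractions.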
[jira] [Assigned] (SPARK-26538) Postgres numeric array support
[ https://issues.apache.org/jira/browse/SPARK-26538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26538: Assignee: (was: Apache Spark) > Postgres numeric array support > -- > > Key: SPARK-26538 > URL: https://issues.apache.org/jira/browse/SPARK-26538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.2, 2.4.1 > Environment: PostgreSQL 10.4, 9.6.9. >Reporter: Oleksii >Priority: Minor > > Consider the following table definition: > {code:sql} > create table test1 > ( > v numeric[], > d numeric > ); > insert into test1 values('{.222,.332}', 222.4555); > {code} > When reading the table into a Dataframe, I get the following schema: > {noformat} > root > |-- v: array (nullable = true) > | |-- element: decimal(0,0) (containsNull = true) > |-- d: decimal(38,18) (nullable = true){noformat} > Notice that for both columns precision and scale were not specified, but in > case of the array element I got both set to 0, while in the other case > defaults were set. > Later, when I try to read the Dataframe, I get the following error: > {noformat} > java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 > exceeds max precision 0 > at scala.Predef$.require(Predef.scala:224) > at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114) > at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:453) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$16$$anonfun$apply$6$$anonfun$apply$7.apply(JdbcUtils.scala:474) > ...{noformat} > I would expect to get array elements of type decimal(38,18) and no error when > reading in this case. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26538) Postgres numeric array support
[ https://issues.apache.org/jira/browse/SPARK-26538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26538: Assignee: Apache Spark > Postgres numeric array support > -- > > Key: SPARK-26538 > URL: https://issues.apache.org/jira/browse/SPARK-26538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.2, 2.4.1 > Environment: PostgreSQL 10.4, 9.6.9. >Reporter: Oleksii >Assignee: Apache Spark >Priority: Minor > > Consider the following table definition: > {code:sql} > create table test1 > ( > v numeric[], > d numeric > ); > insert into test1 values('{.222,.332}', 222.4555); > {code} > When reading the table into a Dataframe, I get the following schema: > {noformat} > root > |-- v: array (nullable = true) > | |-- element: decimal(0,0) (containsNull = true) > |-- d: decimal(38,18) (nullable = true){noformat} > Notice that for both columns precision and scale were not specified, but in > case of the array element I got both set to 0, while in the other case > defaults were set. > Later, when I try to read the Dataframe, I get the following error: > {noformat} > java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 > exceeds max precision 0 > at scala.Predef$.require(Predef.scala:224) > at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114) > at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:453) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$16$$anonfun$apply$6$$anonfun$apply$7.apply(JdbcUtils.scala:474) > ...{noformat} > I would expect to get array elements of type decimal(38,18) and no error when > reading in this case. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26538) Postgres numeric array support
Oleksii created SPARK-26538: --- Summary: Postgres numeric array support Key: SPARK-26538 URL: https://issues.apache.org/jira/browse/SPARK-26538 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.2, 2.2.2, 2.4.1 Environment: PostgreSQL 10.4, 9.6.9. Reporter: Oleksii Consider the following table definition: {code:sql} create table test1 ( v numeric[], d numeric ); insert into test1 values('{.222,.332}', 222.4555); {code} When reading the table into a Dataframe, I get the following schema: {noformat} root |-- v: array (nullable = true) | |-- element: decimal(0,0) (containsNull = true) |-- d: decimal(38,18) (nullable = true){noformat} Notice that for both columns precision and scale were not specified, but in case of the array element I got both set to 0, while in the other case defaults were set. Later, when I try to read the Dataframe, I get the following error: {noformat} java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 exceeds max precision 0 at scala.Predef$.require(Predef.scala:224) at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114) at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:453) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$16$$anonfun$apply$6$$anonfun$apply$7.apply(JdbcUtils.scala:474) ...{noformat} I would expect to get array elements of type decimal(38,18) and no error when reading in this case.
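The failure comes from Spark's Decimal.set rejecting a value with more digits than the precision inferred from JDBC metadata, which for the unconstrained numeric[] element came back as (0, 0). A pure-Python sketch of that check plus the (38, 18) fallback the reporter expects; the function names here are hypothetical illustrations, not Spark's actual API:

```python
from decimal import Decimal

SYSTEM_DEFAULT = (38, 18)  # the default precision/scale the reporter expects

def check_precision(value: Decimal, precision: int) -> Decimal:
    """Mimic the requirement Spark's Decimal.set enforces: the number
    of significant digits must not exceed the declared precision."""
    digits = len(value.as_tuple().digits)
    if digits > precision:
        raise ValueError(
            f"Decimal precision {digits} exceeds max precision {precision}")
    return value

def array_element_type(jdbc_precision: int, jdbc_scale: int) -> tuple:
    """Hypothetical fix: fall back to the system default when JDBC
    metadata reports (0, 0) for an unconstrained numeric[] element,
    as already happens for plain numeric columns."""
    if jdbc_precision == 0:
        return SYSTEM_DEFAULT
    return (jdbc_precision, jdbc_scale)
```

With the (0, 0) metadata passed through unchanged, any non-zero value fails the precision check, which matches the stack trace above.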
[jira] [Commented] (SPARK-26537) update the release scripts to point to gitbox
[ https://issues.apache.org/jira/browse/SPARK-26537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734540#comment-16734540 ] shane knapp commented on SPARK-26537: - [~srowen] i'm pretty sure that my change to the root pom is fine, but a second set of eyes would be great. :) > update the release scripts to point to gitbox > - > > Key: SPARK-26537 > URL: https://issues.apache.org/jira/browse/SPARK-26537 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.3.0, 2.4.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > we're seeing packaging build failures like this: > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2179/console] > i did a quick skim through the repo, and found the offending urls to the old > apache git repos: > > {code:java} > (py35) ➜ spark git:(update-apache-repo) grep -r git-wip * > dev/create-release/release-tag.sh:ASF_SPARK_REPO="git-wip-us.apache.org/repos/asf/spark.git" > dev/create-release/release-util.sh:ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git; > dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://git-wip-us.apache.org/repos/asf?p=spark.git; > pom.xml: > scm:git:https://git-wip-us.apache.org/repos/asf/spark.git > {code} > this affects all versions of spark, so it will need to be backported to all > released versions. > i'll put together a pull request later today. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26537) update the release scripts to point to gitbox
[ https://issues.apache.org/jira/browse/SPARK-26537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26537: Assignee: shane knapp (was: Apache Spark) > update the release scripts to point to gitbox > - > > Key: SPARK-26537 > URL: https://issues.apache.org/jira/browse/SPARK-26537 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.3.0, 2.4.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > we're seeing packaging build failures like this: > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2179/console] > i did a quick skim through the repo, and found the offending urls to the old > apache git repos: > > {code:java} > (py35) ➜ spark git:(update-apache-repo) grep -r git-wip * > dev/create-release/release-tag.sh:ASF_SPARK_REPO="git-wip-us.apache.org/repos/asf/spark.git" > dev/create-release/release-util.sh:ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git; > dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://git-wip-us.apache.org/repos/asf?p=spark.git; > pom.xml: > scm:git:https://git-wip-us.apache.org/repos/asf/spark.git > {code} > this affects all versions of spark, so it will need to be backported to all > released versions. > i'll put together a pull request later today. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26537) update the release scripts to point to gitbox
[ https://issues.apache.org/jira/browse/SPARK-26537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26537: Assignee: Apache Spark (was: shane knapp) > update the release scripts to point to gitbox > - > > Key: SPARK-26537 > URL: https://issues.apache.org/jira/browse/SPARK-26537 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.3.0, 2.4.0 >Reporter: shane knapp >Assignee: Apache Spark >Priority: Major > > we're seeing packaging build failures like this: > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2179/console] > i did a quick skim through the repo, and found the offending urls to the old > apache git repos: > > {code:java} > (py35) ➜ spark git:(update-apache-repo) grep -r git-wip * > dev/create-release/release-tag.sh:ASF_SPARK_REPO="git-wip-us.apache.org/repos/asf/spark.git" > dev/create-release/release-util.sh:ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git; > dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://git-wip-us.apache.org/repos/asf?p=spark.git; > pom.xml: > scm:git:https://git-wip-us.apache.org/repos/asf/spark.git > {code} > this affects all versions of spark, so it will need to be backported to all > released versions. > i'll put together a pull request later today. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26537) update the release scripts to point to gitbox
[ https://issues.apache.org/jira/browse/SPARK-26537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734531#comment-16734531 ] Dongjoon Hyun edited comment on SPARK-26537 at 1/4/19 8:20 PM: --- Thank you for investigating this, [~shaneknapp]. was (Author: dongjoon): Thank you for investing this, [~shaneknapp]. > update the release scripts to point to gitbox > - > > Key: SPARK-26537 > URL: https://issues.apache.org/jira/browse/SPARK-26537 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.3.0, 2.4.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > we're seeing packaging build failures like this: > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2179/console] > i did a quick skim through the repo, and found the offending urls to the old > apache git repos: > > {code:java} > (py35) ➜ spark git:(update-apache-repo) grep -r git-wip * > dev/create-release/release-tag.sh:ASF_SPARK_REPO="git-wip-us.apache.org/repos/asf/spark.git" > dev/create-release/release-util.sh:ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git; > dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://git-wip-us.apache.org/repos/asf?p=spark.git; > pom.xml: > scm:git:https://git-wip-us.apache.org/repos/asf/spark.git > {code} > this affects all versions of spark, so it will need to be backported to all > released versions. > i'll put together a pull request later today. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26537) update the release scripts to point to gitbox
[ https://issues.apache.org/jira/browse/SPARK-26537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734531#comment-16734531 ] Dongjoon Hyun commented on SPARK-26537: --- Thank you for investing this, [~shaneknapp]. > update the release scripts to point to gitbox > - > > Key: SPARK-26537 > URL: https://issues.apache.org/jira/browse/SPARK-26537 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.3.0, 2.4.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > we're seeing packaging build failures like this: > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2179/console] > i did a quick skim through the repo, and found the offending urls to the old > apache git repos: > > {code:java} > (py35) ➜ spark git:(update-apache-repo) grep -r git-wip * > dev/create-release/release-tag.sh:ASF_SPARK_REPO="git-wip-us.apache.org/repos/asf/spark.git" > dev/create-release/release-util.sh:ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git; > dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://git-wip-us.apache.org/repos/asf?p=spark.git; > pom.xml: > scm:git:https://git-wip-us.apache.org/repos/asf/spark.git > {code} > this affects all versions of spark, so it will need to be backported to all > released versions. > i'll put together a pull request later today. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26537) update the release scripts to point to gitbox
shane knapp created SPARK-26537: --- Summary: update the release scripts to point to gitbox Key: SPARK-26537 URL: https://issues.apache.org/jira/browse/SPARK-26537 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.4.0, 2.3.0, 2.1.0, 2.0.0, 1.6.0 Reporter: shane knapp Assignee: shane knapp we're seeing packaging build failures like this: [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2179/console] i did a quick skim through the repo, and found the offending urls to the old apache git repos: {code:java} (py35) ➜ spark git:(update-apache-repo) grep -r git-wip * dev/create-release/release-tag.sh:ASF_SPARK_REPO="git-wip-us.apache.org/repos/asf/spark.git" dev/create-release/release-util.sh:ASF_REPO="https://git-wip-us.apache.org/repos/asf/spark.git; dev/create-release/release-util.sh:ASF_REPO_WEBUI="https://git-wip-us.apache.org/repos/asf?p=spark.git; pom.xml: scm:git:https://git-wip-us.apache.org/repos/asf/spark.git {code} this affects all versions of spark, so it will need to be backported to all released versions. i'll put together a pull request later today.
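The fix itself amounts to a mechanical URL rewrite in the files the grep found. A hedged Python sketch; note that the gitbox.apache.org/repos/asf replacement target is an assumption on my part, so check the actual pull request for the exact URLs:

```python
import re
from pathlib import Path

OLD_HOST = "git-wip-us.apache.org/repos/asf"
NEW_HOST = "gitbox.apache.org/repos/asf"  # assumed gitbox target, verify in the PR

def migrate_repo_urls(path: Path) -> int:
    """Rewrite retired ASF git URLs in a file in place and return the
    number of replacements made."""
    text = path.read_text()
    new_text, count = re.subn(re.escape(OLD_HOST), NEW_HOST, text)
    if count:
        path.write_text(new_text)
    return count
```

Running this over dev/create-release/release-tag.sh, dev/create-release/release-util.sh, and pom.xml would cover the occurrences listed above.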
[jira] [Comment Edited] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734510#comment-16734510 ] Prashanth Sandela edited comment on SPARK-19809 at 1/4/19 7:40 PM: --- [~dongjoon] I'm encountering a similar issue with Spark version 2.3.1. I'm trying to read from a table which was ingested by Sqoop. There are a few 0-byte files for this table. The file sizes look like below: {noformat}
-rw-rw-r-- 3 cloud-user root 17.3 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-0
-rw-rw-r-- 3 cloud-user root 10.3 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-1
-rw-rw-r-- 3 cloud-user root 19.9 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-2
-rw-rw-r-- 3 cloud-user root 13.0 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-3
-rw-rw-r-- 3 cloud-user root 0 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-4
-rw-rw-r-- 3 cloud-user root 3.4 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-5
-rw-rw-r-- 3 cloud-user root 13.8 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-6
-rw-rw-r-- 3 cloud-user root 0 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-7
-rw-rw-r-- 3 cloud-user root 0 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-8
-rw-rw-r-- 3 cloud-user root 6.9 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-9
-rw-rw-r-- 3 cloud-user root 9.0 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-00010
-rw-rw-r-- 3 cloud-user root 11.4 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-00011
-rw-rw-r-- 3 cloud-user root 14.7 M 2019-01-03 22:20
/apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-00012 -rw-rw-r-- 3 cloud-user root 17.4 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-00013 -rw-rw-r-- 3 cloud-user root 17.1 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-00014{noformat} Spark throws exception while reading this table. {noformat} scala> spark.read.table("table_with_few_zero_byte_files").show() java.lang.RuntimeException: serious problem at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
[jira] [Commented] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734523#comment-16734523 ] Prashanth Sandela commented on SPARK-19809: --- Awesome! This config of `spark.sql.orc.impl=native` works. Thanks for quick response. > NullPointerException on zero-size ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1 >Reporter: Michał Dawid >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.0 > > Attachments: image-2018-02-26-20-29-49-410.png, > spark.sql.hive.convertMetastoreOrc.txt > > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > 
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at
[jira] [Commented] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734521#comment-16734521 ] Dongjoon Hyun commented on SPARK-19809: --- Hi, did you use `spark.sql.orc.impl=native`, too? New ORC is not default in Spark 2.3.x. > NullPointerException on zero-size ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1 >Reporter: Michał Dawid >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.0 > > Attachments: image-2018-02-26-20-29-49-410.png, > spark.sql.hive.convertMetastoreOrc.txt > > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > 
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at
[jira] [Commented] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734510#comment-16734510 ] Prashanth Sandela commented on SPARK-19809: --- [~dongjoon] I'm encountering same similar issue with spark version 2.3.1 I'm trying to read from a table which was ingested by sqoop. There are few 0 byte files for this table. The file sizes looks like below: {noformat} -rw-rw-r-- 3 cloud-user root 17.3 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-0 -rw-rw-r-- 3 cloud-user root 10.3 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-1 -rw-rw-r-- 3 cloud-user root 19.9 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-2 -rw-rw-r-- 3 cloud-user root 13.0 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-3 -rw-rw-r-- 3 cloud-user root 0 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-4 -rw-rw-r-- 3 cloud-user root 3.4 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-5 -rw-rw-r-- 3 cloud-user root 13.8 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-6 -rw-rw-r-- 3 cloud-user root 0 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-7 -rw-rw-r-- 3 cloud-user root 0 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-8 -rw-rw-r-- 3 cloud-user root 6.9 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-9 -rw-rw-r-- 3 cloud-user root 9.0 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-00010 -rw-rw-r-- 3 cloud-user root 11.4 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-00011 -rw-rw-r-- 3 cloud-user root 14.7 M 2019-01-03 22:20 
/apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-00012 -rw-rw-r-- 3 cloud-user root 17.4 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-00013 -rw-rw-r-- 3 cloud-user root 17.1 M 2019-01-03 22:20 /apps/hive/warehouse/default.db/table_with_few_zero_byte_files/part-m-00014{noformat} Spark throws exception while doing count {noformat} scala> spark.read.table("table_with_few_zero_byte_files").show() java.lang.RuntimeException: serious problem at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at
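The fix that resolved this thread is a configuration change rather than a code change. A minimal spark-shell sketch, assuming Spark 2.3+ and reusing the reporter's table name:

```scala
// Switch to Spark's native ORC reader, which tolerates zero-byte files,
// instead of the Hive OrcInputFormat split computation that throws here.
// Both keys are real Spark SQL configs; whether the second is needed
// depends on how the Hive table is declared in the metastore.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

spark.read.table("table_with_few_zero_byte_files").show()
```

The settings can equally be passed at launch time via `--conf spark.sql.orc.impl=native`, so they apply before any table is first read.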
[jira] [Commented] (SPARK-26254) Move delegation token providers into a separate project
[ https://issues.apache.org/jira/browse/SPARK-26254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734393#comment-16734393 ] Gabor Somogyi commented on SPARK-26254: --- [~hyukjin.kwon] In my last comment right before the ping almost everything is clear, but I don't know what the suggestion related to Kafka is. > Move delegation token providers into a separate project > --- > > Key: SPARK-26254 > URL: https://issues.apache.org/jira/browse/SPARK-26254 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > There was a discussion in > [PR#22598|https://github.com/apache/spark/pull/22598] that there are several > provided dependencies inside the core project which shouldn't be there (for ex. > hive and kafka). This jira is to solve this problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26089) Handle large corrupt shuffle blocks
[ https://issues.apache.org/jira/browse/SPARK-26089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26089: Assignee: Apache Spark > Handle large corrupt shuffle blocks > --- > > Key: SPARK-26089 > URL: https://issues.apache.org/jira/browse/SPARK-26089 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Apache Spark >Priority: Major > > We've seen a bad disk lead to corruption in a shuffle block, which lead to > tasks repeatedly failing after fetching the data with an IOException. The > tasks get retried, but the same corrupt data gets fetched again, and the > tasks keep failing. As there isn't a fetch-failure, the jobs eventually > fail, spark never tries to regenerate the shuffle data. > This is the same as SPARK-4105, but that fix only covered small blocks. > There was some discussion during that change about this limitation > (https://github.com/apache/spark/pull/15923#discussion_r88756017) and > followups to cover larger blocks (which would involve spilling to disk to > avoid OOM), but it looks like that never happened. > I can think of a few approaches to this: > 1) wrap the shuffle block input stream with another input stream, that > converts all exceptions into FetchFailures. This is similar to the fix of > SPARK-4105, but that reads the entire input stream up-front, and instead I'm > proposing to do it within the InputStream itself so its streaming and does > not have a large memory overhead. > 2) Add checksums to shuffle blocks. This was proposed > [here|https://github.com/apache/spark/pull/15894] and abandoned as being too > complex. > 3) Try to tackle this with blacklisting instead: when there is any failure in > a task that is reading shuffle data, assign some "blame" to the source of the > shuffle data, and eventually blacklist the source. It seems really tricky to > get sensible heuristics for this, though. 
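Approach (1) above can be sketched as a thin wrapping stream. This is a hypothetical illustration, not Spark's implementation: `toFetchFailure` stands in for constructing a real `FetchFailedException`, which needs block and executor metadata not shown here.

```scala
import java.io.{FilterInputStream, IOException, InputStream}

// Wrap the shuffle block stream so that any IOException raised while the
// task consumes the data surfaces as a fetch failure (triggering shuffle
// regeneration) instead of a plain, endlessly retried task failure.
class FetchFailureWrappingStream(
    in: InputStream,
    toFetchFailure: IOException => Throwable) extends FilterInputStream(in) {

  // Run the delegated read, rethrowing IO errors as fetch failures.
  private def guard[T](body: => T): T =
    try body catch { case e: IOException => throw toFetchFailure(e) }

  override def read(): Int = guard(super.read())
  override def read(b: Array[Byte], off: Int, len: Int): Int =
    guard(super.read(b, off, len))
  override def skip(n: Long): Long = guard(super.skip(n))
}
```

Unlike the SPARK-4105 fix, nothing is buffered up-front: corruption is detected wherever the consumer happens to hit it, keeping memory overhead constant even for large blocks.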
[jira] [Assigned] (SPARK-26089) Handle large corrupt shuffle blocks
[ https://issues.apache.org/jira/browse/SPARK-26089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26089: Assignee: (was: Apache Spark) > Handle large corrupt shuffle blocks > --- > > Key: SPARK-26089 > URL: https://issues.apache.org/jira/browse/SPARK-26089 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > > We've seen a bad disk lead to corruption in a shuffle block, which lead to > tasks repeatedly failing after fetching the data with an IOException. The > tasks get retried, but the same corrupt data gets fetched again, and the > tasks keep failing. As there isn't a fetch-failure, the jobs eventually > fail, spark never tries to regenerate the shuffle data. > This is the same as SPARK-4105, but that fix only covered small blocks. > There was some discussion during that change about this limitation > (https://github.com/apache/spark/pull/15923#discussion_r88756017) and > followups to cover larger blocks (which would involve spilling to disk to > avoid OOM), but it looks like that never happened. > I can think of a few approaches to this: > 1) wrap the shuffle block input stream with another input stream, that > converts all exceptions into FetchFailures. This is similar to the fix of > SPARK-4105, but that reads the entire input stream up-front, and instead I'm > proposing to do it within the InputStream itself so its streaming and does > not have a large memory overhead. > 2) Add checksums to shuffle blocks. This was proposed > [here|https://github.com/apache/spark/pull/15894] and abandoned as being too > complex. > 3) Try to tackle this with blacklisting instead: when there is any failure in > a task that is reading shuffle data, assign some "blame" to the source of the > shuffle data, and eventually blacklist the source. It seems really tricky to > get sensible heuristics for this, though. 
[jira] [Commented] (SPARK-26362) Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark contexts
[ https://issues.apache.org/jira/browse/SPARK-26362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734371#comment-16734371 ] Hyukjin Kwon commented on SPARK-26362: -- Sure will do. > Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark > contexts > --- > > Key: SPARK-26362 > URL: https://issues.apache.org/jira/browse/SPARK-26362 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: releasenotes > Fix For: 3.0.0 > > > Multiple Spark contexts are discouraged and have triggered warnings for 4 years > (see SPARK-4180). > They could cause arbitrary and mysterious error cases. (Honestly, I didn't even > know Spark allows it).
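With the escape hatch removed in Spark 3, code that previously relied on `spark.driver.allowMultipleContexts` has to serialize context lifetimes explicitly. A minimal sketch, assuming a local-mode driver:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Only one SparkContext may be active per JVM: stop the current one
// before constructing another, or the second constructor throws.
val first = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("first"))
first.stop()  // required before a second context may be created

val second = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("second"))
second.stop()
```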
[jira] [Issue Comment Deleted] (SPARK-26366) Except with transform regression
[ https://issues.apache.org/jira/browse/SPARK-26366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-26366: Comment: was deleted (was: mgaido91 opened a new pull request #23372: [SPARK-26366][SQL][BACKPORT-2.3] ReplaceExceptWithFilter should consider NULL as False URL: https://github.com/apache/spark/pull/23372 ## What changes were proposed in this pull request? In `ReplaceExceptWithFilter` we do not consider properly the case in which the condition returns NULL. Indeed, in that case, since negating NULL still returns NULL, so it is not true the assumption that negating the condition returns all the rows which didn't satisfy it, rows returning NULL may not be returned. This happens when constraints inferred by `InferFiltersFromConstraints` are not enough, as it happens with `OR` conditions. The rule had also problems with non-deterministic conditions: in such a scenario, this rule would change the probability of the output. The PR fixes these problem by: - returning False for the condition when it is Null (in this way we do return all the rows which didn't satisfy it); - avoiding any transformation when the condition is non-deterministic. ## How was this patch tested? added UTs This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org ) > Except with transform regression > > > Key: SPARK-26366 > URL: https://issues.apache.org/jira/browse/SPARK-26366 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.2 >Reporter: Dan Osipov >Assignee: Marco Gaido >Priority: Major > Labels: correctness > Fix For: 2.3.3, 2.4.1, 3.0.0 > > > There appears to be a regression between Spark 2.2 and 2.3. 
Below is the code > to reproduce it: > > {code:java} > import org.apache.spark.sql.functions.col > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val inputDF = spark.sqlContext.createDataFrame( > spark.sparkContext.parallelize(Seq( > Row("0", "john", "smith", "j...@smith.com"), > Row("1", "jane", "doe", "j...@doe.com"), > Row("2", "apache", "spark", "sp...@apache.org"), > Row("3", "foo", "bar", null) > )), > StructType(List( > StructField("id", StringType, nullable=true), > StructField("first_name", StringType, nullable=true), > StructField("last_name", StringType, nullable=true), > StructField("email", StringType, nullable=true) > )) > ) > val exceptDF = inputDF.transform( toProcessDF => > toProcessDF.filter( > ( > col("first_name").isin(Seq("john", "jane"): _*) > and col("last_name").isin(Seq("smith", "doe"): _*) > ) > or col("email").isin(List(): _*) > ) > ) > inputDF.except(exceptDF).show() > {code} > Output with Spark 2.2: > {noformat} > +---+--+-++ > | id|first_name|last_name| email| > +---+--+-++ > | 2| apache| spark|sp...@apache.org| > | 3| foo| bar| null| > +---+--+-++{noformat} > Output with Spark 2.3: > {noformat} > +---+--+-++ > | id|first_name|last_name| email| > +---+--+-++ > | 2| apache| spark|sp...@apache.org| > +---+--+-++{noformat} > Note, changing the last line to > {code:java} > inputDF.except(exceptDF.cache()).show() > {code} > produces identical output for both Spark 2.3 and 2.2 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
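The root cause described in the deleted comment is SQL three-valued logic: for the row whose `email` is null, the filter condition evaluates to NULL, and negating NULL yields NULL again, so a rewrite that selects the complement via `NOT cond` silently drops that row. A spark-shell sketch, assuming a `SparkSession` named `spark`:

```scala
// A null operand makes IN evaluate to NULL rather than false...
spark.sql("SELECT CAST(NULL AS STRING) IN ('a') AS cond").show()

// ...and NOT(NULL) is still NULL, so the row satisfies neither
// WHERE cond nor WHERE NOT cond -- the complement is not the complement.
spark.sql("SELECT NOT (CAST(NULL AS STRING) IN ('a')) AS negated").show()
```

This is why the fix coerces a NULL condition to false before negation: only then does `NOT cond` return every row the original condition did not.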
[jira] [Updated] (SPARK-26362) Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark contexts
[ https://issues.apache.org/jira/browse/SPARK-26362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-26362: -- Docs Text: Support for multiple Spark contexts in the same driver has been removed in Spark 3. > Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark > contexts > --- > > Key: SPARK-26362 > URL: https://issues.apache.org/jira/browse/SPARK-26362 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: releasenotes > Fix For: 3.0.0 > > > Multiple Spark contexts are discouraged and have triggered warnings for 4 years > (see SPARK-4180). > They could cause arbitrary and mysterious error cases. (Honestly, I didn't even > know Spark allows it).
[jira] [Updated] (SPARK-26362) Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark contexts
[ https://issues.apache.org/jira/browse/SPARK-26362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-26362: -- Priority: Minor (was: Major) > Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark > contexts > --- > > Key: SPARK-26362 > URL: https://issues.apache.org/jira/browse/SPARK-26362 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: releasenotes > Fix For: 3.0.0 > > > Multiple Spark contexts are discouraged and have triggered warnings for 4 years > (see SPARK-4180). > They could cause arbitrary and mysterious error cases. (Honestly, I didn't even > know Spark allows it).
[jira] [Updated] (SPARK-26362) Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark contexts
[ https://issues.apache.org/jira/browse/SPARK-26362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-26362: Labels: releasenotes (was: ) > Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark > contexts > --- > > Key: SPARK-26362 > URL: https://issues.apache.org/jira/browse/SPARK-26362 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: releasenotes > Fix For: 3.0.0 > > > Multiple Spark contexts are discouraged and have triggered warnings for 4 years > (see SPARK-4180). > They could cause arbitrary and mysterious error cases. (Honestly, I didn't even > know Spark allows it).
[jira] [Issue Comment Deleted] (SPARK-26362) Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark contexts
[ https://issues.apache.org/jira/browse/SPARK-26362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-26362: Comment: was deleted (was: asfgit closed pull request #23311: [SPARK-26362][CORE] Remove 'spark.driver.allowMultipleContexts' to disallow multiple creation of SparkContexts URL: https://github.com/apache/spark/pull/23311 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala index 696dafda6d1ec..09cc346db0ed2 100644 --- a/core/src/main/scala/org/apache/spark/SparkContext.scala +++ b/core/src/main/scala/org/apache/spark/SparkContext.scala @@ -64,9 +64,8 @@ import org.apache.spark.util.logging.DriverLogger * Main entry point for Spark functionality. A SparkContext represents the connection to a Spark * cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster. * - * Only one SparkContext may be active per JVM. You must `stop()` the active SparkContext before - * creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details. - * + * @note Only one `SparkContext` should be active per JVM. You must `stop()` the + * active `SparkContext` before creating a new one. * @param config a Spark Config object describing the application configuration. Any settings in * this config overrides the default configs as well as system properties. */ @@ -75,14 +74,10 @@ class SparkContext(config: SparkConf) extends Logging { // The call site where this SparkContext was constructed. 
private val creationSite: CallSite = Utils.getCallSite() - // If true, log warnings instead of throwing exceptions when multiple SparkContexts are active - private val allowMultipleContexts: Boolean = -config.getBoolean("spark.driver.allowMultipleContexts", false) - // In order to prevent multiple SparkContexts from being active at the same time, mark this // context as having started construction. // NOTE: this must be placed at the beginning of the SparkContext constructor. - SparkContext.markPartiallyConstructed(this, allowMultipleContexts) + SparkContext.markPartiallyConstructed(this) val startTime = System.currentTimeMillis() @@ -2392,7 +2387,7 @@ class SparkContext(config: SparkConf) extends Logging { // In order to prevent multiple SparkContexts from being active at the same time, mark this // context as having finished construction. // NOTE: this must be placed at the end of the SparkContext constructor. - SparkContext.setActiveContext(this, allowMultipleContexts) + SparkContext.setActiveContext(this) } /** @@ -2409,18 +2404,18 @@ object SparkContext extends Logging { private val SPARK_CONTEXT_CONSTRUCTOR_LOCK = new Object() /** - * The active, fully-constructed SparkContext. If no SparkContext is active, then this is `null`. + * The active, fully-constructed SparkContext. If no SparkContext is active, then this is `null`. * - * Access to this field is guarded by SPARK_CONTEXT_CONSTRUCTOR_LOCK. + * Access to this field is guarded by `SPARK_CONTEXT_CONSTRUCTOR_LOCK`. */ private val activeContext: AtomicReference[SparkContext] = new AtomicReference[SparkContext](null) /** - * Points to a partially-constructed SparkContext if some thread is in the SparkContext + * Points to a partially-constructed SparkContext if another thread is in the SparkContext * constructor, or `None` if no SparkContext is being constructed. * - * Access to this field is guarded by SPARK_CONTEXT_CONSTRUCTOR_LOCK + * Access to this field is guarded by `SPARK_CONTEXT_CONSTRUCTOR_LOCK`. 
*/ private var contextBeingConstructed: Option[SparkContext] = None @@ -2428,24 +2423,16 @@ object SparkContext extends Logging { * Called to ensure that no other SparkContext is running in this JVM. * * Throws an exception if a running context is detected and logs a warning if another thread is - * constructing a SparkContext. This warning is necessary because the current locking scheme + * constructing a SparkContext. This warning is necessary because the current locking scheme * prevents us from reliably distinguishing between cases where another context is being * constructed and cases where another constructor threw an exception. */ - private def assertNoOtherContextIsRunning( - sc: SparkContext, - allowMultipleContexts: Boolean): Unit = { + private def assertNoOtherContextIsRunning(sc: SparkContext): Unit = { SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized {
[jira] [Issue Comment Deleted] (SPARK-26366) Except with transform regression
[ https://issues.apache.org/jira/browse/SPARK-26366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-26366: Comment: was deleted (was: mgaido91 opened a new pull request #23350: [SPARK-26366][SQL][BACKPORT-2.3] ReplaceExceptWithFilter should consider NULL as False URL: https://github.com/apache/spark/pull/23350 ## What changes were proposed in this pull request? In `ReplaceExceptWithFilter` we do not consider properly the case in which the condition returns NULL. Indeed, in that case, since negating NULL still returns NULL, so it is not true the assumption that negating the condition returns all the rows which didn't satisfy it, rows returning NULL may not be returned. This happens when constraints inferred by `InferFiltersFromConstraints` are not enough, as it happens with `OR` conditions. The rule had also problems with non-deterministic conditions: in such a scenario, this rule would change the probability of the output. The PR fixes these problem by: - returning False for the condition when it is Null (in this way we do return all the rows which didn't satisfy it); - avoiding any transformation when the condition is non-deterministic. ## How was this patch tested? added UTs This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org ) > Except with transform regression > > > Key: SPARK-26366 > URL: https://issues.apache.org/jira/browse/SPARK-26366 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.2 >Reporter: Dan Osipov >Assignee: Marco Gaido >Priority: Major > Labels: correctness > Fix For: 2.3.3, 2.4.1, 3.0.0 > > > There appears to be a regression between Spark 2.2 and 2.3. 
Below is the code > to reproduce it: > > {code:java} > import org.apache.spark.sql.functions.col > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val inputDF = spark.sqlContext.createDataFrame( > spark.sparkContext.parallelize(Seq( > Row("0", "john", "smith", "j...@smith.com"), > Row("1", "jane", "doe", "j...@doe.com"), > Row("2", "apache", "spark", "sp...@apache.org"), > Row("3", "foo", "bar", null) > )), > StructType(List( > StructField("id", StringType, nullable=true), > StructField("first_name", StringType, nullable=true), > StructField("last_name", StringType, nullable=true), > StructField("email", StringType, nullable=true) > )) > ) > val exceptDF = inputDF.transform( toProcessDF => > toProcessDF.filter( > ( > col("first_name").isin(Seq("john", "jane"): _*) > and col("last_name").isin(Seq("smith", "doe"): _*) > ) > or col("email").isin(List(): _*) > ) > ) > inputDF.except(exceptDF).show() > {code} > Output with Spark 2.2: > {noformat} > +---+----------+---------+----------------+ > | id|first_name|last_name|           email| > +---+----------+---------+----------------+ > |  2|    apache|    spark|sp...@apache.org| > |  3|       foo|      bar|            null| > +---+----------+---------+----------------+{noformat} > Output with Spark 2.3: > {noformat} > +---+----------+---------+----------------+ > | id|first_name|last_name|           email| > +---+----------+---------+----------------+ > |  2|    apache|    spark|sp...@apache.org| > +---+----------+---------+----------------+{noformat} > Note, changing the last line to > {code:java} > inputDF.except(exceptDF.cache()).show() > {code} > produces identical output for both Spark 2.3 and 2.2 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
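The NULL-trapping behavior described in the deleted PR comment above can be sketched without Spark. The following Python model of SQL three-valued logic (using `None` for NULL, with illustrative rows rather than the repro's actual data) shows why rewriting EXCEPT as `Filter(Not(condition))` silently drops rows whose condition evaluates to NULL, and why wrapping the condition in `Coalesce(condition, False)` restores them:

```python
# Minimal model of SQL three-valued logic: True, False, or None (= SQL NULL).

def sql_not(v):
    # NOT NULL is NULL, not True.
    return None if v is None else (not v)

def coalesce(v, default):
    # COALESCE(cond, FALSE): treat a NULL predicate as False.
    return default if v is None else v

def keep(pred):
    # SQL WHERE keeps a row only when the predicate is strictly TRUE.
    return pred is True

# Illustrative (name, email) rows; the last one has a NULL email.
rows = [("john", "john@example.com"), ("apache", "spark@example.org"), ("foo", None)]

def cond(row):
    # Models the repro's filter: a NULL email makes the whole predicate NULL.
    name, email = row
    if email is None:
        return None
    return name in ("john", "jane")

right = [r for r in rows if keep(cond(r))]                            # EXCEPT's right side
buggy = [r for r in rows if keep(sql_not(cond(r)))]                   # old rewrite: NULL row vanishes
fixed = [r for r in rows if keep(sql_not(coalesce(cond(r), False)))]  # patched rewrite

print(buggy)  # [('apache', 'spark@example.org')] -- ('foo', None) is wrongly dropped
print(fixed)  # [('apache', 'spark@example.org'), ('foo', None)]
```

This is exactly the `| 3| foo| bar| null|` row that goes missing in the Spark 2.3 output above.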
[jira] [Issue Comment Deleted] (SPARK-26366) Except with transform regression
[ https://issues.apache.org/jira/browse/SPARK-26366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-26366: Comment: was deleted (was: mgaido91 opened a new pull request #23315: [SPARK-26366][SQL] ReplaceExceptWithFilter should consider NULL as False URL: https://github.com/apache/spark/pull/23315 ## What changes were proposed in this pull request? In `ReplaceExceptWithFilter` we do not consider the case in which the condition returns NULL. Indeed, in that case, negating NULL still returns NULL, so the assumption that negating the condition returns all the rows which didn't satisfy it does not hold: rows returning NULL are not returned. The PR fixes this problem by returning False for the condition when it is Null. In this way we do return all the rows which didn't satisfy it. ## How was this patch tested? added UTs This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org ) > Except with transform regression > > > Key: SPARK-26366 > URL: https://issues.apache.org/jira/browse/SPARK-26366 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.2 >Reporter: Dan Osipov >Assignee: Marco Gaido >Priority: Major > Labels: correctness > Fix For: 2.3.3, 2.4.1, 3.0.0 > > > There appears to be a regression between Spark 2.2 and 2.3. 
Below is the code > to reproduce it: > > {code:java} > import org.apache.spark.sql.functions.col > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val inputDF = spark.sqlContext.createDataFrame( > spark.sparkContext.parallelize(Seq( > Row("0", "john", "smith", "j...@smith.com"), > Row("1", "jane", "doe", "j...@doe.com"), > Row("2", "apache", "spark", "sp...@apache.org"), > Row("3", "foo", "bar", null) > )), > StructType(List( > StructField("id", StringType, nullable=true), > StructField("first_name", StringType, nullable=true), > StructField("last_name", StringType, nullable=true), > StructField("email", StringType, nullable=true) > )) > ) > val exceptDF = inputDF.transform( toProcessDF => > toProcessDF.filter( > ( > col("first_name").isin(Seq("john", "jane"): _*) > and col("last_name").isin(Seq("smith", "doe"): _*) > ) > or col("email").isin(List(): _*) > ) > ) > inputDF.except(exceptDF).show() > {code} > Output with Spark 2.2: > {noformat} > +---+----------+---------+----------------+ > | id|first_name|last_name|           email| > +---+----------+---------+----------------+ > |  2|    apache|    spark|sp...@apache.org| > |  3|       foo|      bar|            null| > +---+----------+---------+----------------+{noformat} > Output with Spark 2.3: > {noformat} > +---+----------+---------+----------------+ > | id|first_name|last_name|           email| > +---+----------+---------+----------------+ > |  2|    apache|    spark|sp...@apache.org| > +---+----------+---------+----------------+{noformat} > Note, changing the last line to > {code:java} > inputDF.except(exceptDF.cache()).show() > {code} > produces identical output for both Spark 2.3 and 2.2 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-26366) Except with transform regression
[ https://issues.apache.org/jira/browse/SPARK-26366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-26366: Comment: was deleted (was: gatorsmile closed pull request #23350: [SPARK-26366][SQL][BACKPORT-2.3] ReplaceExceptWithFilter should consider NULL as False URL: https://github.com/apache/spark/pull/23350 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceExceptWithFilter.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceExceptWithFilter.scala index 45edf266bbce4..08cf16038a654 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceExceptWithFilter.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceExceptWithFilter.scala @@ -36,7 +36,8 @@ import org.apache.spark.sql.catalyst.rules.Rule * Note: * Before flipping the filter condition of the right node, we should: * 1. Combine all it's [[Filter]]. - * 2. Apply InferFiltersFromConstraints rule (to take into account of NULL values in the condition). + * 2. Update the attribute references to the left node; + * 3. Add a Coalesce(condition, False) (to take into account of NULL values in the condition). 
*/ object ReplaceExceptWithFilter extends Rule[LogicalPlan] { @@ -47,23 +48,28 @@ object ReplaceExceptWithFilter extends Rule[LogicalPlan] { plan.transform { case e @ Except(left, right) if isEligible(left, right) => -val newCondition = transformCondition(left, skipProject(right)) -newCondition.map { c => - Distinct(Filter(Not(c), left)) -}.getOrElse { +val filterCondition = combineFilters(skipProject(right)).asInstanceOf[Filter].condition +if (filterCondition.deterministic) { + transformCondition(left, filterCondition).map { c => +Distinct(Filter(Not(c), left)) + }.getOrElse { +e + } +} else { e } } } - private def transformCondition(left: LogicalPlan, right: LogicalPlan): Option[Expression] = { -val filterCondition = - InferFiltersFromConstraints(combineFilters(right)).asInstanceOf[Filter].condition - -val attributeNameMap: Map[String, Attribute] = left.output.map(x => (x.name, x)).toMap - -if (filterCondition.references.forall(r => attributeNameMap.contains(r.name))) { - Some(filterCondition.transform { case a: AttributeReference => attributeNameMap(a.name) }) + private def transformCondition(plan: LogicalPlan, condition: Expression): Option[Expression] = { +val attributeNameMap: Map[String, Attribute] = plan.output.map(x => (x.name, x)).toMap +if (condition.references.forall(r => attributeNameMap.contains(r.name))) { + val rewrittenCondition = condition.transform { +case a: AttributeReference => attributeNameMap(a.name) + } + // We need to consider as False when the condition is NULL, otherwise we do not return those + // rows containing NULL which are instead filtered in the Except right plan + Some(Coalesce(Seq(rewrittenCondition, Literal.FalseLiteral))) } else { None } diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala index 52dc2e9fb076c..78d3969906e99 100644 --- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala @@ -20,11 +20,12 @@ package org.apache.spark.sql.catalyst.optimizer import org.apache.spark.sql.Row import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.dsl.plans._ -import org.apache.spark.sql.catalyst.expressions.{Alias, Literal, Not} +import org.apache.spark.sql.catalyst.expressions.{Alias, Coalesce, If, Literal, Not} import org.apache.spark.sql.catalyst.expressions.aggregate.First import org.apache.spark.sql.catalyst.plans.{LeftAnti, LeftSemi, PlanTest} import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.catalyst.rules.RuleExecutor +import org.apache.spark.sql.types.BooleanType class ReplaceOperatorSuite extends PlanTest { @@ -65,8 +66,7 @@ class ReplaceOperatorSuite extends PlanTest { val correctAnswer = Aggregate(table1.output, table1.output, -Filter(Not((attributeA.isNotNull && attributeB.isNotNull) && - (attributeA >= 2 && attributeB < 1)), +
[jira] [Commented] (SPARK-26366) Except with transform regression
[ https://issues.apache.org/jira/browse/SPARK-26366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734363#comment-16734363 ] Reynold Xin commented on SPARK-26366: - Please make sure you guys tag these tickets with the correctness label! > Except with transform regression > > > Key: SPARK-26366 > URL: https://issues.apache.org/jira/browse/SPARK-26366 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.2 >Reporter: Dan Osipov >Assignee: Marco Gaido >Priority: Major > Labels: correctness > Fix For: 2.3.3, 2.4.1, 3.0.0 > > > There appears to be a regression between Spark 2.2 and 2.3. Below is the code > to reproduce it: > > {code:java} > import org.apache.spark.sql.functions.col > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val inputDF = spark.sqlContext.createDataFrame( > spark.sparkContext.parallelize(Seq( > Row("0", "john", "smith", "j...@smith.com"), > Row("1", "jane", "doe", "j...@doe.com"), > Row("2", "apache", "spark", "sp...@apache.org"), > Row("3", "foo", "bar", null) > )), > StructType(List( > StructField("id", StringType, nullable=true), > StructField("first_name", StringType, nullable=true), > StructField("last_name", StringType, nullable=true), > StructField("email", StringType, nullable=true) > )) > ) > val exceptDF = inputDF.transform( toProcessDF => > toProcessDF.filter( > ( > col("first_name").isin(Seq("john", "jane"): _*) > and col("last_name").isin(Seq("smith", "doe"): _*) > ) > or col("email").isin(List(): _*) > ) > ) > inputDF.except(exceptDF).show() > {code} > Output with Spark 2.2: > {noformat} > +---+----------+---------+----------------+ > | id|first_name|last_name|           email| > +---+----------+---------+----------------+ > |  2|    apache|    spark|sp...@apache.org| > |  3|       foo|      bar|            null| > +---+----------+---------+----------------+{noformat} > Output with Spark 2.3: > {noformat} > +---+----------+---------+----------------+ > | id|first_name|last_name|           email| > +---+----------+---------+----------------+ > |  2|    apache|    spark|sp...@apache.org| > +---+----------+---------+----------------+{noformat} > Note, changing the last line to > {code:java} > 
inputDF.except(exceptDF.cache()).show() > {code} > produces identical output for both Spark 2.3 and 2.2 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26366) Except with transform regression
[ https://issues.apache.org/jira/browse/SPARK-26366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-26366: Labels: correctness (was: ) > Except with transform regression > > > Key: SPARK-26366 > URL: https://issues.apache.org/jira/browse/SPARK-26366 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.2 >Reporter: Dan Osipov >Assignee: Marco Gaido >Priority: Major > Labels: correctness > Fix For: 2.3.3, 2.4.1, 3.0.0 > > > There appears to be a regression between Spark 2.2 and 2.3. Below is the code > to reproduce it: > > {code:java} > import org.apache.spark.sql.functions.col > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val inputDF = spark.sqlContext.createDataFrame( > spark.sparkContext.parallelize(Seq( > Row("0", "john", "smith", "j...@smith.com"), > Row("1", "jane", "doe", "j...@doe.com"), > Row("2", "apache", "spark", "sp...@apache.org"), > Row("3", "foo", "bar", null) > )), > StructType(List( > StructField("id", StringType, nullable=true), > StructField("first_name", StringType, nullable=true), > StructField("last_name", StringType, nullable=true), > StructField("email", StringType, nullable=true) > )) > ) > val exceptDF = inputDF.transform( toProcessDF => > toProcessDF.filter( > ( > col("first_name").isin(Seq("john", "jane"): _*) > and col("last_name").isin(Seq("smith", "doe"): _*) > ) > or col("email").isin(List(): _*) > ) > ) > inputDF.except(exceptDF).show() > {code} > Output with Spark 2.2: > {noformat} > +---+----------+---------+----------------+ > | id|first_name|last_name|           email| > +---+----------+---------+----------------+ > |  2|    apache|    spark|sp...@apache.org| > |  3|       foo|      bar|            null| > +---+----------+---------+----------------+{noformat} > Output with Spark 2.3: > {noformat} > +---+----------+---------+----------------+ > | id|first_name|last_name|           email| > +---+----------+---------+----------------+ > |  2|    apache|    spark|sp...@apache.org| > +---+----------+---------+----------------+{noformat} > Note, changing the last line to > {code:java} > inputDF.except(exceptDF.cache()).show() > {code} > produces identical output for both 
Spark 2.3 and 2.2 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-26366) Except with transform regression
[ https://issues.apache.org/jira/browse/SPARK-26366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-26366: Comment: was deleted (was: asfgit closed pull request #23315: [SPARK-26366][SQL] ReplaceExceptWithFilter should consider NULL as False URL: https://github.com/apache/spark/pull/23315 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceExceptWithFilter.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceExceptWithFilter.scala index efd3944eba7f5..4996d24dfd298 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceExceptWithFilter.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceExceptWithFilter.scala @@ -36,7 +36,8 @@ import org.apache.spark.sql.catalyst.rules.Rule * Note: * Before flipping the filter condition of the right node, we should: * 1. Combine all it's [[Filter]]. - * 2. Apply InferFiltersFromConstraints rule (to take into account of NULL values in the condition). + * 2. Update the attribute references to the left node; + * 3. Add a Coalesce(condition, False) (to take into account of NULL values in the condition). 
*/ object ReplaceExceptWithFilter extends Rule[LogicalPlan] { @@ -47,23 +48,28 @@ object ReplaceExceptWithFilter extends Rule[LogicalPlan] { plan.transform { case e @ Except(left, right, false) if isEligible(left, right) => -val newCondition = transformCondition(left, skipProject(right)) -newCondition.map { c => - Distinct(Filter(Not(c), left)) -}.getOrElse { +val filterCondition = combineFilters(skipProject(right)).asInstanceOf[Filter].condition +if (filterCondition.deterministic) { + transformCondition(left, filterCondition).map { c => +Distinct(Filter(Not(c), left)) + }.getOrElse { +e + } +} else { e } } } - private def transformCondition(left: LogicalPlan, right: LogicalPlan): Option[Expression] = { -val filterCondition = - InferFiltersFromConstraints(combineFilters(right)).asInstanceOf[Filter].condition - -val attributeNameMap: Map[String, Attribute] = left.output.map(x => (x.name, x)).toMap - -if (filterCondition.references.forall(r => attributeNameMap.contains(r.name))) { - Some(filterCondition.transform { case a: AttributeReference => attributeNameMap(a.name) }) + private def transformCondition(plan: LogicalPlan, condition: Expression): Option[Expression] = { +val attributeNameMap: Map[String, Attribute] = plan.output.map(x => (x.name, x)).toMap +if (condition.references.forall(r => attributeNameMap.contains(r.name))) { + val rewrittenCondition = condition.transform { +case a: AttributeReference => attributeNameMap(a.name) + } + // We need to consider as False when the condition is NULL, otherwise we do not return those + // rows containing NULL which are instead filtered in the Except right plan + Some(Coalesce(Seq(rewrittenCondition, Literal.FalseLiteral))) } else { None } diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala index 3b1b2d588ef67..c8e15c7da763e 100644 --- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala @@ -20,11 +20,12 @@ package org.apache.spark.sql.catalyst.optimizer import org.apache.spark.sql.Row import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.dsl.plans._ -import org.apache.spark.sql.catalyst.expressions.{Alias, Literal, Not} +import org.apache.spark.sql.catalyst.expressions.{Alias, Coalesce, If, Literal, Not} import org.apache.spark.sql.catalyst.expressions.aggregate.First import org.apache.spark.sql.catalyst.plans.{LeftAnti, LeftSemi, PlanTest} import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.catalyst.rules.RuleExecutor +import org.apache.spark.sql.types.BooleanType class ReplaceOperatorSuite extends PlanTest { @@ -65,8 +66,7 @@ class ReplaceOperatorSuite extends PlanTest { val correctAnswer = Aggregate(table1.output, table1.output, -Filter(Not((attributeA.isNotNull && attributeB.isNotNull) && - (attributeA >= 2 && attributeB < 1)), +
[jira] [Issue Comment Deleted] (SPARK-26246) Infer timestamp types from JSON
[ https://issues.apache.org/jira/browse/SPARK-26246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-26246: Comment: was deleted (was: asfgit closed pull request #23201: [SPARK-26246][SQL] Inferring TimestampType from JSON URL: https://github.com/apache/spark/pull/23201 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala index 263e05de32075..d1bc00c08c1c6 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala @@ -28,7 +28,7 @@ import org.apache.spark.rdd.RDD import org.apache.spark.sql.catalyst.analysis.TypeCoercion import org.apache.spark.sql.catalyst.expressions.ExprUtils import org.apache.spark.sql.catalyst.json.JacksonUtils.nextUntil -import org.apache.spark.sql.catalyst.util.{DropMalformedMode, FailFastMode, ParseMode, PermissiveMode} +import org.apache.spark.sql.catalyst.util._ import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types._ import org.apache.spark.util.Utils @@ -37,6 +37,12 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable { private val decimalParser = ExprUtils.getDecimalParser(options.locale) + @transient + private lazy val timestampFormatter = TimestampFormatter( +options.timestampFormat, +options.timeZone, +options.locale) + /** * Infer the type of a collection of json records in three stages: * 1. 
Infer the type of each record @@ -115,13 +121,19 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable { // record fields' types have been combined. NullType - case VALUE_STRING if options.prefersDecimal => + case VALUE_STRING => +val field = parser.getText val decimalTry = allCatch opt { - val bigDecimal = decimalParser(parser.getText) + val bigDecimal = decimalParser(field) DecimalType(bigDecimal.precision, bigDecimal.scale) } -decimalTry.getOrElse(StringType) - case VALUE_STRING => StringType +if (options.prefersDecimal && decimalTry.isDefined) { + decimalTry.get +} else if ((allCatch opt timestampFormatter.parse(field)).isDefined) { + TimestampType +} else { + StringType +} case START_OBJECT => val builder = Array.newBuilder[StructField] diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JsonInferSchemaSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JsonInferSchemaSuite.scala new file mode 100644 index 0..9307f9b47b807 --- /dev/null +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JsonInferSchemaSuite.scala @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.catalyst.json + +import com.fasterxml.jackson.core.JsonFactory + +import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.catalyst.plans.SQLHelper +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types._ + +class JsonInferSchemaSuite extends SparkFunSuite with SQLHelper { + + def checkType(options: Map[String, String], json: String, dt: DataType): Unit = { +val jsonOptions = new JSONOptions(options, "UTC", "") +val inferSchema = new JsonInferSchema(jsonOptions) +val factory = new JsonFactory() +jsonOptions.setJacksonOptions(factory) +val parser = CreateJacksonParser.string(factory, json) +parser.nextToken() +val expectedType = StructType(Seq(StructField("a", dt, true))) + +assert(inferSchema.inferField(parser) === expectedType) + } + + def
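The inference order introduced by the patch above — try decimal only under `prefersDecimal`, then try the timestamp formatter, else fall back to string — can be sketched in Python. Here `infer_string_field` is a hypothetical helper, not Spark's API, and the timestamp format is an assumed ISO-like pattern:

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

def infer_string_field(field, prefers_decimal=False, ts_format="%Y-%m-%dT%H:%M:%S"):
    """Mirror the patched inference order: decimal, then timestamp, then string."""
    if prefers_decimal:
        try:
            d = Decimal(field)
            # Precision = number of digits, scale = negated exponent.
            return f"decimal({len(d.as_tuple().digits)},{-d.as_tuple().exponent})"
        except InvalidOperation:
            pass  # not a decimal; fall through to the timestamp check
    try:
        datetime.strptime(field, ts_format)
        return "timestamp"
    except ValueError:
        return "string"

print(infer_string_field("2019-01-04T17:22:00"))          # timestamp
print(infer_string_field("38.05", prefers_decimal=True))  # decimal(4,2)
print(infer_string_field("hello"))                        # string
```

The key change relative to the old code is that the `VALUE_STRING` branch no longer returns `StringType` unconditionally when `prefersDecimal` is off: the timestamp attempt now sits between the decimal attempt and the string fallback.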
[jira] [Comment Edited] (SPARK-26246) Infer timestamp types from JSON
[ https://issues.apache.org/jira/browse/SPARK-26246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734361#comment-16734361 ] Reynold Xin edited comment on SPARK-26246 at 1/4/19 5:22 PM: - Is there an option flag for this? This is a breaking change for people, and we need a way to fallback. was (Author: rxin): |Is there an option flag for this? This is a breaking change for people, and we need a way to fallback.| > Infer timestamp types from JSON > --- > > Key: SPARK-26246 > URL: https://issues.apache.org/jira/browse/SPARK-26246 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > Currently, TimestampType cannot be inferred from JSON. To parse JSON strings, > you have to specify the schema explicitly if the JSON input contains timestamps. This > ticket aims to extend JsonInferSchema to support such inference. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26246) Infer timestamp types from JSON
[ https://issues.apache.org/jira/browse/SPARK-26246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734361#comment-16734361 ] Reynold Xin commented on SPARK-26246: - Is there an option flag for this? This is a breaking change for people, and we need a way to fallback. > Infer timestamp types from JSON > --- > > Key: SPARK-26246 > URL: https://issues.apache.org/jira/browse/SPARK-26246 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > Currently, TimestampType cannot be inferred from JSON. To parse JSON strings, > you have to specify the schema explicitly if the JSON input contains timestamps. This > ticket aims to extend JsonInferSchema to support such inference. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26362) Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark contexts
[ https://issues.apache.org/jira/browse/SPARK-26362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734355#comment-16734355 ] Reynold Xin commented on SPARK-26362: - [~hyukjin.kwon] please make sure we add the releasenotes label to tickets like this. > Remove 'spark.driver.allowMultipleContexts' to disallow multiple Spark > contexts > --- > > Key: SPARK-26362 > URL: https://issues.apache.org/jira/browse/SPARK-26362 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: releasenotes > Fix For: 3.0.0 > > > Multiple Spark contexts are discouraged, and a warning has been issued for 4 years > (see SPARK-4180). > It could cause arbitrary and mysterious error cases. (Honestly, I didn't even > know Spark allowed it). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
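The single-context rule referenced here (and in the `assertNoOtherContextIsRunning` diff earlier in this digest) boils down to a lock-protected check at construction time. A toy Python model of that check — not Spark's actual implementation:

```python
import threading

class Context:
    """Toy model of a one-per-process context; illustrative only."""
    _lock = threading.Lock()
    _active = None  # the single running context, if any

    def __init__(self, name):
        self.name = name
        cls = type(self)
        with cls._lock:
            # With allowMultipleContexts removed, a second running context
            # is always an error rather than an optionally-ignored warning.
            if cls._active is not None:
                raise RuntimeError(
                    f"Only one Context may be running; '{cls._active.name}' is active")
            cls._active = self

    def stop(self):
        cls = type(self)
        with cls._lock:
            if cls._active is self:
                cls._active = None

first = Context("first")
try:
    Context("second")      # fails while "first" is running
except RuntimeError as e:
    print(e)
first.stop()
third = Context("third")   # allowed again once the old context is stopped
```

Before this change, a config flag let the second construction degrade to a warning; removing the flag makes the error path unconditional.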
[jira] [Updated] (SPARK-26536) Upgrade Mockito to 2.23.4
[ https://issues.apache.org/jira/browse/SPARK-26536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26536: -- Description: This issue upgrades Mockito from 1.10.19 to 2.23.4. The following changes are required. - Replace `org.mockito.Matchers` with `org.mockito.ArgumentMatchers` - Replace `anyObject` with `any` - Replace `getArgumentAt` with `getArgument` and add a type annotation. - Use the `isNull` matcher when `null` is passed. {code} saslHandler.channelInactive(null); -verify(handler).channelInactive(any(TransportClient.class)); +verify(handler).channelInactive(isNull()); {code} - Make and use a `doReturn` wrapper to avoid [SI-4775|https://issues.scala-lang.org/browse/SI-4775] {code} private def doReturn(value: Any) = org.mockito.Mockito.doReturn(value, Seq.empty: _*) {code} > Upgrade Mockito to 2.23.4 > - > > Key: SPARK-26536 > URL: https://issues.apache.org/jira/browse/SPARK-26536 > Project: Spark > Issue Type: Sub-task > Components: Build, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > This issue upgrades Mockito from 1.10.19 to 2.23.4. The following changes are > required. > - Replace `org.mockito.Matchers` with `org.mockito.ArgumentMatchers` > - Replace `anyObject` with `any` > - Replace `getArgumentAt` with `getArgument` and add a type annotation. > - Use the `isNull` matcher when `null` is passed. > {code} > saslHandler.channelInactive(null); > -verify(handler).channelInactive(any(TransportClient.class)); > +verify(handler).channelInactive(isNull()); > {code} > - Make and use a `doReturn` wrapper to avoid > [SI-4775|https://issues.scala-lang.org/browse/SI-4775] > {code} > private def doReturn(value: Any) = org.mockito.Mockito.doReturn(value, > Seq.empty: _*) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26536) Upgrade Mockito to 2.23.4
[ https://issues.apache.org/jira/browse/SPARK-26536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26536: Assignee: (was: Apache Spark) > Upgrade Mockito to 2.23.4 > - > > Key: SPARK-26536 > URL: https://issues.apache.org/jira/browse/SPARK-26536 > Project: Spark > Issue Type: Sub-task > Components: Build, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26536) Upgrade Mockito to 2.23.4
[ https://issues.apache.org/jira/browse/SPARK-26536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26536: Assignee: Apache Spark > Upgrade Mockito to 2.23.4 > - > > Key: SPARK-26536 > URL: https://issues.apache.org/jira/browse/SPARK-26536 > Project: Spark > Issue Type: Sub-task > Components: Build, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26536) Upgrade Mockito to 2.23.4
[ https://issues.apache.org/jira/browse/SPARK-26536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26536: -- Component/s: Tests > Upgrade Mockito to 2.23.4 > - > > Key: SPARK-26536 > URL: https://issues.apache.org/jira/browse/SPARK-26536 > Project: Spark > Issue Type: Sub-task > Components: Build, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26536) Upgrade Mockito to 2.23.4
Dongjoon Hyun created SPARK-26536: - Summary: Upgrade Mockito to 2.23.4 Key: SPARK-26536 URL: https://issues.apache.org/jira/browse/SPARK-26536 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26535) Parsing literals as DOUBLE instead of DECIMAL
Marco Gaido created SPARK-26535: --- Summary: Parsing literals as DOUBLE instead of DECIMAL Key: SPARK-26535 URL: https://issues.apache.org/jira/browse/SPARK-26535 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Marco Gaido As pointed out by [~dkbiswal]'s comment https://github.com/apache/spark/pull/22450#issuecomment-423082389, most other RDBMSs (DB2, Presto, Hive, MSSQL) treat literals as DOUBLE by default, whereas Spark currently treats them as DECIMAL. This is quite problematic, especially for operations on decimals, where we base our implementation on Hive/MSSQL. This ticket therefore proposes resolving literals as DOUBLE by default, with a config that allows reverting to the previous behavior.
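A minimal standalone sketch (not from the ticket, and not Spark code; the object name is invented) of why the default matters: exact DECIMAL arithmetic and binary DOUBLE arithmetic can disagree on the very same literal text.

```scala
object LiteralSemantics {
  def main(args: Array[String]): Unit = {
    // Parsed as exact decimals: 0.1 + 0.2 is exactly 0.3.
    val dec = BigDecimal("0.1") + BigDecimal("0.2")
    // Parsed as binary doubles: 0.1 + 0.2 is 0.30000000000000004.
    val dbl = 0.1 + 0.2
    println(dec == BigDecimal("0.3")) // true
    println(dbl == 0.3)               // false
  }
}
```

Which default is "right" depends on whether a query expects exact decimal semantics (Hive/MSSQL style) or floating-point semantics (DB2/Presto style), which is why the ticket keeps the old behavior behind a config.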
[jira] [Commented] (SPARK-26436) Dataframe resulting from a GroupByKey and flatMapGroups operation throws java.lang.UnsupportedException when groupByKey is applied on it.
[ https://issues.apache.org/jira/browse/SPARK-26436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734267#comment-16734267 ] Manish commented on SPARK-26436: [~viirya]: Thanks for the explanation. But should groupByKey on rows with GenericSchema not work, when a repartition on the same dataframe and same column succeeds. > Dataframe resulting from a GroupByKey and flatMapGroups operation throws > java.lang.UnsupportedException when groupByKey is applied on it. > - > > Key: SPARK-26436 > URL: https://issues.apache.org/jira/browse/SPARK-26436 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Manish >Priority: Major > > There seems to be a bug on groupByKey api for cases when it (groupByKey) is > applied on a DataSet resulting from a former groupByKey and flatMapGroups > invocation. > In such cases groupByKey throws the following exception: > java.lang.UnsupportedException: fieldIndex on a Row without schema is > undefined. > > Although the dataframe has a valid schema and a groupBy("key") or > repartition($"key") api calls on the same Dataframe and key succeed. 
> >
> > Following is the code that reproduces the scenario:
> >
> {code:scala}
> import org.apache.spark.sql.catalyst.encoders.RowEncoder
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
> import scala.collection.mutable.ListBuffer
>
> object Test {
>   def main(args: Array[String]): Unit = {
>     val values = List(List("1", "One"), List("1", "Two"), List("2", "Three"), List("2", "4"))
>       .map(x => (x(0), x(1)))
>     val session = SparkSession.builder.config("spark.master", "local").getOrCreate
>     import session.implicits._
>     val dataFrame = values.toDF
>     dataFrame.show()
>     dataFrame.printSchema()
>     val newSchema = StructType(dataFrame.schema.fields ++ Array(StructField("Count", IntegerType, false)))
>     val expr = RowEncoder.apply(newSchema)
>     val tranform = dataFrame.groupByKey(row => row.getAs[String]("_1")).flatMapGroups((key, inputItr) => {
>       val inputSeq = inputItr.toSeq
>       val length = inputSeq.size
>       var listBuff = new ListBuffer[Row]()
>       var counter: Int = 0
>       for (i <- 0 until length) {
>         counter += 1
>       }
>       for (i <- 0 until length) {
>         var x = inputSeq(i)
>         listBuff += Row.fromSeq(x.toSeq ++ Array[Int](counter))
>       }
>       listBuff.iterator
>     })(expr)
>     tranform.show
>     val newSchema1 = StructType(tranform.schema.fields ++ Array(StructField("Count1", IntegerType, false)))
>     val expr1 = RowEncoder.apply(newSchema1)
>     val tranform2 = tranform.groupByKey(row => row.getAs[String]("_1")).flatMapGroups((key, inputItr) => {
>       val inputSeq = inputItr.toSeq
>       val length = inputSeq.size
>       var listBuff = new ListBuffer[Row]()
>       var counter: Int = 0
>       for (i <- 0 until length) {
>         counter += 1
>       }
>       for (i <- 0 until length) {
>         var x = inputSeq(i)
>         listBuff += Row.fromSeq(x.toSeq ++ Array[Int](counter))
>       }
>       listBuff.iterator
>     })(expr1)
>     tranform2.show
>   }
> }
>
> Test.main(null)
> {code}
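A minimal standalone model of the failure mode described above (not Spark's actual Row implementation; all names here are invented for illustration): field lookup by name is only defined when a row carries a schema, and rows rebuilt generically carry none.

```scala
// Hypothetical stand-in for a row type whose schema is optional.
final case class SimpleRow(values: Seq[Any], fieldNames: Option[Seq[String]]) {
  def fieldIndex(name: String): Int = fieldNames match {
    case Some(names) if names.contains(name) => names.indexOf(name)
    case Some(_) => throw new IllegalArgumentException(s"no field $name")
    case None => throw new UnsupportedOperationException(
      "fieldIndex on a row without schema is undefined")
  }
  def getAs[T](name: String): T = values(fieldIndex(name)).asInstanceOf[T]
}

object SchemaLessDemo {
  def main(args: Array[String]): Unit = {
    val withSchema = SimpleRow(Seq("1", "One"), Some(Seq("_1", "_2")))
    println(withSchema.getAs[String]("_1")) // 1
    // A row built without a schema, analogous to Row.fromSeq(...):
    val withoutSchema = SimpleRow(Seq("1", "One", 2), None)
    // withoutSchema.getAs[String]("_1") would throw UnsupportedOperationException
  }
}
```

This mirrors the report: lookups by position (or a plain repartition on a column) can still work, while lookups by field name fail once the schema is gone.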
[jira] [Commented] (SPARK-26436) Dataframe resulting from a GroupByKey and flatMapGroups operation throws java.lang.UnsupportedException when groupByKey is applied on it.
[ https://issues.apache.org/jira/browse/SPARK-26436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734282#comment-16734282 ] Liang-Chi Hsieh commented on SPARK-26436: - Sorry I don't know what you mean "should groupByKey on rows with GenericSchema not work, when a repartition on the same dataframe and same column succeeds.". Can you elaborate it? > Dataframe resulting from a GroupByKey and flatMapGroups operation throws > java.lang.UnsupportedException when groupByKey is applied on it. > - > > Key: SPARK-26436 > URL: https://issues.apache.org/jira/browse/SPARK-26436 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Manish >Priority: Major >
[jira] [Created] (SPARK-26534) Closure Cleaner Bug
sam created SPARK-26534: --- Summary: Closure Cleaner Bug Key: SPARK-26534 URL: https://issues.apache.org/jira/browse/SPARK-26534 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.1 Reporter: sam I've found a strange combination of closures where the closure cleaner doesn't seem to be smart enough to remove a reference that is not used. I.e. we get an `org.apache.spark.SparkException: Task not serializable` for a task that is perfectly serializable. In the example below, the only `val` actually needed by the closure of the `map` is `foo`, but it tries to serialise `thingy`. What is odd is that changing this code in a number of subtle ways eliminates the error, which I've tried to highlight using comments inline.
{code:java}
import org.apache.spark.sql._

object Test {
  val sparkSession: SparkSession =
    SparkSession.builder.master("local").appName("app").getOrCreate()

  def apply(): Unit = {
    import sparkSession.implicits._
    val landedData: Dataset[String] =
      sparkSession.sparkContext.makeRDD(Seq("foo", "bar")).toDS()
    // thingy has to be in this outer scope to reproduce; if in someFunc, cannot reproduce
    val thingy: Thingy = new Thingy
    // If not wrapped in someFunc, cannot reproduce
    val someFunc = () => {
      // If we don't reference foo inside the closure (e.g. just use the identity function), cannot reproduce
      val foo: String = "foo"
      thingy.run(block = () => {
        landedData.map(r => {
          r + foo
        })
        .count()
      })
    }
    someFunc()
  }
}

class Thingy {
  def run[R](block: () => R): R = {
    block()
  }
}
{code}
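A common way to debug "Task not serializable" is to attempt plain Java serialization of the closure yourself, which is essentially what Spark does before shipping a task. Below is a standalone sketch of that technique with no Spark dependency; `Heavy` and `SerializableCheck` are invented names, and the lambda behavior assumes Scala 2.12+ (where function literals are serializable).

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Deliberately not Serializable, standing in for `thingy` above.
class Heavy

object SerializableCheck {
  // Returns true if `obj` survives plain Java serialization, the same
  // mechanism Spark uses to ship task closures to executors.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val foo = "foo"
    // Captures only a String; should serialize fine on Scala 2.12+.
    val clean: String => String = r => r + foo
    val heavy = new Heavy
    // Captures the non-serializable Heavy; serialization should fail.
    val dirty: String => String = r => r + heavy.toString
    println(isSerializable(clean))
    println(isSerializable(dirty))
  }
}
```

Running such a check on the inner `map` closure versus the outer `someFunc` is one way to pinpoint which enclosing scope is dragging `thingy` into the serialized graph.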
[jira] [Commented] (SPARK-8602) Shared cached DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734216#comment-16734216 ] Hyukjin Kwon commented on SPARK-8602: - I think we now have one Spark context and multiple Spark sessions. If we use multiple Spark sessions, then we're able to do this. Let me leave this resolved. Please reopen this if I am mistaken. > Shared cached DataFrames > > > Key: SPARK-8602 > URL: https://issues.apache.org/jira/browse/SPARK-8602 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.4.0 >Reporter: John Muller >Priority: Major > > Currently, the only way I can think of to share HiveContexts, SparkContexts, > or cached DataFrames is to use spark-jobserver and spark-jobserver-extras: > https://gist.github.com/anonymous/578385766261d6fa7196#file-exampleshareddf-scala > But HiveServer2 users over plain JDBC cannot access the shared dataframe. > The request is to add this directly to Spark SQL and treat it like a shared temp > table, e.g. > SELECT a, b, c > FROM TableA > CACHE DATAFRAME > This would be very useful for Rollups and Cubes, though I'm not sure what > this may mean for the HiveMetaStore.
[jira] [Resolved] (SPARK-8427) Incorrect ACL checking for partitioned table in Spark SQL-1.4
[ https://issues.apache.org/jira/browse/SPARK-8427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-8427. - Resolution: Not A Problem We're heading for Spark 3.0 and the issue was against Spark 1.4. The code has changed drastically and the information here is obsolete. Please reopen if this problem persists in later Spark versions as well. > Incorrect ACL checking for partitioned table in Spark SQL-1.4 > - > > Key: SPARK-8427 > URL: https://issues.apache.org/jira/browse/SPARK-8427 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 > Environment: CentOS 6 & OS X 10.9.5, Hive-0.13.1, Spark-1.4, Hadoop > 2.6.0 >Reporter: Karthik Subramanian >Priority: Critical > Labels: security > > Problem statement: > While querying a partitioned table using Spark SQL (version 1.4.0), an > access denied exception is observed on the partition the user doesn't belong > to (user permissions are controlled using HDFS ACLs). The same works > correctly in Hive. > Use case: to address multitenancy. > Consider a table containing multiple customers, each customer with > multiple facilities. The table is partitioned by customer and facility. A > user belonging to one facility will not have access to other facilities. This is > enforced using HDFS ACLs on the corresponding directories. When querying the > table as 'user1' belonging to 'facility1' and 'customer1' on a particular > partition (using a 'where' clause), only access to the corresponding directory > should be verified, not the entire table. > The above use case works as expected when using the HIVE client, versions 0.13.1 &
> The query used: select count(*) from customertable where customer=‘customer1’ > and facility=‘facility1’ > Below is the exception received in Spark-shell: > org.apache.hadoop.security.AccessControlException: Permission denied: > user=user1, access=READ_EXECUTE, > inode="/data/customertable/customer=customer2/facility=facility2”:root:supergroup:drwxrwx---:group::r-x,group:facility2:rwx > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkAccessAcl(FSPermissionChecker.java:351) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:253) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:185) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6512) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6494) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPathAccess(FSNamesystem.java:6419) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListingInt(FSNamesystem.java:4954) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListing(FSNamesystem.java:4915) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getListing(NameNodeRpcServer.java:826) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getListing(ClientNamenodeProtocolServerSideTranslatorPB.java:612) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at 
javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) > at > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) > at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1971) > at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1952) > at >
[jira] [Commented] (SPARK-9218) Falls back to getAllPartitions when getPartitionsByFilter fails
[ https://issues.apache.org/jira/browse/SPARK-9218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734218#comment-16734218 ] Hyukjin Kwon commented on SPARK-9218: - ping [~lian cheng] > Falls back to getAllPartitions when getPartitionsByFilter fails > --- > > Key: SPARK-9218 > URL: https://issues.apache.org/jira/browse/SPARK-9218 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Major > > [PR #7492|https://github.com/apache/spark/pull/7492] enables Hive partition > predicate push-down by leveraging {{Hive.getPartitionsByFilter}}. Although > this optimization is pretty effective, we did observe some failures like this: > {noformat} > java.sql.SQLDataException: Invalid character string format for type DECIMAL. > at > org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown > Source) > at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(Unknown > Source) > at > org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(Unknown > Source) > at > org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeQuery(Unknown Source) > at > com.jolbox.bonecp.PreparedStatementHandle.executeQuery(PreparedStatementHandle.java:174) > at > org.datanucleus.store.rdbms.ParamLoggingPreparedStatement.executeQuery(ParamLoggingPreparedStatement.java:381) > at > org.datanucleus.store.rdbms.SQLController.executeStatementQuery(SQLController.java:504) > at > 
org.datanucleus.store.rdbms.query.SQLQuery.performExecute(SQLQuery.java:280) > at org.datanucleus.store.query.Query.executeQuery(Query.java:1786) > at > org.datanucleus.store.query.AbstractSQLQuery.executeWithArray(AbstractSQLQuery.java:339) > at org.datanucleus.api.jdo.JDOQuery.executeWithArray(JDOQuery.java:312) > at > org.apache.hadoop.hive.metastore.MetaStoreDirectSql.getPartitionsViaSqlFilterInternal(MetaStoreDirectSql.java:300) > at > org.apache.hadoop.hive.metastore.MetaStoreDirectSql.getPartitionsViaSqlFilter(MetaStoreDirectSql.java:211) > at > org.apache.hadoop.hive.metastore.ObjectStore$4.getSqlResult(ObjectStore.java:2320) > at > org.apache.hadoop.hive.metastore.ObjectStore$4.getSqlResult(ObjectStore.java:2317) > at > org.apache.hadoop.hive.metastore.ObjectStore$GetHelper.run(ObjectStore.java:2208) > at > org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilterInternal(ObjectStore.java:2317) > at > org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilter(ObjectStore.java:2165) > at sun.reflect.GeneratedMethodAccessor126.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:108) > at com.sun.proxy.$Proxy21.getPartitionsByFilter(Unknown Source) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_partitions_by_filter(HiveMetaStore.java:3760) > at sun.reflect.GeneratedMethodAccessor125.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105) > at com.sun.proxy.$Proxy23.get_partitions_by_filter(Unknown Source) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsByFilter(HiveMetaStoreClient.java:903) > at 
sun.reflect.GeneratedMethodAccessor124.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) > at com.sun.proxy.$Proxy24.listPartitionsByFilter(Unknown Source) > at > org.apache.hadoop.hive.ql.metadata.Hive.getPartitionsByFilter(Hive.java:1944) > at sun.reflect.GeneratedMethodAccessor123.invoke(Unknown Source) > at >
[jira] [Resolved] (SPARK-8370) Add API for data sources to register databases
[ https://issues.apache.org/jira/browse/SPARK-8370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-8370. - Resolution: Not A Problem > Add API for data sources to register databases > -- > > Key: SPARK-8370 > URL: https://issues.apache.org/jira/browse/SPARK-8370 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Santiago M. Mola >Priority: Major > > This API would allow registering a database with a data source instead of > just a table. Registering a data source database would register all its tables > and keep the catalog updated. The catalog could delegate to the data > source lookups of tables for a database registered with this API.
[jira] [Resolved] (SPARK-8602) Shared cached DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-8602. - Resolution: Not A Problem > Shared cached DataFrames > > > Key: SPARK-8602 > URL: https://issues.apache.org/jira/browse/SPARK-8602 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.4.0 >Reporter: John Muller >Priority: Major >
[jira] [Resolved] (SPARK-8577) ScalaReflectionLock.synchronized can cause deadlock
[ https://issues.apache.org/jira/browse/SPARK-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-8577. - Resolution: Not A Problem I'm leaving this resolved since we removed Scala 2.10 support. > ScalaReflectionLock.synchronized can cause deadlock > --- > > Key: SPARK-8577 > URL: https://issues.apache.org/jira/browse/SPARK-8577 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: koert kuipers >Priority: Minor > > Just a heads up: I was doing some basic coding using DataFrame, Row, > StructType, etc. in my own project and I ended up with deadlocks in my sbt > tests due to the usage of ScalaReflectionLock.synchronized in the Spark SQL > code. > The issue went away when I changed my build to have: > parallelExecution in Test := false > so that the tests run consecutively.
[jira] [Resolved] (SPARK-8403) Pruner partition won't effective when partition field and fieldSchema exist in sql predicate
[ https://issues.apache.org/jira/browse/SPARK-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-8403. - Resolution: Incomplete Leaving this resolved due to the reporter's inactivity. > Pruner partition won't effective when partition field and fieldSchema exist > in sql predicate > > > Key: SPARK-8403 > URL: https://issues.apache.org/jira/browse/SPARK-8403 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Hong Shen >Priority: Major > > When a partition field and a regular fieldSchema both appear in the SQL > predicate, partition pruning is not effective. > Here is the sql, > {code} > select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r > where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)>16) > {code} > Table t_dw_qqlive_209026 is partitioned by imp_date; itimestamp is a > fieldSchema in t_dw_qqlive_209026. > When run on Hive, it will only scan data in partition 20150615, but when run on > Spark SQL, it will scan the whole table t_dw_qqlive_209026.
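A standalone sketch of the intended behavior (made-up types, not Spark/Hive internals): only predicates over partition columns should drive directory pruning, while predicates over regular columns are evaluated later per row rather than disabling pruning.

```scala
object PruningSketch {
  // Hypothetical model: a partition is a directory keyed by its imp_date value.
  case class Partition(impDate: Int, files: Seq[String])

  // Prune using the partition-column predicate only; non-partition
  // predicates (like hour(itimestamp) > 16) must not force a full scan.
  def prunePartitions(parts: Seq[Partition], datePred: Int => Boolean): Seq[Partition] =
    parts.filter(p => datePred(p.impDate))

  def main(args: Array[String]): Unit = {
    val table = Seq(Partition(20150614, Seq("a")), Partition(20150615, Seq("b")))
    val scanned = prunePartitions(table, _ == 20150615)
    println(scanned.map(_.impDate)) // List(20150615)
  }
}
```

The bug report amounts to Spark SQL behaving as if `prunePartitions` were skipped whenever the predicate also mentions a non-partition column.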
[jira] [Commented] (SPARK-8370) Add API for data sources to register databases
[ https://issues.apache.org/jira/browse/SPARK-8370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734205#comment-16734205 ] Hyukjin Kwon commented on SPARK-8370: - I think we're working on multiple catalog support. I am resolving this. Please reopen this if I am mistaken. Otherwise, let's track this in DataSourceV2. > Add API for data sources to register databases > -- > > Key: SPARK-8370 > URL: https://issues.apache.org/jira/browse/SPARK-8370 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Santiago M. Mola >Priority: Major > > This API would allow registering a database with a data source instead of > just a table. Registering a data source database would register all its tables > and keep the catalog updated. The catalog could delegate to the data > source lookups of tables for a database registered with this API.
[jira] [Commented] (SPARK-7754) [SQL] Use PartialFunction literals instead of objects in Catalyst
[ https://issues.apache.org/jira/browse/SPARK-7754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734197#comment-16734197 ] Hyukjin Kwon commented on SPARK-7754: - I don't think we're going to change this if it mainly targets readability. It's been a few years already and the code looks okay. I'm resolving this since there doesn't seem to be much interest in it. Please reopen this if I am mistaken. > [SQL] Use PartialFunction literals instead of objects in Catalyst > - > > Key: SPARK-7754 > URL: https://issues.apache.org/jira/browse/SPARK-7754 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Edoardo Vacchi >Priority: Minor > > Catalyst rules extend two distinct "rule" types: {{Rule[LogicalPlan]}} and > {{Strategy}} (which is an alias for {{GenericStrategy[SparkPlan]}}). > The distinction is fairly subtle: in the end, both rule types are supposed to > define a method {{apply(plan: LogicalPlan)}} > (where LogicalPlan is either Logical- or Spark-) which returns a transformed > plan (or a sequence thereof, in the case > of Strategy). > Ceremonies aside, the body of such a method is always of the kind: > {code:java} > def apply(plan: PlanType) = plan match pf > {code} > where `pf` would be some `PartialFunction` of the PlanType: > {code:java} > val pf = { > case ... => ... > } > {code} > This JIRA is a proposal to introduce utility methods to > a) reduce the boilerplate needed to define rewrite rules and > b) turn them back into what they essentially represent: function types. > These changes would be backwards compatible, and would greatly help in > understanding what the code does. The current use of objects is redundant and > possibly confusing.
> *{{Rule[LogicalPlan]}}*
> a) Introduce the utility object
> {code:java}
> object rule
>   def rule(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] =
>     new Rule[LogicalPlan] {
>       def apply (plan: LogicalPlan): LogicalPlan = plan transform pf
>     }
>   def named(name: String)(pf: PartialFunction[LogicalPlan, LogicalPlan]): Rule[LogicalPlan] =
>     new Rule[LogicalPlan] {
>       override val ruleName = name
>       def apply (plan: LogicalPlan): LogicalPlan = plan transform pf
>     }
> {code}
> b) progressively replace the boilerplate-y object definitions; e.g.
> {code:java}
> object MyRewriteRule extends Rule[LogicalPlan] {
>   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
>     case ... => ...
>   }
> {code}
> with
> {code:java}
> // define a Rule[LogicalPlan]
> val MyRewriteRule = rule {
>   case ... => ...
> }
> {code}
> and/or:
> {code:java}
> // define a named Rule[LogicalPlan]
> val MyRewriteRule = rule.named("My rewrite rule") {
>   case ... => ...
> }
> {code}
> *Strategies*
> A similar solution could be applied to shorten the code for Strategies, which
> are total functions only because they are all supposed to manage the default
> case, possibly returning `Nil`. In this case we might introduce the following
> utility:
> {code:java}
> object strategy {
>   /**
>    * Generate a Strategy from a PartialFunction[LogicalPlan, SparkPlan].
>    * The partial function must therefore return *one single* SparkPlan for each case.
>    * The method will automatically wrap them in a [[Seq]].
>    * Unhandled cases will automatically return Seq.empty
>    */
>   def apply(pf: PartialFunction[LogicalPlan, SparkPlan]): Strategy =
>     new Strategy {
>       def apply(plan: LogicalPlan): Seq[SparkPlan] =
>         if (pf.isDefinedAt(plan)) Seq(pf.apply(plan)) else Seq.empty
>     }
>   /**
>    * Generate a Strategy from a PartialFunction[ LogicalPlan, Seq[SparkPlan] ].
>    * The partial function must therefore return a Seq[SparkPlan] for each case.
>    * Unhandled cases will automatically return Seq.empty
>    */
>   def seq(pf: PartialFunction[LogicalPlan, Seq[SparkPlan]]): Strategy =
>     new Strategy {
>       def apply(plan: LogicalPlan): Seq[SparkPlan] =
>         if (pf.isDefinedAt(plan)) pf.apply(plan) else Seq.empty[SparkPlan]
>     }
> }
> {code}
> Usage:
> {code:java}
> val mystrategy = strategy { case ... => ... }
> val seqstrategy = strategy.seq { case ... => ... }
> {code}
> *Further possible improvements:*
> Making the utility methods `implicit`, thereby further reducing the rewrite
> rules to:
> {code:java}
> // define a PartialFunction[LogicalPlan, LogicalPlan]
> // the implicit would convert it into a Rule[LogicalPlan] at the use sites
> val MyRewriteRule = {
>   case ... => ...
> }
> {code}
> *Caveats*
> Because of the way objects are initialized vs. vals, it might be necessary to
> reorder instructions so that vals are actually initialized before they
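The pattern proposed above can be exercised without any Spark dependency. The following standalone sketch applies the same idea to a toy plan type (all names here are invented for illustration, not Catalyst's actual API): a `PartialFunction` is wrapped into a total rewrite that is the identity where undefined, and into a strategy that yields `Seq.empty` for unhandled cases.

```scala
object RulePattern {
  // Toy stand-in for LogicalPlan.
  sealed trait Plan
  case class Add(l: Int, r: Int) extends Plan
  case class Lit(v: Int) extends Plan

  // Analogue of the proposal's `rule` utility: identity where the
  // PartialFunction is not defined.
  def rule(pf: PartialFunction[Plan, Plan]): Plan => Plan =
    plan => if (pf.isDefinedAt(plan)) pf(plan) else plan

  // Analogue of the proposal's `strategy` utility: unhandled cases
  // yield Seq.empty instead of a default match arm.
  def strategy(pf: PartialFunction[Plan, Plan]): Plan => Seq[Plan] =
    plan => if (pf.isDefinedAt(plan)) Seq(pf(plan)) else Seq.empty

  // A rewrite rule is now just a PartialFunction literal.
  val constantFold = rule { case Add(l, r) => Lit(l + r) }
}
```

For example, `RulePattern.constantFold(RulePattern.Add(1, 2))` rewrites to `Lit(3)`, while a `Lit` node passes through unchanged, which is exactly the boilerplate the ticket wants to eliminate from each `object ... extends Rule[LogicalPlan]` definition.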