[jira] [Resolved] (SPARK-29612) ALTER TABLE (RECOVER PARTITIONS) should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29612. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26269 [https://github.com/apache/spark/pull/26269] > ALTER TABLE (RECOVER PARTITIONS) should look up catalog/table like v2 commands > -- > > Key: SPARK-29612 > URL: https://issues.apache.org/jira/browse/SPARK-29612 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > > ALTER TABLE (RECOVER PARTITIONS) should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29612) ALTER TABLE (RECOVER PARTITIONS) should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29612: --- Assignee: Huaxin Gao > ALTER TABLE (RECOVER PARTITIONS) should look up catalog/table like v2 commands > -- > > Key: SPARK-29612 > URL: https://issues.apache.org/jira/browse/SPARK-29612 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > > ALTER TABLE (RECOVER PARTITIONS) should look up catalog/table like v2 commands
[jira] [Updated] (SPARK-29573) Spark should work as PostgreSQL when using + Operator
[ https://issues.apache.org/jira/browse/SPARK-29573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK KUMAR GUPTA updated SPARK-29573: - Description: Spark and PostgreSQL return different results when concatenating with the + operator, as shown below:
{code}
Spark: returns NULL
0: jdbc:hive2://10.18.19.208:23040/default> select * from emp12;
+-----+--------+
|  id |  name  |
+-----+--------+
|  20 |  test  |
|  10 | number |
+-----+--------+
2 rows selected (3.683 seconds)
0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as address from emp12;
+-----+----------+
|  ID | address  |
+-----+----------+
|  20 |   NULL   |
|  10 |   NULL   |
+-----+----------+
2 rows selected (0.649 seconds)
0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+','+name as address from emp12;
+-----+----------+
|  ID | address  |
+-----+----------+
|  20 |   NULL   |
|  10 |   NULL   |
+-----+----------+

PostgreSQL: throws an error saying the operation is not supported
create table emp12(id int,name varchar(255));
insert into emp12 values(10,'number');
insert into emp12 values(20,'test');
select id as ID, id+','+name as address from emp12;
output: invalid input syntax for integer: ","
select id as ID, id+name as address from emp12;
Output: 42883: operator does not exist: integer + character varying
{code}
Spark should also throw an error if the operation is not supported.
> Spark should work as PostgreSQL when using + Operator
> -----------------------------------------------------
>
> Key: SPARK-29573
> URL: https://issues.apache.org/jira/browse/SPARK-29573
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: ABHISHEK KUMAR GUPTA
> Priority: Major
>
> Spark and PostgreSQL return different results when concatenating with the + operator: Spark returns NULL where PostgreSQL raises an error (see description).
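The NULL-versus-error divergence reported above can be sketched in plain Python (hypothetical helpers `spark_plus` and `postgres_plus`; an illustration of the two coercion strategies, not the engines' actual code):

```python
def spark_plus(a, b):
    """Spark-style '+': implicitly cast both sides to double; a failed cast
    becomes NULL (None here), and NULL propagates through arithmetic."""
    def to_double(v):
        try:
            return float(v)
        except (TypeError, ValueError):
            return None
    x, y = to_double(a), to_double(b)
    return None if x is None or y is None else x + y


def postgres_plus(a, b):
    """PostgreSQL-style '+': there is no integer + varchar operator, so the
    query fails outright instead of silently returning NULL."""
    if isinstance(a, str) or isinstance(b, str):
        raise TypeError("operator does not exist: integer + character varying")
    return a + b


print(spark_plus(10, "number"))  # None -- the NULL column in the output above
print(spark_plus(10, 5))         # 15.0
```

The ticket asks Spark to follow the second strategy and fail loudly when the operand types make no sense.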
[jira] [Commented] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961672#comment-16961672 ] Brandon commented on SPARK-24918: - [~nsheth] placing the plugin class inside a jar and passing it with `--jars` to spark-submit should be sufficient, right? It seems this is not enough to make the class visible to the executor. I have had to explicitly add this jar to `spark.executor.extraClassPath` for plugins to load correctly. > Executor Plugin API > --- > > Key: SPARK-24918 > URL: https://issues.apache.org/jira/browse/SPARK-24918 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Nihar Sheth >Priority: Major > Labels: SPIP, memory-analysis > Fix For: 2.4.0 > > > It would be nice if we could specify an arbitrary class to run within each > executor for debugging and instrumentation. It's hard to do this currently > because: > a) you have no idea when executors will come and go with DynamicAllocation, > so don't have a chance to run custom code before the first task > b) even with static allocation, you'd have to change the code of your spark > app itself to run a special task to "install" the plugin, which is often > tough in production cases when those maintaining regularly running > applications might not even know how to make changes to the application. > For example, https://github.com/squito/spark-memory could be used in a > debugging context to understand memory use, just by re-running an application > with extra command line arguments (as opposed to rebuilding spark). > I think one tricky part here is just deciding the api, and how it's versioned. > Does it just get created when the executor starts, and that's it? Or does it > get more specific events, like task start, task end, etc? Would we ever add > more events? It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but still should be avoided. 
We could > create a base class that has no-op implementations, or explicitly version > everything. > Note that this is not needed in the driver as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, it's still good enough).
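One of the options discussed above, a base class with no-op implementations, can be sketched in Python (hypothetical class and hook names; the real API is JVM-side, and these events are only the ones the thread speculates about):

```python
class ExecutorPluginBase:
    """No-op defaults: subclasses override only the hooks they need, so new
    hooks can be added later without breaking existing plugins."""
    def init(self):
        pass
    def task_start(self, task_id):
        pass
    def task_end(self, task_id):
        pass
    def shutdown(self):
        pass


class MemoryMonitorPlugin(ExecutorPluginBase):
    """Example plugin: record task starts (where one might snapshot memory)."""
    def __init__(self):
        self.started = []
    def task_start(self, task_id):
        self.started.append(task_id)


plugin = MemoryMonitorPlugin()
plugin.init()           # inherited no-op
plugin.task_start(1)
plugin.task_end(1)      # inherited no-op: ignoring an event costs nothing
plugin.shutdown()
```

The appeal of the no-op base class is exactly what the comment notes: adding a new event later does not break existing plugins, at the cost of weaker explicit versioning.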
[jira] [Updated] (SPARK-29601) JDBC ODBC Tab Statement column should be provided ellipsis for bigger SQL statement
[ https://issues.apache.org/jira/browse/SPARK-29601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK KUMAR GUPTA updated SPARK-29601: - Description: The Statement column in the JDBC/ODBC tab shows the whole SQL statement, so the page size grows. If a user submits TPCDS queries, the page displays the whole query under the Statement column, which is a poor user experience. Expected: it should display an ellipsis (...), and clicking the statement should expand it to show the whole SQL statement. was: The Statement column in the JDBC/ODBC tab shows the whole SQL statement, so the page size grows. If a user submits TPCDS queries, the page displays the whole query under the Statement column, which is a poor user experience. Expected: it should display an ellipsis (...), and clicking the statement should expand it to show the whole SQL statement. e.g. SELECT * FROM (SELECT count(*) h8_30_to_9 FROM store_sales, household_demographics, time_dim, store WHERE ss_sold_time_sk = time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND ss_store_sk = s_store_sk AND time_dim.t_hour = 8 AND time_dim.t_minute >= 30 AND ((household_demographics.hd_dep_count = 4 AND household_demographics.hd_vehicle_count <= 4 + 2) OR (household_demographics.hd_dep_count = 2 AND household_demographics.hd_vehicle_count <= 2 + 2) OR (household_demographics.hd_dep_count = 0 AND household_demographics.hd_vehicle_count <= 0 + 2)) AND store.s_store_name = 'ese') s1, (SELECT count(*) h9_to_9_30 FROM store_sales, household_demographics, time_dim, store WHERE ss_sold_time_sk = time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND ss_store_sk = s_store_sk AND time_dim.t_hour = 9 AND time_dim.t_minute < 30 AND ((household_demographics.hd_dep_count = 4 AND household_demographics.hd_vehicle_count <= 4 + 2) OR (household_demographics.hd_dep_count = 2 AND household_demographics.hd_vehicle_count <= 2 + 2) OR (household_demographics.hd_dep_count = 0 AND household_demographics.hd_vehicle_count <= 0 + 2))
AND store.s_store_name = 'ese') s2, (SELECT count(*) h9_30_to_10 FROM store_sales, household_demographics, time_dim, store WHERE ss_sold_time_sk = time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND ss_store_sk = s_store_sk AND time_dim.t_hour = 9 AND time_dim.t_minute >= 30 AND ((household_demographics.hd_dep_count = 4 AND household_demographics.hd_vehicle_count <= 4 + 2) OR (household_demographics.hd_dep_count = 2 AND household_demographics.hd_vehicle_count <= 2 + 2) OR (household_demographics.hd_dep_count = 0 AND household_demographics.hd_vehicle_count <= 0 + 2)) AND store.s_store_name = 'ese') s3,(SELECT count(*) h10_to_10_30 FROM store_sales, household_demographics, time_dim, store WHERE ss_sold_time_sk = time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND ss_store_sk = s_store_sk AND time_dim.t_hour = 10 AND time_dim.t_minute < 30 AND ((household_demographics.hd_dep_count = 4 AND household_demographics.hd_vehicle_count <= 4 + 2) OR (household_demographics.hd_dep_count = 2 AND household_demographics.hd_vehicle_count <= 2 + 2) OR (household_demographics.hd_dep_count = 0 AND household_demographics.hd_vehicle_count <= 0 + 2)) AND store.s_store_name = 'ese') s4,(SELECT count(*) h10_30_to_11 FROM store_sales,household_demographics, time_dim, store WHERE ss_sold_time_sk = time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND ss_store_sk = s_store_sk AND time_dim.t_hour = 10 AND time_dim.t_minute >= 30 AND ((household_demographics.hd_dep_count = 4 AND household_demographics.hd_vehicle_count <= 4 + 2) OR (household_demographics.hd_dep_count = 2 AND household_demographics.hd_vehicle_count <= 2 + 2) OR (household_demographics.hd_dep_count = 0 AND household_demographics.hd_vehicle_count <= 0 + 2)) AND store.s_store_name = 'ese') s5,(SELECT count(*) h11_to_11_30 FROM store_sales, household_demographics, time_dim, store WHERE ss_sold_time_sk = time_dim.t_time_sk AND ss_hdemo_sk = 
household_demographics.hd_demo_sk AND ss_store_sk = s_store_sk AND time_dim.t_hour = 11 AND time_dim.t_minute < 30 AND ((household_demographics.hd_dep_count = 4 AND household_demographics.hd_vehicle_count <= 4 + 2) OR (household_demographics.hd_dep_count = 2 AND household_demographics.hd_vehicle_count <= 2 + 2) OR (household_demographics.hd_dep_count = 0 AND household_demographics.hd_vehicle_count <= 0 + 2)) AND store.s_store_name = 'ese') s6,(SELECT count(*) h11_30_to_12 FROM store_sales, household_demographics, time_dim, store WHERE ss_sold_time_sk = time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND ss_store_sk = s_store_sk AND time_dim.t_hour = 11 AND time_dim.t_minute >= 30 AND ((household_demographics.hd_dep_count = 4 AND household_demographics.hd_vehicle_count <= 4 + 2) OR (household_demographics.hd_dep_count = 2 AND
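The requested behaviour, truncating long statements with an ellipsis and expanding on click, can be sketched with a small helper (hypothetical name `ellipsize`; this is not Spark's UI code):

```python
def ellipsize(statement, limit=100):
    """Collapse a long SQL statement for display in the Statement column;
    the full text would be revealed when the row is clicked."""
    if len(statement) <= limit:
        return statement
    return statement[:limit - 3] + "..."


# The TPCDS query above would shrink to a single short cell:
tpcds = ("SELECT * FROM (SELECT count(*) h8_30_to_9 FROM store_sales, "
         "household_demographics, time_dim, store WHERE ...")
print(ellipsize(tpcds, 40))
```

Short statements pass through unchanged, so the column only changes for the pathological cases described in the ticket.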
[jira] [Updated] (SPARK-29600) array_contains built in function is not backward compatible in 3.0
[ https://issues.apache.org/jira/browse/SPARK-29600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK KUMAR GUPTA updated SPARK-29600: - Description: SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); throws an exception in 3.0, whereas it works fine in 2.3.2.
{code}
Spark 3.0 output:
0: jdbc:hive2://10.18.19.208:23040/default> SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2);
Error: org.apache.spark.sql.AnalysisException: cannot resolve 'array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1BD AS DECIMAL(13,3)), CAST(0.2BD AS DECIMAL(13,3)), CAST(0.3BD AS DECIMAL(13,3)), CAST(0.5BD AS DECIMAL(13,3)), CAST(0.02BD AS DECIMAL(13,3)), CAST(0.033BD AS DECIMAL(13,3))), 0.2BD)' due to data type mismatch: Input to function array_contains should have been array followed by a value with same element type, but it's [array, decimal(1,1)].; line 1 pos 7;
'Project [unresolvedalias(array_contains(array(cast(0 as decimal(13,3)), cast(0.1 as decimal(13,3)), cast(0.2 as decimal(13,3)), cast(0.3 as decimal(13,3)), cast(0.5 as decimal(13,3)), cast(0.02 as decimal(13,3)), cast(0.033 as decimal(13,3))), 0.2), None)]
+- OneRowRelation (state=,code=0)

Spark 2.3.2 output:
0: jdbc:hive2://10.18.18.214:23040/default> SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2);
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1 AS DECIMAL(13,3)), CAST(0.2 AS DECIMAL(13,3)), CAST(0.3 AS DECIMAL(13,3)), CAST(0.5 AS DECIMAL(13,3)), CAST(0.02 AS DECIMAL(13,3)), CAST(0.033 AS DECIMAL(13,3))), CAST(0.2 AS DECIMAL(13,3))) |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| true |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row selected (0.18 seconds)
{code}
> array_contains built in function is not backward compatible in 3.0
> ------------------------------------------------------------------
>
> Key: SPARK-29600
> URL: https://issues.apache.org/jira/browse/SPARK-29600
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: ABHISHEK KUMAR GUPTA
> Priority: Major
>
> SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); throws >
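The error message above shows the literal `.2` typed as decimal(1,1) while the array elements are decimal(13,3), and the 2.3.2 output shows that version inserted `CAST(0.2 AS DECIMAL(13,3))` on the value. A pure-Python `decimal` sketch of that cast (an illustration only, not Spark's analyzer):

```python
from decimal import Decimal

# Elements of array(0, 0.1, ..., 0.033) after coercion to DECIMAL(13,3).
array_elems = [Decimal(v).quantize(Decimal("0.001"))
               for v in ("0", "0.1", "0.2", "0.3", "0.5", "0.02", "0.033")]

value = Decimal("0.2")                     # the literal .2, typed DECIMAL(1,1)
casted = value.quantize(Decimal("0.001"))  # the CAST(.2 AS DECIMAL(13,3))
                                           # that 2.3.2 inserted and 3.0 does not

# Python's Decimal compares numerically regardless of scale; in Spark 3.0 it
# is the analyzer that rejects the mismatched types before evaluation happens.
print(casted in array_elems)  # True
```

The regression is thus not in the comparison itself but in the missing implicit cast between the two decimal types.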
[jira] [Commented] (SPARK-29477) Improve tooltip information for Streaming Tab
[ https://issues.apache.org/jira/browse/SPARK-29477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961662#comment-16961662 ] ABHISHEK KUMAR GUPTA commented on SPARK-29477: -- [~hyukjin.kwon]: When a user submits a streaming job, the Streaming tab is displayed in the Web UI. This tab shows streaming statistics as graphs, with two tables below them: Active Batches and Completed Batches. In the Active Batches table, tooltips can be provided for columns like Batch Time and Records; the same applies to the Completed Batches table columns. > Improve tooltip information for Streaming Tab > -- > > Key: SPARK-29477 > URL: https://issues.apache.org/jira/browse/SPARK-29477 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > The Active Batches and Completed Batches tables can be reviewed to add proper > tooltips for the Batch Time and Record columns
[jira] [Commented] (SPARK-29222) Flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_parameter_convergence
[ https://issues.apache.org/jira/browse/SPARK-29222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961636#comment-16961636 ] Hyukjin Kwon commented on SPARK-29222: -- Can you make a PR to increase the timeout? > Flaky test: > pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_parameter_convergence > --- > > Key: SPARK-29222 > URL: https://issues.apache.org/jira/browse/SPARK-29222 > Project: Spark > Issue Type: Test > Components: MLlib, Tests >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Minor > > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111237/testReport/] > {code:java} > Error Message: 7 != 10 > Stacktrace: Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 429, in test_parameter_convergence > self._eventually(condition, catch_assertions=True) > File > "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 74, in _eventually > raise lastValue > File > "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 65, in _eventually > lastValue = condition() > File > "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 425, in condition > self.assertEqual(len(model_weights), len(batches)) > AssertionError: 7 != 10 >{code}
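The `_eventually` helper in the traceback retries a condition until a timeout elapses. A minimal sketch of that shape (an assumption about its behavior based on the traceback, not the actual PySpark test code) shows why raising the timeout is the usual fix for this kind of flakiness:

```python
import time

def eventually(condition, timeout=30.0, interval=0.1, catch_assertions=False):
    """Retry `condition` until it passes or `timeout` elapses, then re-raise
    the last failure. On a slow machine fewer batches arrive per attempt, so
    a larger `timeout` gives the streaming job time to catch up."""
    deadline = time.monotonic() + timeout
    last_exc = None
    while time.monotonic() < deadline:
        try:
            return condition()
        except AssertionError as e:
            if not catch_assertions:
                raise
            last_exc = e
        time.sleep(interval)
    raise last_exc if last_exc else TimeoutError("condition never passed")


# Toy condition that, like the flaky test, only passes after a few attempts.
calls = []
def cond():
    calls.append(1)
    assert len(calls) >= 3, "%d != 3" % len(calls)
    return len(calls)

result = eventually(cond, timeout=5.0, interval=0.01, catch_assertions=True)
```

With too small a timeout the helper re-raises the last `AssertionError` (the `7 != 10` above); with a generous one the condition eventually passes.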
[jira] [Created] (SPARK-29627) array_contains should allow column instances in PySpark
Hyukjin Kwon created SPARK-29627: Summary: array_contains should allow column instances in PySpark Key: SPARK-29627 URL: https://issues.apache.org/jira/browse/SPARK-29627 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon The Scala API works well with column instances: {code} import org.apache.spark.sql.functions._ val df = Seq(Array("a", "b", "c"), Array.empty[String]).toDF("data") df.select(array_contains($"data", lit("a"))).collect() {code} {code} Array[org.apache.spark.sql.Row] = Array([true], [false]) {code} However, it seems the PySpark one doesn't: {code} from pyspark.sql.functions import array_contains, lit df = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data']) df.select(array_contains(df.data, lit("a"))).show() {code} {code} Traceback (most recent call last): File "", line 1, in File "/.../spark/python/pyspark/sql/functions.py", line 1950, in array_contains return Column(sc._jvm.functions.array_contains(_to_java_column(col), value)) File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1277, in __call__ File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1241, in _build_args File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1228, in _get_args File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_collections.py", line 500, in convert File "/.../spark/python/pyspark/sql/column.py", line 344, in __iter__ raise TypeError("Column is not iterable") TypeError: Column is not iterable {code} We should allow it.
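The fix the ticket asks for amounts to wrapping plain values as literal columns while passing through arguments that are already columns, instead of handing a `Column` to py4j as if it were a plain value. A pure-Python sketch of that dispatch, with stand-in `Column` and `lit` classes (not the real PySpark ones, and no JVM involved):

```python
class Column:
    """Stand-in for pyspark.sql.Column: just wraps an expression tree."""
    def __init__(self, expr):
        self.expr = expr

def lit(value):
    """Stand-in for pyspark.sql.functions.lit."""
    return Column(("literal", value))

def array_contains(col, value):
    """Accept either a plain value or a Column for `value`: the reported
    TypeError came from not branching on the argument's type first."""
    value = value if isinstance(value, Column) else lit(value)
    return Column(("array_contains", col.expr, value.expr))


# Both call styles now build the same expression:
a = array_contains(Column("data"), lit("a"))
b = array_contains(Column("data"), "a")
```

Under this pattern `array_contains(df.data, lit("a"))` and `array_contains(df.data, "a")` become interchangeable, which is the behavior the issue requests.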
[jira] [Created] (SPARK-29626) notEqual() should return true when the one is null, the other is not null
zhouhuazheng created SPARK-29626: Summary: notEqual() should return true when the one is null, the other is not null Key: SPARK-29626 URL: https://issues.apache.org/jira/browse/SPARK-29626 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 2.4.4 Reporter: zhouhuazheng When one value is null and the other is not null, we expect the notEqual() function to return true. e.g.:
scala> df.show()
+------+-------+
|   age|   name|
+------+-------+
|  null|Michael|
|    30|   Andy|
|    19| Justin|
|    35|   null|
|    19| Justin|
|  null|   null|
|Justin| Justin|
|    19|     19|
+------+-------+

scala> df.filter(col("age").notEqual(col("name"))).show
+---+------+
|age|  name|
+---+------+
| 30|  Andy|
| 19|Justin|
| 19|Justin|
+---+------+

scala> df.filter(col("age").equalTo(col("name"))).show
+------+------+
|   age|  name|
+------+------+
|  null|  null|
|Justin|Justin|
|    19|    19|
+------+------+
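The behaviour shown above can be sketched with SQL's three-valued logic in plain Python (`None` standing in for NULL; an illustration of the semantics, not Spark code):

```python
def not_equal(a, b):
    """Current semantics: any comparison with NULL yields NULL, and a filter
    keeps only rows where the predicate is exactly True."""
    if a is None or b is None:
        return None
    return a != b

def null_safe_not_equal(a, b):
    """What the ticket requests: the negation of a null-safe equality
    (like Spark's <=> operator), so NULL vs non-NULL counts as different."""
    return not (a == b)


rows = [(None, "Michael"), (30, "Andy"), (None, None), ("Justin", "Justin")]
current = [r for r in rows if not_equal(*r) is True]
requested = [r for r in rows if null_safe_not_equal(*r)]
print(current)    # [(30, 'Andy')] -- rows with a null vanish from the filter
print(requested)  # [(None, 'Michael'), (30, 'Andy')]
```

This is why the `notEqual` filter above drops every row containing a null: the predicate evaluates to NULL, not to true, for those rows.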
[jira] [Assigned] (SPARK-29566) Imputer should support single-column input/output
[ https://issues.apache.org/jira/browse/SPARK-29566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-29566: Assignee: Huaxin Gao > Imputer should support single-column input/output > > > Key: SPARK-29566 > URL: https://issues.apache.org/jira/browse/SPARK-29566 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Minor > > Imputer should support single-column input/output > refer to https://issues.apache.org/jira/browse/SPARK-29565
[jira] [Resolved] (SPARK-29566) Imputer should support single-column input/output
[ https://issues.apache.org/jira/browse/SPARK-29566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-29566. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26247 [https://github.com/apache/spark/pull/26247] > Imputer should support single-column input/output > > > Key: SPARK-29566 > URL: https://issues.apache.org/jira/browse/SPARK-29566 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > Imputer should support single-column input/output > refer to https://issues.apache.org/jira/browse/SPARK-29565
[jira] [Resolved] (SPARK-29549) Union of DataSourceV2 datasources leads to duplicated results
[ https://issues.apache.org/jira/browse/SPARK-29549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29549. -- Resolution: Cannot Reproduce > Union of DataSourceV2 datasources leads to duplicated results > - > > Key: SPARK-29549 > URL: https://issues.apache.org/jira/browse/SPARK-29549 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4 >Reporter: Miguel Molina >Priority: Major > > Hello! > I've discovered that when two DataSourceV2 data frames in a query of the > exact same shape are joined and there is an aggregation in the query, only > the first results are used. The rest get removed by the ReuseExchange rule > and reuse the results of the first data frame, leading to N times the first > data frame results. > > I've put together a repository with an example project where this can be > reproduced: [https://github.com/erizocosmico/spark-union-issue] > > Basically, doing this: > > {code:java} > val products = spark.sql("SELECT name, COUNT(*) as count FROM products GROUP > BY name") > val users = spark.sql("SELECT name, COUNT(*) as count FROM users GROUP BY > name") > products.union(users) > .select("name") > .show(truncate = false, numRows = 50){code} > > > Where products is: > {noformat} > +-+---+ > |name |id | > +-+---+ > |candy |1 | > |chocolate|2 | > |milk |3 | > |cinnamon |4 | > |pizza |5 | > |pineapple|6 | > +-+---+{noformat} > And users is: > {noformat} > +---+---+ > |name |id | > +---+---+ > |andy |1 | > |alice |2 | > |mike |3 | > |mariah |4 | > |eleanor|5 | > |matthew|6 | > +---+---+ {noformat} > > Results are incorrect: > {noformat} > +-+ > |name | > +-+ > |candy | > |pizza | > |chocolate| > |cinnamon | > |pineapple| > |milk | > |candy | > |pizza | > |chocolate| > |cinnamon | > |pineapple| > |milk | > +-+{noformat} > > This is the plan explained: > > {noformat} > == Parsed Logical Plan == > 'Project [unresolvedalias('name, None)] > +- AnalysisBarrier > +- 
Union > :- Aggregate [name#0], [name#0, count(1) AS count#8L] > : +- SubqueryAlias products > : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], > [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6])) > +- Aggregate [name#4], [name#4, count(1) AS count#12L] > +- SubqueryAlias users > +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], > [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6])) > == Analyzed Logical Plan == > name: string > Project [name#0] > +- Union > :- Aggregate [name#0], [name#0, count(1) AS count#8L] > : +- SubqueryAlias products > : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], > [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6])) > +- Aggregate [name#4], [name#4, count(1) AS count#12L] > +- SubqueryAlias users > +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], > [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6])) > == Optimized Logical Plan == > Union > :- Aggregate [name#0], [name#0] > : +- Project [name#0] > : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], > [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6])) > +- Aggregate [name#4], [name#4] > +- Project [name#4] > +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], > [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6])) > == Physical Plan == > Union > :- *(2) HashAggregate(keys=[name#0], functions=[], output=[name#0]) > : +- Exchange hashpartitioning(name#0, 200) > : +- *(1) HashAggregate(keys=[name#0], functions=[], output=[name#0]) > : +- *(1) Project [name#0] > : +- *(1) DataSourceV2Scan [name#0, id#1], DefaultReader(List([candy,1], > [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6])) > +- *(4) HashAggregate(keys=[name#4], functions=[], output=[name#4]) > +- ReusedExchange [name#4], Exchange hashpartitioning(name#0, 200) > {noformat} > > > In the physical plan, the first exchange is reused, but 
it shouldn't be > because both sources are not the same. > > {noformat} > == Physical Plan == > Union > :- *(2) HashAggregate(keys=[name#0], functions=[], output=[name#0]) > : +- Exchange hashpartitioning(name#0, 200) > : +- *(1) HashAggregate(keys=[name#0], functions=[], output=[name#0]) > : +- *(1) Project [name#0] > : +- *(1) DataSourceV2Scan [name#0, id#1], DefaultReader(List([candy,1], > [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6])) > +- *(4) HashAggregate(keys=[name#4], functions=[], output=[name#4]) > +- ReusedExchange [name#4], Exchange hashpartitioning(name#0, 200){noformat} > > This seems to be fixed in 2.4.x, but affects the 2.3.x versions.
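The reuse bug described in this report can be sketched in plain Python: if exchange results are cached under a canonicalized plan key that omits which DataSourceV2 instance backs the scan, two same-shaped subplans collide and the union returns the first source's rows twice (an illustration only, not Spark's ReuseExchange rule):

```python
def run_exchange(plan_key, compute, cache):
    """Reuse a previously computed exchange when the plan key matches."""
    if plan_key in cache:
        return cache[plan_key]
    cache[plan_key] = compute()
    return cache[plan_key]


products = ["candy", "chocolate", "milk"]
users = ["andy", "alice", "mike"]

# Buggy: the key ignores the source's identity, so the users branch silently
# reuses the products exchange -- N copies of the first source, as above.
cache = {}
buggy = (run_exchange("Aggregate(name) <- Scan", lambda: products, cache)
         + run_exchange("Aggregate(name) <- Scan", lambda: users, cache))

# Fixed: including source identity in the key keeps the exchanges distinct.
cache = {}
fixed = (run_exchange("Aggregate(name) <- Scan(products)", lambda: products, cache)
         + run_exchange("Aggregate(name) <- Scan(users)", lambda: users, cache))
```

Here `buggy` yields the products list twice, matching the duplicated output in the report, while `fixed` yields products followed by users.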
[jira] [Commented] (SPARK-29549) Union of DataSourceV2 datasources leads to duplicated results
[ https://issues.apache.org/jira/browse/SPARK-29549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961596#comment-16961596 ] Hyukjin Kwon commented on SPARK-29549: -- There will be no more releases in 2.3.x line. > Union of DataSourceV2 datasources leads to duplicated results > - > > Key: SPARK-29549 > URL: https://issues.apache.org/jira/browse/SPARK-29549 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4 >Reporter: Miguel Molina >Priority: Major > > Hello! > I've discovered that when two DataSourceV2 data frames in a query of the > exact same shape are joined and there is an aggregation in the query, only > the first results are used. The rest get removed by the ReuseExchange rule > and reuse the results of the first data frame, leading to N times the first > data frame results. > > I've put together a repository with an example project where this can be > reproduced: [https://github.com/erizocosmico/spark-union-issue] > > Basically, doing this: > > {code:java} > val products = spark.sql("SELECT name, COUNT(*) as count FROM products GROUP > BY name") > val users = spark.sql("SELECT name, COUNT(*) as count FROM users GROUP BY > name") > products.union(users) > .select("name") > .show(truncate = false, numRows = 50){code} > > > Where products is: > {noformat} > +-+---+ > |name |id | > +-+---+ > |candy |1 | > |chocolate|2 | > |milk |3 | > |cinnamon |4 | > |pizza |5 | > |pineapple|6 | > +-+---+{noformat} > And users is: > {noformat} > +---+---+ > |name |id | > +---+---+ > |andy |1 | > |alice |2 | > |mike |3 | > |mariah |4 | > |eleanor|5 | > |matthew|6 | > +---+---+ {noformat} > > Results are incorrect: > {noformat} > +-+ > |name | > +-+ > |candy | > |pizza | > |chocolate| > |cinnamon | > |pineapple| > |milk | > |candy | > |pizza | > |chocolate| > |cinnamon | > |pineapple| > |milk | > +-+{noformat} > > This is the plan explained: > > {noformat} > == Parsed Logical Plan == > 'Project 
[unresolvedalias('name, None)] > +- AnalysisBarrier > +- Union > :- Aggregate [name#0], [name#0, count(1) AS count#8L] > : +- SubqueryAlias products > : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], > [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6])) > +- Aggregate [name#4], [name#4, count(1) AS count#12L] > +- SubqueryAlias users > +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], > [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6])) > == Analyzed Logical Plan == > name: string > Project [name#0] > +- Union > :- Aggregate [name#0], [name#0, count(1) AS count#8L] > : +- SubqueryAlias products > : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], > [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6])) > +- Aggregate [name#4], [name#4, count(1) AS count#12L] > +- SubqueryAlias users > +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], > [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6])) > == Optimized Logical Plan == > Union > :- Aggregate [name#0], [name#0] > : +- Project [name#0] > : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], > [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6])) > +- Aggregate [name#4], [name#4] > +- Project [name#4] > +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], > [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6])) > == Physical Plan == > Union > :- *(2) HashAggregate(keys=[name#0], functions=[], output=[name#0]) > : +- Exchange hashpartitioning(name#0, 200) > : +- *(1) HashAggregate(keys=[name#0], functions=[], output=[name#0]) > : +- *(1) Project [name#0] > : +- *(1) DataSourceV2Scan [name#0, id#1], DefaultReader(List([candy,1], > [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6])) > +- *(4) HashAggregate(keys=[name#4], functions=[], output=[name#4]) > +- ReusedExchange [name#4], Exchange hashpartitioning(name#0, 200) > {noformat} > > > 
In the physical plan, the first exchange is reused, but it shouldn't be > because both sources are not the same. > > {noformat} > == Physical Plan == > Union > :- *(2) HashAggregate(keys=[name#0], functions=[], output=[name#0]) > : +- Exchange hashpartitioning(name#0, 200) > : +- *(1) HashAggregate(keys=[name#0], functions=[], output=[name#0]) > : +- *(1) Project [name#0] > : +- *(1) DataSourceV2Scan [name#0, id#1], DefaultReader(List([candy,1], > [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6])) > +- *(4) HashAggregate(keys=[name#4], functions=[], output=[name#4]) > +- ReusedExchange [name#4], Exchange hashpartitioning(name#0, 200){noformat} > > This seems to be
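The mechanism behind SPARK-29549 can be modeled outside Spark. The sketch below is plain Python, not Spark's ReuseExchange code, and all names in it are invented for illustration: it shows how caching subplan results keyed purely on plan structure serves the first source's rows for both sides of a union of two distinct but structurally identical sources.

```python
# Simplified model of exchange reuse keyed on plan structure (illustrative only).
# Two distinct sources share the same "plan signature" (here just the schema),
# so a cache keyed on that signature wrongly serves the first source's rows twice.

def plan_signature(source):
    # Spark compares canonicalized plans; in this toy model the "plan"
    # is only the output schema, which is identical for both sources.
    return tuple(source["schema"])

def run_union(sources):
    cache = {}
    results = []
    for src in sources:
        sig = plan_signature(src)
        if sig not in cache:           # the first source populates the cache
            cache[sig] = src["rows"]
        results.extend(cache[sig])     # later sources reuse it incorrectly
    return results

products = {"schema": ["name"], "rows": ["candy", "pizza", "milk"]}
users = {"schema": ["name"], "rows": ["andy", "alice", "mike"]}

out = run_union([products, users])
# out holds the product names twice and no user names, mirroring the
# duplicated results in the report above.
```

In the reported plan, the analogous key is the canonicalized exchange subtree: the two DataSourceV2Scan nodes compare equal under canonicalization, so the second exchange is replaced by ReusedExchange.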
[jira] [Updated] (SPARK-27842) Inconsistent results of Statistics.corr() and PearsonCorrelation.computeCorrelationMatrix()
[ https://issues.apache.org/jira/browse/SPARK-27842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27842: - Affects Version/s: 2.4.3 > Inconsistent results of Statistics.corr() and > PearsonCorrelation.computeCorrelationMatrix() > --- > > Key: SPARK-27842 > URL: https://issues.apache.org/jira/browse/SPARK-27842 > Project: Spark > Issue Type: Bug > Components: ML, Spark Core, Windows >Affects Versions: 2.3.1, 2.4.3 >Reporter: Peter Nijem >Priority: Major > Attachments: vectorList.txt > > > Hi, > I am working with Spark Java API in local mode (1 node, 8 cores). Spark > version as follows in my pom.xml: > *MLLib* > _spark-mllib_2.11_ > _2.3.1_ > *Core* > _spark-core_2.11_ > _2.3.1_ > I am experiencing inconsistent results of correlation when starting my Spark > application with 8 cores vs 1/2/3 cores. > I've created a Main class which reads from a file a list of Vectors; 240 > Vectors, each of length 226. > As you can see, I am first initializing Spark with local[*], running > Correlation, saving the Matrix and stopping Spark. Then, I do the same but > for local[3]. > I am expecting to get the same matrices on both runs. But this is not the > case. The input file is attached. > I tried to compute the correlation using > PearsonCorrelation.computeCorrelationMatrix() but I faced the same issue here > as well. > > In my work, I am dependent on the resulting correlation matrix. Thus, I am > experiencing bad results in my application due to the inconsistent results I > am getting. 
As a workaround, I am now working with local[1] > > > {code:java} > import org.apache.spark.SparkConf; > import org.apache.spark.api.java.JavaSparkContext; > import org.apache.spark.mllib.linalg.DenseVector; > import org.apache.spark.mllib.linalg.Matrix; > import org.apache.spark.mllib.linalg.Vector; > import org.apache.spark.mllib.stat.Statistics; > import org.apache.spark.rdd.RDD; > import org.junit.Assert; > import java.io.BufferedReader; > import java.io.FileReader; > import java.io.IOException; > import java.math.RoundingMode; > import java.text.DecimalFormat; > import java.util.ArrayList; > import java.util.Arrays; > import java.util.List; > import java.util.stream.Collectors; > public class TestSparkCorr { > private static JavaSparkContext ctx; > public static void main(String[] args) { > List<List<Double>> doublesLists = readInputFile(); > List<Vector> resultVectors = getVectorsList(doublesLists); > //=== > initSpark("*"); > RDD<Vector> RDD_vector = ctx.parallelize(resultVectors).rdd(); > Matrix m = Statistics.corr(RDD_vector, "pearson"); > stopSpark(); > //=== > initSpark("3"); > RDD<Vector> RDD_vector_3 = ctx.parallelize(resultVectors).rdd(); > Matrix m3 = Statistics.corr(RDD_vector_3, "pearson"); > stopSpark(); > //=== > Assert.assertEquals(m3, m); > } > private static List<Vector> getVectorsList(List<List<Double>> doublesLists) { > List<Vector> resultVectors = new ArrayList<>(doublesLists.size()); > for (List<Double> vector : doublesLists) { > double[] x = new double[vector.size()]; > for(int i = 0; i < x.length; i++){ > x[i] = vector.get(i); > } > resultVectors.add(new DenseVector(x)); > } > return resultVectors; > } > private static List<List<Double>> readInputFile() { > List<List<Double>> doublesLists = new ArrayList<>(); > try (BufferedReader reader = new BufferedReader(new FileReader( > ".//output//vectorList.txt"))) { > String line = reader.readLine(); > while (line != null) { > String[] splitLine = line.substring(1, line.length() - 2).split(","); > List<Double> doubles = Arrays.stream(splitLine).map(x -> > Double.valueOf(x.trim())).collect(Collectors.toList()); > doublesLists.add(doubles); > // read next line > line = reader.readLine(); > } > } catch (IOException e) { > e.printStackTrace(); > } > return doublesLists; > } > private static void initSpark(String coresNum) { > final SparkConf sparkConf = new SparkConf().setAppName("Span"); > sparkConf.setMaster(String.format("local[%s]", coresNum)); > sparkConf.set("spark.ui.enabled", "false"); > ctx = new JavaSparkContext(sparkConf); > } > private static void stopSpark() { > ctx.stop(); > if(ctx.sc().isStopped()){ > ctx = null; > } > } > } > {code} > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
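A plausible contributor to the core-count-dependent results above (an assumption, not confirmed in the ticket) is that the partition count changes the order in which per-partition partial sums are merged, and double-precision addition is not associative, so covariance and correlation values can differ in the last bits:

```python
# Floating-point addition is not associative: grouping the same three
# terms differently yields different doubles. Merging partial aggregates
# in a different order (e.g. with a different number of partitions) can
# therefore produce slightly different sums and correlation matrices.
a = (0.1 + 0.2) + 0.3   # one grouping of the same three terms
b = 0.1 + (0.2 + 0.3)   # another grouping
print(a == b)           # False
```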
[jira] [Commented] (SPARK-29624) Jenkins fails with "Python versions prior to 2.7 are not supported."
[ https://issues.apache.org/jira/browse/SPARK-29624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961595#comment-16961595 ] Hyukjin Kwon commented on SPARK-29624: -- Awesome [~shaneknapp]! > Jenkins fails with "Python versions prior to 2.7 are not supported." > > > Key: SPARK-29624 > URL: https://issues.apache.org/jira/browse/SPARK-29624 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Blocker > > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112777/console > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112779/console > {code} > ... > + ./dev/run-tests-jenkins > Python versions prior to 2.7 are not supported. > Build step 'Execute shell' marked build as failure > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
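For reference, the message in the build log boils down to a version-tuple guard. A minimal sketch (hypothetical code, not the actual dev/run-tests-jenkins script):

```python
import sys

def check_python_version(version_info=sys.version_info):
    # Tuple comparison: (2, 6, 9) < (2, 7) is True, (3, 6, 8) < (2, 7) is False.
    if version_info < (2, 7):
        print("Python versions prior to 2.7 are not supported.")
        return False
    return True

print(check_python_version((2, 6, 9)))   # False: an old interpreter fails the check
print(check_python_version((3, 6, 8)))   # True: a modern one passes
```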
[jira] [Updated] (SPARK-29625) Spark Structure Streaming Kafka Wrong Reset Offset twice
[ https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandish Kumar HN updated SPARK-29625: - Description: Spark Structure Streaming Kafka Reset Offset twice, once with right offsets and second time with very old offsets [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-151 to offset 0. [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-118 to offset 0. [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-85 to offset 0. ** *[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-52 to offset 122677634.* [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-19 to offset 0. 
** *[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-52 to offset 120504922.* [2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO ContextCleaner: Cleaned accumulator 810 which is causing a Data loss issue. {{[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 ERROR StreamExecution: Query [id = d62ca9e4-6650-454f-8691-a3d576d1e4ba, runId = 3946389f-222b-495c-9ab2-832c0422cbbb] terminated with error [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - java.lang.IllegalStateException: Partition topic-52's offset was changed from 122677598 to 120504922, some data may have been missed. [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - Some data may have been lost because they are not available in Kafka any more; either the [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - data was aged out by Kafka or the topic may have been deleted before all the data in the [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - topic was processed. If you don't want your streaming query to fail on such cases, set the [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - source option "failOnDataLoss" to "false". 
[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at org.apache.spark.sql.kafka010.KafkaSource.org$apache$spark$sql$kafka010$KafkaSource$$reportDataLoss(KafkaSource.scala:329) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:283) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:281) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at scala.collection.AbstractTraversable.filter(Traversable.scala:104) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:281) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$7.apply(StreamExecution.scala:614) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at
[jira] [Updated] (SPARK-29625) Spark Structure Streaming Kafka Wrong Reset Offset twice
[ https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandish Kumar HN updated SPARK-29625: - Summary: Spark Structure Streaming Kafka Wrong Reset Offset twice (was: Spark Stucture Streaming Kafka Reset Offset twice) > Spark Structure Streaming Kafka Wrong Reset Offset twice > > > Key: SPARK-29625 > URL: https://issues.apache.org/jira/browse/SPARK-29625 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Sandish Kumar HN >Priority: Major > > Spark Stucture Streaming Kafka Reset Offset twice, once with right offsets > and second time with very old offsets > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-151 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-118 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-85 to offset 0. 
> ** > *[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 122677634.* > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-19 to offset 0. > ** > *[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 120504922.* > [2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO ContextCleaner: Cleaned accumulator 810 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29625) Spark Stucture Streaming Kafka Reset Offset twice
Sandish Kumar HN created SPARK-29625: Summary: Spark Stucture Streaming Kafka Reset Offset twice Key: SPARK-29625 URL: https://issues.apache.org/jira/browse/SPARK-29625 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.1 Reporter: Sandish Kumar HN Spark Stucture Streaming Kafka Reset Offset twice, once with right offsets and second time with very old offsets [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-151 to offset 0. [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-118 to offset 0. [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-85 to offset 0. ** *[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-52 to offset 122677634.* [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-19 to offset 0. 
** *[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-52 to offset 120504922.* [2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO ContextCleaner: Cleaned accumulator 810 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
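The two highlighted Fetcher lines show partition topic-52 being reset first to offset 122677634 and then, within the same second, back to 120504922, which is what triggers the data-loss error. A hedged sketch (plain Python, not Spark or Kafka code) of scanning such logs for a backwards reset:

```python
import re

# Matches the Kafka Fetcher lines quoted in the report above.
RESET_RE = re.compile(r"Resetting offset for partition (\S+) to offset (\d+)\.")

def find_backward_resets(lines):
    """Return (partition, earlier_offset, later_offset) for every reset
    that moves a partition's offset backwards."""
    last_seen = {}
    regressions = []
    for line in lines:
        m = RESET_RE.search(line)
        if not m:
            continue
        partition, offset = m.group(1), int(m.group(2))
        if partition in last_seen and offset < last_seen[partition]:
            regressions.append((partition, last_seen[partition], offset))
        last_seen[partition] = offset
    return regressions

log = [
    "INFO Fetcher: Resetting offset for partition topic-52 to offset 122677634.",
    "INFO Fetcher: Resetting offset for partition topic-19 to offset 0.",
    "INFO Fetcher: Resetting offset for partition topic-52 to offset 120504922.",
]
print(find_backward_resets(log))  # [('topic-52', 122677634, 120504922)]
```

As the error text in the report notes, setting the source option "failOnDataLoss" to "false" keeps the query running across such resets, at the cost of silently skipping the missing range.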
[jira] [Updated] (SPARK-29625) Spark Structure Streaming Kafka Wrong Reset Offset twice
[ https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandish Kumar HN updated SPARK-29625: - Description: Spark Structure Streaming Kafka Reset Offset twice, once with right offsets and second time with very old offsets [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-151 to offset 0. [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-118 to offset 0. [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-85 to offset 0. ** *[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-52 to offset 122677634.* [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-19 to offset 0. 
** *[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-52 to offset 120504922.* [2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO ContextCleaner: Cleaned accumulator 810 was: Spark Stucture Streaming Kafka Reset Offset twice, once with right offsets and second time with very old offsets [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-151 to offset 0. [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-118 to offset 0. [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-85 to offset 0. ** *[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-52 to offset 122677634.* [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-19 to offset 0. 
** *[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-52 to offset 120504922.* [2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO ContextCleaner: Cleaned accumulator 810 > Spark Structure Streaming Kafka Wrong Reset Offset twice > > > Key: SPARK-29625 > URL: https://issues.apache.org/jira/browse/SPARK-29625 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Sandish Kumar HN >Priority: Major > > Spark Structure Streaming Kafka Reset Offset twice, once with right offsets > and second time with very old offsets > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-151 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-118 to
[jira] [Assigned] (SPARK-29609) DataSourceV2: Support DROP NAMESPACE
[ https://issues.apache.org/jira/browse/SPARK-29609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29609: - Assignee: Terry Kim > DataSourceV2: Support DROP NAMESPACE > > > Key: SPARK-29609 > URL: https://issues.apache.org/jira/browse/SPARK-29609 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > > DROP NAMESPACE needs to support v2 catalogs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29609) DataSourceV2: Support DROP NAMESPACE
[ https://issues.apache.org/jira/browse/SPARK-29609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29609. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26262 [https://github.com/apache/spark/pull/26262] > DataSourceV2: Support DROP NAMESPACE > > > Key: SPARK-29609 > URL: https://issues.apache.org/jira/browse/SPARK-29609 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > DROP NAMESPACE needs to support v2 catalogs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28841) Spark cannot read a relative path containing ":"
[ https://issues.apache.org/jira/browse/SPARK-28841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961412#comment-16961412 ] Shixiong Zhu commented on SPARK-28841: -- [~srowen] Yep. ":" is not a valid char in a HDFS path. But it's valid for some other file systems, such as local file system, or S3AFileSystem, and they allow the user to create such files. I pointed out the codes that hit this issue in HDFS-14762. > Spark cannot read a relative path containing ":" > > > Key: SPARK-28841 > URL: https://issues.apache.org/jira/browse/SPARK-28841 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Shixiong Zhu >Priority: Major > > Reproducer: > {code} > spark.read.parquet("test:test") > {code} > Error: > {code} > java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: test:test > {code} > This is actually a Hadoop issue since the error is thrown from "new > Path("test:test")". I'm creating this ticket to see if we can work around > this issue in Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
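The parsing problem behind SPARK-28841 is easy to see with any URI parser: in a string like test:test, everything before the first colon is taken as a URI scheme, so the remainder looks like a relative path inside an absolute URI. A Python illustration (Hadoop's Path is built on java.net.URI, but the scheme-extraction rule is the same):

```python
from urllib.parse import urlparse

# "test:test" parses as scheme "test" with path "test": the colon is
# interpreted as the scheme separator, not as part of a file name.
parsed = urlparse("test:test")
print(parsed.scheme)  # test
print(parsed.path)    # test

# A name without a colon stays a plain relative path (empty scheme).
print(urlparse("test_test").scheme)  # (empty string)
```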
[jira] [Commented] (SPARK-28841) Spark cannot read a relative path containing ":"
[ https://issues.apache.org/jira/browse/SPARK-28841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961408#comment-16961408 ] Sean R. Owen commented on SPARK-28841: -- Oh I see, OK. So is this mostly a Hadoop-side issue? > Spark cannot read a relative path containing ":" > > > Key: SPARK-28841 > URL: https://issues.apache.org/jira/browse/SPARK-28841 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Shixiong Zhu >Priority: Major > > Reproducer: > {code} > spark.read.parquet("test:test") > {code} > Error: > {code} > java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: test:test > {code} > This is actually a Hadoop issue since the error is thrown from "new > Path("test:test")". I'm creating this ticket to see if we can work around > this issue in Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27592) Set the bucketed data source table SerDe correctly
[ https://issues.apache.org/jira/browse/SPARK-27592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961404#comment-16961404 ] Dongjoon Hyun edited comment on SPARK-27592 at 10/28/19 7:44 PM: - Ping [~Patnaik] since you asked about this on SPARK-29234. Originally, this was registered as an improvement PR, so we didn't backport this to the older branches. However, given the situation, I'm also fine for [~yumwang] to backport these since he is the author of this. BTW, please don't expect a backporting to EOL branches like `branch-2.3`. - [https://spark.apache.org/versioning-policy.html] {quote}Feature release branches will, generally, be maintained with bug fix releases for a period of 18 months. For example, branch 2.3.x is no longer considered maintained as of September 2019, 18 months after the release of 2.3.0 in February 2018. No more 2.3.x releases should be expected after that point, even for bug fixes. {quote} was (Author: dongjoon): Ping [~Patnaik] since you asked about this on SPARK-29234. Originally, this is registered as an improvement PR, so we don't backport this to the older branches. However, given the situation, I'm also fine for [~yumwang] to backporting these since he is the author of this. BTW, please don't expect a backporting to EOL branches like `branch-2.3`. - [https://spark.apache.org/versioning-policy.html] {quote}Feature release branches will, generally, be maintained with bug fix releases for a period of 18 months. For example, branch 2.3.x is no longer considered maintained as of September 2019, 18 months after the release of 2.3.0 in February 2018. No more 2.3.x releases should be expected after that point, even for bug fixes. 
{quote} > Set the bucketed data source table SerDe correctly > -- > > Key: SPARK-27592 > URL: https://issues.apache.org/jira/browse/SPARK-27592 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > We hint Hive using incorrect > InputFormat(org.apache.hadoop.mapred.SequenceFileInputFormat) to read Spark's > Parquet datasource bucket table: > {noformat} > spark-sql> CREATE TABLE t (c1 INT, c2 INT) USING parquet CLUSTERED BY (c1) > SORTED BY (c1) INTO 2 BUCKETS; > 2019-04-29 17:52:05 WARN HiveExternalCatalog:66 - Persisting bucketed data > source table `default`.`t` into Hive metastore in Spark SQL specific format, > which is NOT compatible with Hive. > spark-sql> DESC EXTENDED t; > c1 int NULL > c2 int NULL > # Detailed Table Information > Database default > Table t > Owner yumwang > Created Time Mon Apr 29 17:52:05 CST 2019 > Last Access Thu Jan 01 08:00:00 CST 1970 > Created By Spark 2.4.0 > Type MANAGED > Provider parquet > Num Buckets 2 > Bucket Columns [`c1`] > Sort Columns [`c1`] > Table Properties [transient_lastDdlTime=1556531525] > Location [file:/user/hive/warehouse/t|file:///user/hive/warehouse/t] > Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat > OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat > Storage Properties [serialization.format=1] > {noformat} > We can see incompatible information when creating the table: > {noformat} > WARN HiveExternalCatalog:66 - Persisting bucketed data source table > `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT > compatible with Hive. > {noformat} > But downstream don’t know the compatibility. I'd like to write the write > information of this table to metadata so that each engine decides > compatibility itself. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27592) Set the bucketed data source table SerDe correctly
[ https://issues.apache.org/jira/browse/SPARK-27592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961404#comment-16961404 ] Dongjoon Hyun commented on SPARK-27592: --- Ping [~Patnaik] since you asked about this on SPARK-29234. Originally, this is registered as an improvement PR, so we don't backport this to the older branches. However, given the situation, I'm also fine for [~yumwang] to backporting these since he is the author of this. BTW, please don't expect a backporting to EOL branches like `branch-2.3`. - [https://spark.apache.org/versioning-policy.html] {quote}Feature release branches will, generally, be maintained with bug fix releases for a period of 18 months. For example, branch 2.3.x is no longer considered maintained as of September 2019, 18 months after the release of 2.3.0 in February 2018. No more 2.3.x releases should be expected after that point, even for bug fixes. {quote} > Set the bucketed data source table SerDe correctly > -- > > Key: SPARK-27592 > URL: https://issues.apache.org/jira/browse/SPARK-27592 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > We hint Hive using incorrect > InputFormat(org.apache.hadoop.mapred.SequenceFileInputFormat) to read Spark's > Parquet datasource bucket table: > {noformat} > spark-sql> CREATE TABLE t (c1 INT, c2 INT) USING parquet CLUSTERED BY (c1) > SORTED BY (c1) INTO 2 BUCKETS; > 2019-04-29 17:52:05 WARN HiveExternalCatalog:66 - Persisting bucketed data > source table `default`.`t` into Hive metastore in Spark SQL specific format, > which is NOT compatible with Hive. 
> spark-sql> DESC EXTENDED t; > c1 int NULL > c2 int NULL > # Detailed Table Information > Database default > Table t > Owner yumwang > Created Time Mon Apr 29 17:52:05 CST 2019 > Last Access Thu Jan 01 08:00:00 CST 1970 > Created By Spark 2.4.0 > Type MANAGED > Provider parquet > Num Buckets 2 > Bucket Columns [`c1`] > Sort Columns [`c1`] > Table Properties [transient_lastDdlTime=1556531525] > Location [file:/user/hive/warehouse/t|file:///user/hive/warehouse/t] > Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat > OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat > Storage Properties [serialization.format=1] > {noformat} > We can see incompatible information when creating the table: > {noformat} > WARN HiveExternalCatalog:66 - Persisting bucketed data source table > `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT > compatible with Hive. > {noformat} > But downstream don’t know the compatibility. I'd like to write the write > information of this table to metadata so that each engine decides > compatibility itself. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
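The mismatch described in this issue can be sketched outside of Spark: the metastore entry advertises SequenceFile input/output formats even though the files on disk are Parquet. A minimal plain-Python illustration (the class names are for comparison only; the mapping here is illustrative, not Spark's actual fix):

```python
# Illustrative only: compare what the metastore advertises for a Spark
# bucketed Parquet table against the InputFormat Hive would actually need.
metastore_entry = {
    "Provider": "parquet",
    "InputFormat": "org.apache.hadoop.mapred.SequenceFileInputFormat",
    "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat",
}

# Hive's reader class for Parquet tables, shown only for the comparison.
hive_parquet_input = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"

# Downstream engines only see the metastore entry, so they cannot tell
# that the advertised format does not match the files on disk.
compatible = metastore_entry["InputFormat"] == hive_parquet_input
print(compatible)  # False: Hive is pointed at the wrong reader
```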
[jira] [Commented] (SPARK-28841) Spark cannot read a relative path containing ":"
[ https://issues.apache.org/jira/browse/SPARK-28841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961402#comment-16961402 ] Shixiong Zhu commented on SPARK-28841: -- [~srowen] "?" is to trigger glob pattern codes. It matches any chars. That's not a typo. The code I post here is trying to read files that match the pattern "/tmp/test/foo?bar". But if "/tmp/test/foo:bar" exists, it will fail because of HDFS-14762. > Spark cannot read a relative path containing ":" > > > Key: SPARK-28841 > URL: https://issues.apache.org/jira/browse/SPARK-28841 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Shixiong Zhu >Priority: Major > > Reproducer: > {code} > spark.read.parquet("test:test") > {code} > Error: > {code} > java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: test:test > {code} > This is actually a Hadoop issue since the error is thrown from "new > Path("test:test")". I'm creating this ticket to see if we can work around > this issue in Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
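The underlying ambiguity — a leading `name:` being read as a URI scheme rather than part of a relative path — can be seen with Python's standard URI parser (an analogy to Hadoop's `new Path(...)`, not Spark code):

```python
from urllib.parse import urlparse

# "test:test" is a legal relative *file* name, but as a URI the text before
# the first ":" is taken to be a scheme, so the path is misinterpreted.
parsed = urlparse("test:test")
print(parsed.scheme)  # 'test'  -> the file name fragment became a scheme
print(parsed.path)    # 'test'

# A fully-qualified URI with an explicit scheme avoids the ambiguity.
qualified = urlparse("file:///tmp/test:test")
print(qualified.path)  # '/tmp/test:test'
```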
[jira] [Commented] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format
[ https://issues.apache.org/jira/browse/SPARK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961395#comment-16961395 ] Dongjoon Hyun commented on SPARK-29234: --- [~Patnaik]. As you see, this is closed as `Duplicate`. It would be better to ping us on SPARK-27592. :) > bucketed table created by Spark SQL DataFrame is in SequenceFile format > --- > > Key: SPARK-29234 > URL: https://issues.apache.org/jira/browse/SPARK-29234 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Suchintak Patnaik >Priority: Major > > When we create a bucketed table as follows, its input and output formats are > displayed as SequenceFile. But the files are physically > created in HDFS in the format specified by the user, e.g. > orc, parquet, etc. > df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample") > in Hive, DESCRIBE FORMATTED OrdersExample; > describe formatted ordersExample; > OK > # col_name data_type comment > col array from deserializer > # Storage Information > SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat > Querying the same table in Hive gives an error. > select * from OrdersExample; > OK > Failed with exception java.io.IOException:java.io.IOException: > hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc > not a SequenceFile -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29624) Jenkins fails with "Python versions prior to 2.7 are not supported."
[ https://issues.apache.org/jira/browse/SPARK-29624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29624: - Assignee: Shane Knapp > Jenkins fails with "Python versions prior to 2.7 are not supported." > > > Key: SPARK-29624 > URL: https://issues.apache.org/jira/browse/SPARK-29624 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Blocker > > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112777/console > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112779/console > {code} > ... > + ./dev/run-tests-jenkins > Python versions prior to 2.7 are not supported. > Build step 'Execute shell' marked build as failure > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29624) Jenkins fails with "Python versions prior to 2.7 are not supported."
[ https://issues.apache.org/jira/browse/SPARK-29624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shane Knapp resolved SPARK-29624. - Resolution: Fixed > Jenkins fails with "Python versions prior to 2.7 are not supported." > > > Key: SPARK-29624 > URL: https://issues.apache.org/jira/browse/SPARK-29624 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112777/console > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112779/console > {code} > ... > + ./dev/run-tests-jenkins > Python versions prior to 2.7 are not supported. > Build step 'Execute shell' marked build as failure > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29624) Jenkins fails with "Python versions prior to 2.7 are not supported."
[ https://issues.apache.org/jira/browse/SPARK-29624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961377#comment-16961377 ] Shane Knapp commented on SPARK-29624: - alright, a pull request build has successfully made it past this check: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112790/ resolving now. > Jenkins fails with "Python versions prior to 2.7 are not supported." > > > Key: SPARK-29624 > URL: https://issues.apache.org/jira/browse/SPARK-29624 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112777/console > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112779/console > {code} > ... > + ./dev/run-tests-jenkins > Python versions prior to 2.7 are not supported. > Build step 'Execute shell' marked build as failure > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29624) Jenkins fails with "Python versions prior to 2.7 are not supported."
[ https://issues.apache.org/jira/browse/SPARK-29624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961369#comment-16961369 ] Dongjoon Hyun commented on SPARK-29624: --- Thank you so much! > Jenkins fails with "Python versions prior to 2.7 are not supported." > > > Key: SPARK-29624 > URL: https://issues.apache.org/jira/browse/SPARK-29624 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112777/console > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112779/console > {code} > ... > + ./dev/run-tests-jenkins > Python versions prior to 2.7 are not supported. > Build step 'Execute shell' marked build as failure > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29509) Deduplicate code blocks in Kafka data source
[ https://issues.apache.org/jira/browse/SPARK-29509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-29509. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26158 [https://github.com/apache/spark/pull/26158] > Deduplicate code blocks in Kafka data source > > > Key: SPARK-29509 > URL: https://issues.apache.org/jira/browse/SPARK-29509 > Project: Spark > Issue Type: Task > Components: SQL, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Minor > Fix For: 3.0.0 > > > There're bunch of methods in Kafka data source which have repeated lines in a > method - especially they're tied to the number of fields in writer schema, so > once we add a new field redundant code lines will be increased. This issue > tracks the efforts to deduplicate them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29509) Deduplicate code blocks in Kafka data source
[ https://issues.apache.org/jira/browse/SPARK-29509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin reassigned SPARK-29509: -- Assignee: Jungtaek Lim > Deduplicate code blocks in Kafka data source > > > Key: SPARK-29509 > URL: https://issues.apache.org/jira/browse/SPARK-29509 > Project: Spark > Issue Type: Task > Components: SQL, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Minor > > There're bunch of methods in Kafka data source which have repeated lines in a > method - especially they're tied to the number of fields in writer schema, so > once we add a new field redundant code lines will be increased. This issue > tracks the efforts to deduplicate them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29624) Jenkins fails with "Python versions prior to 2.7 are not supported."
[ https://issues.apache.org/jira/browse/SPARK-29624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961320#comment-16961320 ] Shane Knapp commented on SPARK-29624: - alright, i triggered a job and checked the console, and the restart seemed to fix the PATH variable. both of these above builds failed on amp-jenkins-worker-03, so i'll keep an eye on that worker. > Jenkins fails with "Python versions prior to 2.7 are not supported." > > > Key: SPARK-29624 > URL: https://issues.apache.org/jira/browse/SPARK-29624 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112777/console > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112779/console > {code} > ... > + ./dev/run-tests-jenkins > Python versions prior to 2.7 are not supported. > Build step 'Execute shell' marked build as failure > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29624) Jenkins fails with "Python versions prior to 2.7 are not supported."
[ https://issues.apache.org/jira/browse/SPARK-29624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961290#comment-16961290 ] Shane Knapp commented on SPARK-29624: - nothing changed... the PATH env vars for each worker got borked during the downtime. i'll need to restart jenkins, and will send a note to the list about this. > Jenkins fails with "Python versions prior to 2.7 are not supported." > > > Key: SPARK-29624 > URL: https://issues.apache.org/jira/browse/SPARK-29624 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112777/console > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112779/console > {code} > ... > + ./dev/run-tests-jenkins > Python versions prior to 2.7 are not supported. > Build step 'Execute shell' marked build as failure > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29624) Jenkins fails with "Python versions prior to 2.7 are not supported."
[ https://issues.apache.org/jira/browse/SPARK-29624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961287#comment-16961287 ] Dongjoon Hyun commented on SPARK-29624: --- Hi, [~shaneknapp]. Jenkins server seems to be changed. Could you take a look at this issue? cc [~cloud_fan] and [~hyukjin.kwon] > Jenkins fails with "Python versions prior to 2.7 are not supported." > > > Key: SPARK-29624 > URL: https://issues.apache.org/jira/browse/SPARK-29624 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112777/console > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112779/console > {code} > ... > + ./dev/run-tests-jenkins > Python versions prior to 2.7 are not supported. > Build step 'Execute shell' marked build as failure > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29624) Jenkins fails with "Python versions prior to 2.7 are not supported."
Dongjoon Hyun created SPARK-29624: - Summary: Jenkins fails with "Python versions prior to 2.7 are not supported." Key: SPARK-29624 URL: https://issues.apache.org/jira/browse/SPARK-29624 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 3.0.0 Reporter: Dongjoon Hyun - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112777/console - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112779/console {code} ... + ./dev/run-tests-jenkins Python versions prior to 2.7 are not supported. Build step 'Execute shell' marked build as failure {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29521) LOAD DATA INTO TABLE should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29521. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26178 [https://github.com/apache/spark/pull/26178] > LOAD DATA INTO TABLE should look up catalog/table like v2 commands > -- > > Key: SPARK-29521 > URL: https://issues.apache.org/jira/browse/SPARK-29521 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.0.0 > > > LOAD DATA INTO TABLE should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22000) org.codehaus.commons.compiler.CompileException: toString method is not declared
[ https://issues.apache.org/jira/browse/SPARK-22000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961171#comment-16961171 ] Sean R. Owen commented on SPARK-22000: -- As indicated, it's fixed in 3.0, not 2.4.x. The change is substantial so is hard to back-port, but I'd review a change that includes the three linked PRs above in 2.4, if it can be made to work. > org.codehaus.commons.compiler.CompileException: toString method is not > declared > --- > > Key: SPARK-22000 > URL: https://issues.apache.org/jira/browse/SPARK-22000 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: taiho choi >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > Attachments: testcase.zip > > > the error message say that toString is not declared on "value13" which is > "long" type in generated code. > i think value13 should be Long type. > ==error message > Caused by: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 70, Column 32: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 70, Column 32: A method named "toString" is not declared in any enclosing > class nor any supertype, nor through a static import > /* 033 */ private void apply1_2(InternalRow i) { > /* 034 */ > /* 035 */ > /* 036 */ boolean isNull11 = i.isNullAt(1); > /* 037 */ UTF8String value11 = isNull11 ? 
null : (i.getUTF8String(1)); > /* 038 */ boolean isNull10 = true; > /* 039 */ java.lang.String value10 = null; > /* 040 */ if (!isNull11) { > /* 041 */ > /* 042 */ isNull10 = false; > /* 043 */ if (!isNull10) { > /* 044 */ > /* 045 */ Object funcResult4 = null; > /* 046 */ funcResult4 = value11.toString(); > /* 047 */ > /* 048 */ if (funcResult4 != null) { > /* 049 */ value10 = (java.lang.String) funcResult4; > /* 050 */ } else { > /* 051 */ isNull10 = true; > /* 052 */ } > /* 053 */ > /* 054 */ > /* 055 */ } > /* 056 */ } > /* 057 */ javaBean.setApp(value10); > /* 058 */ > /* 059 */ > /* 060 */ boolean isNull13 = i.isNullAt(12); > /* 061 */ long value13 = isNull13 ? -1L : (i.getLong(12)); > /* 062 */ boolean isNull12 = true; > /* 063 */ java.lang.String value12 = null; > /* 064 */ if (!isNull13) { > /* 065 */ > /* 066 */ isNull12 = false; > /* 067 */ if (!isNull12) { > /* 068 */ > /* 069 */ Object funcResult5 = null; > /* 070 */ funcResult5 = value13.toString(); > /* 071 */ > /* 072 */ if (funcResult5 != null) { > /* 073 */ value12 = (java.lang.String) funcResult5; > /* 074 */ } else { > /* 075 */ isNull12 = true; > /* 076 */ } > /* 077 */ > /* 078 */ > /* 079 */ } > /* 080 */ } > /* 081 */ javaBean.setReasonCode(value12); > /* 082 */ > /* 083 */ } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22000) org.codehaus.commons.compiler.CompileException: toString method is not declared
[ https://issues.apache.org/jira/browse/SPARK-22000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961159#comment-16961159 ] Shyam commented on SPARK-22000: --- [~srowen] I am using Spark 2.4.1 and getting the same error. For more details, please check [https://stackoverflow.com/questions/58593215/inserting-into-cassandra-table-from-spark-dataframe-results-in-org-codehaus-comm]. How can this be fixed? > org.codehaus.commons.compiler.CompileException: toString method is not > declared > --- > > Key: SPARK-22000 > URL: https://issues.apache.org/jira/browse/SPARK-22000 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: taiho choi >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > Attachments: testcase.zip > > > the error message say that toString is not declared on "value13" which is > "long" type in generated code. > i think value13 should be Long type. > ==error message > Caused by: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 70, Column 32: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 70, Column 32: A method named "toString" is not declared in any enclosing > class nor any supertype, nor through a static import > /* 033 */ private void apply1_2(InternalRow i) { > /* 034 */ > /* 035 */ > /* 036 */ boolean isNull11 = i.isNullAt(1); > /* 037 */ UTF8String value11 = isNull11 ? 
null : (i.getUTF8String(1)); > /* 038 */ boolean isNull10 = true; > /* 039 */ java.lang.String value10 = null; > /* 040 */ if (!isNull11) { > /* 041 */ > /* 042 */ isNull10 = false; > /* 043 */ if (!isNull10) { > /* 044 */ > /* 045 */ Object funcResult4 = null; > /* 046 */ funcResult4 = value11.toString(); > /* 047 */ > /* 048 */ if (funcResult4 != null) { > /* 049 */ value10 = (java.lang.String) funcResult4; > /* 050 */ } else { > /* 051 */ isNull10 = true; > /* 052 */ } > /* 053 */ > /* 054 */ > /* 055 */ } > /* 056 */ } > /* 057 */ javaBean.setApp(value10); > /* 058 */ > /* 059 */ > /* 060 */ boolean isNull13 = i.isNullAt(12); > /* 061 */ long value13 = isNull13 ? -1L : (i.getLong(12)); > /* 062 */ boolean isNull12 = true; > /* 063 */ java.lang.String value12 = null; > /* 064 */ if (!isNull13) { > /* 065 */ > /* 066 */ isNull12 = false; > /* 067 */ if (!isNull12) { > /* 068 */ > /* 069 */ Object funcResult5 = null; > /* 070 */ funcResult5 = value13.toString(); > /* 071 */ > /* 072 */ if (funcResult5 != null) { > /* 073 */ value12 = (java.lang.String) funcResult5; > /* 074 */ } else { > /* 075 */ isNull12 = true; > /* 076 */ } > /* 077 */ > /* 078 */ > /* 079 */ } > /* 080 */ } > /* 081 */ javaBean.setReasonCode(value12); > /* 082 */ > /* 083 */ } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29623) do not allow multiple unit TO unit statements in interval literal syntax
Wenchen Fan created SPARK-29623: --- Summary: do not allow multiple unit TO unit statements in interval literal syntax Key: SPARK-29623 URL: https://issues.apache.org/jira/browse/SPARK-29623 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29622) do not allow leading 'interval' in the interval string format
Wenchen Fan created SPARK-29622: --- Summary: do not allow leading 'interval' in the interval string format Key: SPARK-29622 URL: https://issues.apache.org/jira/browse/SPARK-29622 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29599) Support pagination for session table in JDBC/ODBC Tab
[ https://issues.apache.org/jira/browse/SPARK-29599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29599. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26253 [https://github.com/apache/spark/pull/26253] > Support pagination for session table in JDBC/ODBC Tab > -- > > Key: SPARK-29599 > URL: https://issues.apache.org/jira/browse/SPARK-29599 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Minor > Fix For: 3.0.0 > > > Support pagination for session table in JDBC/ODBC Session page -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29599) Support pagination for session table in JDBC/ODBC Tab
[ https://issues.apache.org/jira/browse/SPARK-29599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-29599: Assignee: angerszhu > Support pagination for session table in JDBC/ODBC Tab > -- > > Key: SPARK-29599 > URL: https://issues.apache.org/jira/browse/SPARK-29599 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Minor > > Support pagination for session table in JDBC/ODBC Session page -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation
[ https://issues.apache.org/jira/browse/SPARK-29621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suchintak Patnaik updated SPARK-29621: -- Labels: PySpark SparkSQL (was: ) > Querying internal corrupt record column should not be allowed in filter > operation > - > > Key: SPARK-29621 > URL: https://issues.apache.org/jira/browse/SPARK-29621 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Suchintak Patnaik >Priority: Major > Labels: PySpark, SparkSQL > > As per > *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, > _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when > the referenced columns only include the internal corrupt record column"_ > But querying only the internal corrupt record column is still allowed in > the case of a *filter* operation. > from pyspark.sql.types import * > schema = > StructType([StructField("_corrupt_record",StringType(),False),StructField("Name",StringType(),False),StructField("Colour",StringType(),True),StructField("Price",IntegerType(),True),StructField("Quantity",IntegerType(),True)]) > df = spark.read.csv("fruit.csv",schema=schema,mode="PERMISSIVE") > df.filter(df._corrupt_record.isNotNull()).show() # Allowed -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation
Suchintak Patnaik created SPARK-29621: - Summary: Querying internal corrupt record column should not be allowed in filter operation Key: SPARK-29621 URL: https://issues.apache.org/jira/browse/SPARK-29621 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.0 Reporter: Suchintak Patnaik As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"_ But querying only the internal corrupt record column is still allowed in the case of a *filter* operation. from pyspark.sql.types import * schema = StructType([StructField("_corrupt_record",StringType(),False),StructField("Name",StringType(),False),StructField("Colour",StringType(),True),StructField("Price",IntegerType(),True),StructField("Quantity",IntegerType(),True)]) df = spark.read.csv("fruit.csv",schema=schema,mode="PERMISSIVE") df.filter(df._corrupt_record.isNotNull()).show() # Allowed -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
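The check referenced above can be summarized as: a query is rejected only when the *referenced* columns consist solely of the internal corrupt record column. A minimal plain-Python sketch of that rule (hypothetical function name; the real check lives in Spark's Scala analyzer) shows why a filter can slip through if its references are not counted:

```python
def disallowed(referenced_columns, corrupt_col="_corrupt_record"):
    # Spark 2.3+ rejects queries from raw JSON/CSV when the referenced
    # columns are exactly the internal corrupt record column and nothing else.
    return set(referenced_columns) == {corrupt_col}

print(disallowed(["_corrupt_record"]))          # True  -> should be rejected
print(disallowed(["_corrupt_record", "Name"]))  # False -> allowed
# The reported bug: if a filter's column references are not fed into this
# check, df.filter(df._corrupt_record.isNotNull()) passes an empty set.
print(disallowed([]))                           # False -> slips through
```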
[jira] [Commented] (SPARK-29549) Union of DataSourceV2 datasources leads to duplicated results
[ https://issues.apache.org/jira/browse/SPARK-29549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961043#comment-16961043 ] Miguel Molina commented on SPARK-29549: --- Hi [~hyukjin.kwon]. In Spark 2.4 it works correctly, this only affects 2.3.x. > Union of DataSourceV2 datasources leads to duplicated results > - > > Key: SPARK-29549 > URL: https://issues.apache.org/jira/browse/SPARK-29549 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4 >Reporter: Miguel Molina >Priority: Major > > Hello! > I've discovered that when two DataSourceV2 data frames in a query of the > exact same shape are joined and there is an aggregation in the query, only > the first results are used. The rest get removed by the ReuseExchange rule > and reuse the results of the first data frame, leading to N times the first > data frame results. > > I've put together a repository with an example project where this can be > reproduced: [https://github.com/erizocosmico/spark-union-issue] > > Basically, doing this: > > {code:java} > val products = spark.sql("SELECT name, COUNT(*) as count FROM products GROUP > BY name") > val users = spark.sql("SELECT name, COUNT(*) as count FROM users GROUP BY > name") > products.union(users) > .select("name") > .show(truncate = false, numRows = 50){code} > > > Where products is: > {noformat} > +-+---+ > |name |id | > +-+---+ > |candy |1 | > |chocolate|2 | > |milk |3 | > |cinnamon |4 | > |pizza |5 | > |pineapple|6 | > +-+---+{noformat} > And users is: > {noformat} > +---+---+ > |name |id | > +---+---+ > |andy |1 | > |alice |2 | > |mike |3 | > |mariah |4 | > |eleanor|5 | > |matthew|6 | > +---+---+ {noformat} > > Results are incorrect: > {noformat} > +-+ > |name | > +-+ > |candy | > |pizza | > |chocolate| > |cinnamon | > |pineapple| > |milk | > |candy | > |pizza | > |chocolate| > |cinnamon | > |pineapple| > |milk | > +-+{noformat} > > This is the plan explained: > > {noformat} > == Parsed 
Logical Plan == > 'Project [unresolvedalias('name, None)] > +- AnalysisBarrier > +- Union > :- Aggregate [name#0], [name#0, count(1) AS count#8L] > : +- SubqueryAlias products > : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], > [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6])) > +- Aggregate [name#4], [name#4, count(1) AS count#12L] > +- SubqueryAlias users > +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], > [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6])) > == Analyzed Logical Plan == > name: string > Project [name#0] > +- Union > :- Aggregate [name#0], [name#0, count(1) AS count#8L] > : +- SubqueryAlias products > : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], > [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6])) > +- Aggregate [name#4], [name#4, count(1) AS count#12L] > +- SubqueryAlias users > +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], > [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6])) > == Optimized Logical Plan == > Union > :- Aggregate [name#0], [name#0] > : +- Project [name#0] > : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], > [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6])) > +- Aggregate [name#4], [name#4] > +- Project [name#4] > +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], > [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6])) > == Physical Plan == > Union > :- *(2) HashAggregate(keys=[name#0], functions=[], output=[name#0]) > : +- Exchange hashpartitioning(name#0, 200) > : +- *(1) HashAggregate(keys=[name#0], functions=[], output=[name#0]) > : +- *(1) Project [name#0] > : +- *(1) DataSourceV2Scan [name#0, id#1], DefaultReader(List([candy,1], > [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6])) > +- *(4) HashAggregate(keys=[name#4], functions=[], output=[name#4]) > +- ReusedExchange [name#4], Exchange 
hashpartitioning(name#0, 200) > {noformat} > > > In the physical plan, the first exchange is reused, but it shouldn't be > because both sources are not the same. > > {noformat} > == Physical Plan == > Union > :- *(2) HashAggregate(keys=[name#0], functions=[], output=[name#0]) > : +- Exchange hashpartitioning(name#0, 200) > : +- *(1) HashAggregate(keys=[name#0], functions=[], output=[name#0]) > : +- *(1) Project [name#0] > : +- *(1) DataSourceV2Scan [name#0, id#1], DefaultReader(List([candy,1], > [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6])) > +- *(4) HashAggregate(keys=[name#4], functions=[], output=[name#4]) > +- ReusedExchange [name#4], Exchange hashpartitioning(name#0,
[jira] [Updated] (SPARK-29620) UnsafeKVExternalSorterSuite failure on bigendian system
[ https://issues.apache.org/jira/browse/SPARK-29620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] salamani updated SPARK-29620: - Description: spark/sql/core# ../../build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite test UnsafeKVExternalSorterSuite: 12:24:24.305 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable - kv sorting key schema [] and value schema [] *** FAILED *** java.lang.AssertionError: sizeInBytes (4) should be a multiple of 8 at org.apache.spark.sql.catalyst.expressions.UnsafeRow.pointTo(UnsafeRow.java:168) at org.apache.spark.sql.execution.UnsafeKVExternalSorter$KVSorterIterator.next(UnsafeKVExternalSorter.java:297) at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite.org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter(UnsafeKVExternalSorterSuite.scala:145) at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply$mcV$sp(UnsafeKVExternalSorterSuite.scala:86) at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86) at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) ... 
- kv sorting key schema [int] and value schema [] *** FAILED *** java.lang.AssertionError: sizeInBytes (20) should be a multiple of 8 at org.apache.spark.sql.catalyst.expressions.UnsafeRow.pointTo(UnsafeRow.java:168) at org.apache.spark.sql.execution.UnsafeKVExternalSorter$KVSorterIterator.next(UnsafeKVExternalSorter.java:297) at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite.org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter(UnsafeKVExternalSorterSuite.scala:145) at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply$mcV$sp(UnsafeKVExternalSorterSuite.scala:86) at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86) at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) ... 
- kv sorting key schema [] and value schema [int] *** FAILED *** java.lang.AssertionError: sizeInBytes (20) should be a multiple of 8 at org.apache.spark.sql.catalyst.expressions.UnsafeRow.pointTo(UnsafeRow.java:168) at org.apache.spark.sql.execution.UnsafeKVExternalSorter$KVSorterIterator.next(UnsafeKVExternalSorter.java:297) at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite.org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter(UnsafeKVExternalSorterSuite.scala:145) at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply$mcV$sp(UnsafeKVExternalSorterSuite.scala:86) at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86) at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) ... - kv sorting key schema [int] and value schema [float,float,double,string,float] *** FAILED *** java.lang.AssertionError: sizeInBytes (2732) should be a multiple of 8 at org.apache.spark.sql.catalyst.expressions.UnsafeRow.pointTo(UnsafeRow.java:168) at org.apache.spark.sql.execution.UnsafeKVExternalSorter$KVSorterIterator.next(UnsafeKVExternalSorter.java:297) at
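Every failure above trips the same assertion in UnsafeRow.pointTo: sizeInBytes must be a multiple of 8, i.e. rounded up to a word boundary. Below is a standalone sketch of that alignment invariant; the helper is illustrative only (Spark's own rounding lives in its unsafe-array utilities), not the code under test.

```java
// Minimal model of the invariant behind the failing assertion:
// UnsafeRow.pointTo requires sizeInBytes to be a multiple of 8 (the JVM
// word size), so serialized row lengths must be rounded up to a word
// boundary. Illustrative only; not Spark's actual implementation.
public class WordAlignment {
    static final int WORD_SIZE = 8;

    static int roundToNearestWord(int numBytes) {
        int remainder = numBytes % WORD_SIZE;
        return remainder == 0 ? numBytes : numBytes + WORD_SIZE - remainder;
    }

    static boolean isWordAligned(int sizeInBytes) {
        return sizeInBytes % WORD_SIZE == 0;
    }

    public static void main(String[] args) {
        // The sizes reported in the failures (4, 20, 2732) all violate
        // the alignment check that pointTo enforces.
        for (int size : new int[] {4, 20, 2732}) {
            System.out.println(size + " aligned=" + isWordAligned(size)
                    + " roundedUp=" + roundToNearestWord(size));
        }
    }
}
```

Whatever produces these lengths on the big-endian system is handing pointTo unpadded sizes; the sketch only shows what a conforming length would look like.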
[jira] [Created] (SPARK-29620) UnsafeKVExternalSorterSuite failure on bigendian system
salamani created SPARK-29620: Summary: UnsafeKVExternalSorterSuite failure on bigendian system Key: SPARK-29620 URL: https://issues.apache.org/jira/browse/SPARK-29620 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 2.4.4 Reporter: salamani -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format
[ https://issues.apache.org/jira/browse/SPARK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961028#comment-16961028 ] Suchintak Patnaik commented on SPARK-29234: --- [~dongjoon] [~yumwang] Are the PRs backported to Spark version 2.3 and 2.4? > bucketed table created by Spark SQL DataFrame is in SequenceFile format > --- > > Key: SPARK-29234 > URL: https://issues.apache.org/jira/browse/SPARK-29234 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Suchintak Patnaik >Priority: Major > > When we create a bucketed table as follows, its input and output formats are > getting displayed as SequenceFile format. But physically the files are > getting created in HDFS as the format specified by the user e.g. > orc,parquet,etc. > df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample") > in Hive, DESCRIBE FORMATTED OrdersExample; > describe formatted ordersExample; > OK > # col_name data_type comment > col array from deserializer > # Storage Information > SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat > Querying the same table in Hive is giving error. > select * from OrdersExample; > OK > Failed with exception java.io.IOException:java.io.IOException: > hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc > not a SequenceFile
[jira] [Resolved] (SPARK-29459) Spark takes lot of time to initialize when the network is not connected to internet.
[ https://issues.apache.org/jira/browse/SPARK-29459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29459. -- Resolution: Incomplete 1.5.2 is too old and an EOL release. Can you try higher versions and see if it works? > Spark takes lot of time to initialize when the network is not connected to > internet. > > > Key: SPARK-29459 > URL: https://issues.apache.org/jira/browse/SPARK-29459 > Project: Spark > Issue Type: Question > Components: ML >Affects Versions: 1.5.2 >Reporter: Raman >Priority: Major > > Hi, I have faced two problems in using Spark ML when my network is not > connected to the internet (airplane mode). All these issues occurred in > windows 10 and java 8 environment. > # Spark by default searches for the internal IP for various operations. > (like hosting spark web ui). There was some problem initializing spark and I > have fixed this issue by setting "spark.driver.host" to "localhost". > # It takes around 120 secs to initialize the spark. May I know that is there > any way to reduce this time. Like, If there is any param/config which related > to timeout can resolve this?
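The workaround in the report above, pinning spark.driver.host, can be made permanent in spark-defaults.conf. A sketch, assuming the goal is simply to stop the driver from attempting hostname/interface discovery while offline; note that spark.driver.bindAddress only exists in Spark 2.1+, so it would not apply to the 1.5.2 version reported here:

```properties
# spark-defaults.conf -- avoid slow hostname/interface discovery when offline.
spark.driver.host        localhost
# Spark 2.1+ only; separates the bind address from the advertised host.
spark.driver.bindAddress 127.0.0.1
```

The same keys can be passed per-job with `--conf spark.driver.host=localhost` on spark-submit.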
[jira] [Commented] (SPARK-29476) Add tooltip information for Thread Dump links and Thread details table columns in Executors Tab
[ https://issues.apache.org/jira/browse/SPARK-29476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961007#comment-16961007 ] Hyukjin Kwon commented on SPARK-29476: -- What does this Jira mean? > Add tooltip information for Thread Dump links and Thread details table > columns in Executors Tab > --- > > Key: SPARK-29476 > URL: https://issues.apache.org/jira/browse/SPARK-29476 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Major >
[jira] [Commented] (SPARK-29477) Improve tooltip information for Streaming Tab
[ https://issues.apache.org/jira/browse/SPARK-29477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961005#comment-16961005 ] Hyukjin Kwon commented on SPARK-29477: -- Please describe what each Jira means clearly, [~abhishek.akg] > Improve tooltip information for Streaming Tab > -- > > Key: SPARK-29477 > URL: https://issues.apache.org/jira/browse/SPARK-29477 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > Active Batches Table and Completed Batches can be re look and put proper tip > for Batch Time and Record column
[jira] [Commented] (SPARK-29590) Support hiding table in JDBC/ODBC server page in WebUI
[ https://issues.apache.org/jira/browse/SPARK-29590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961006#comment-16961006 ] shahid commented on SPARK-29590: Hi [~hyukjin.kwon], This JIRA is similar to the JIRA https://issues.apache.org/jira/browse/SPARK-25575. This is for supporting in JDBC/ODBC server page > Support hiding table in JDBC/ODBC server page in WebUI > -- > > Key: SPARK-29590 > URL: https://issues.apache.org/jira/browse/SPARK-29590 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.4.4, 3.0.0 >Reporter: shahid >Priority: Minor > > Support hiding table in JDBC/ODBC server page in WebUI
[jira] [Updated] (SPARK-29505) desc extended is case sensitive
[ https://issues.apache.org/jira/browse/SPARK-29505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29505: - Description: {code} create table customer(id int, name String, *CName String*, address String, city String, pin int, country String); insert into customer values(1,'Alfred','Maria','Obere Str 57','Berlin',12209,'Germany'); insert into customer values(2,'Ana','trujilo','Adva de la','Maxico D.F.',05021,'Maxico'); insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 2312','Maxico D.F.',05023,'Maxico'); analyze table customer compute statistics for columns cname; – *Success( Though cname is not as CName)* desc extended customer cname; – Failed jdbc:hive2://10.18.19.208:23040/default> desc extended customer *cname;* +-+-+ | info_name | info_value | +-+-+ | col_name | cname | | data_type | string | | comment | NULL | | min | NULL | | max | NULL | | num_nulls | NULL | | distinct_count | NULL | | avg_col_len | NULL | | max_col_len | NULL | | histogram | NULL | +-+-- {code} But {code} desc extended customer CName; – SUCCESS 0: jdbc:hive2://10.18.19.208:23040/default> desc extended customer *CName;* +-+-+ | info_name | info_value | +-+-+ | col_name | CName | | data_type | string | | comment | NULL | | min | NULL | | max | NULL | | num_nulls | 0 | | distinct_count | 3 | | avg_col_len | 9 | | max_col_len | 14 | | histogram | NULL | +-+-+ {code} was: {code} create table customer(id int, name String, *CName String*, address String, city String, pin int, country String); insert into customer values(1,'Alfred','Maria','Obere Str 57','Berlin',12209,'Germany'); insert into customer values(2,'Ana','trujilo','Adva de la','Maxico D.F.',05021,'Maxico'); insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 2312','Maxico D.F.',05023,'Maxico'); analyze table customer compute statistics for columns cname; – *Success( Though cname is not as CName)* desc extended customer cname; – Failed 
jdbc:hive2://10.18.19.208:23040/default> desc extended customer *cname;* +-+-+ | info_name | info_value | +-+-+ | col_name | cname | | data_type | string | | comment | NULL | | min | NULL | | max | NULL | | num_nulls | NULL | | distinct_count | NULL | | avg_col_len | NULL | | max_col_len | NULL | | histogram | NULL | +-+-- {code} But desc extended customer CName; – SUCCESS {code} 0: jdbc:hive2://10.18.19.208:23040/default> desc extended customer *CName;* +-+-+ | info_name | info_value | +-+-+ | col_name | CName | | data_type | string | | comment | NULL | | min | NULL | | max | NULL | | num_nulls | 0 | | distinct_count | 3 | | avg_col_len | 9 | | max_col_len | 14 | | histogram | NULL | +-+-+ {code} > desc extended is case sensitive > -- > > Key: SPARK-29505 > URL: https://issues.apache.org/jira/browse/SPARK-29505 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > {code} > create table customer(id int, name String, *CName String*, address String, > city String, pin int, country String); > insert into customer values(1,'Alfred','Maria','Obere Str > 57','Berlin',12209,'Germany'); > insert into customer values(2,'Ana','trujilo','Adva de la','Maxico > D.F.',05021,'Maxico'); > insert into customer values(3,'Antonio','Antonio Moreno','Mataderos > 2312','Maxico D.F.',05023,'Maxico'); > analyze table customer compute statistics for columns cname; – *Success( > Though cname is not as CName)* > desc extended customer cname; – Failed > jdbc:hive2://10.18.19.208:23040/default> desc extended customer *cname;* > +-+-+ > | info_name | info_value | > +-+-+ > | col_name | cname | > | data_type | string | > | comment | NULL | > | min | NULL | > | max | NULL | > | num_nulls | NULL | > | distinct_count | NULL | > | avg_col_len | NULL | > | max_col_len | NULL | > | histogram | NULL | > +-+-- > {code} > > But > {code} > desc extended customer CName; – SUCCESS > 0: jdbc:hive2://10.18.19.208:23040/default> desc 
extended customer *CName;* > +-+-+ > | info_name | info_value | > +-+-+ > | col_name | CName | > | data_type | string | > | comment | NULL | > | min | NULL | > | max | NULL | > | num_nulls | 0 | > | distinct_count | 3 | > | avg_col_len | 9 | > | max_col_len | 14 | > | histogram | NULL | >
[jira] [Updated] (SPARK-29505) desc extended is case sensitive
[ https://issues.apache.org/jira/browse/SPARK-29505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29505: - Description: {code} create table customer(id int, name String, *CName String*, address String, city String, pin int, country String); insert into customer values(1,'Alfred','Maria','Obere Str 57','Berlin',12209,'Germany'); insert into customer values(2,'Ana','trujilo','Adva de la','Maxico D.F.',05021,'Maxico'); insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 2312','Maxico D.F.',05023,'Maxico'); analyze table customer compute statistics for columns cname; – *Success( Though cname is not as CName)* desc extended customer cname; – Failed jdbc:hive2://10.18.19.208:23040/default> desc extended customer *cname;* +-+-+ | info_name | info_value | +-+-+ | col_name | cname | | data_type | string | | comment | NULL | | min | NULL | | max | NULL | | num_nulls | NULL | | distinct_count | NULL | | avg_col_len | NULL | | max_col_len | NULL | | histogram | NULL | +-+-- {code} But desc extended customer CName; – SUCCESS {code} 0: jdbc:hive2://10.18.19.208:23040/default> desc extended customer *CName;* +-+-+ | info_name | info_value | +-+-+ | col_name | CName | | data_type | string | | comment | NULL | | min | NULL | | max | NULL | | num_nulls | 0 | | distinct_count | 3 | | avg_col_len | 9 | | max_col_len | 14 | | histogram | NULL | +-+-+ {code} was: create table customer(id int, name String, *CName String*, address String, city String, pin int, country String); insert into customer values(1,'Alfred','Maria','Obere Str 57','Berlin',12209,'Germany'); insert into customer values(2,'Ana','trujilo','Adva de la','Maxico D.F.',05021,'Maxico'); insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 2312','Maxico D.F.',05023,'Maxico'); analyze table customer compute statistics for columns cname; – *Success( Though cname is not as CName)* desc extended customer cname; – Failed 
jdbc:hive2://10.18.19.208:23040/default> desc extended customer *cname;* +-+-+ | info_name | info_value | +-+-+ | col_name | cname | | data_type | string | | comment | NULL | | min | NULL | | max | NULL | | num_nulls | NULL | | distinct_count | NULL | | avg_col_len | NULL | | max_col_len | NULL | | histogram | NULL | +-+-- But desc extended customer CName; – SUCCESS 0: jdbc:hive2://10.18.19.208:23040/default> desc extended customer *CName;* +-+-+ | info_name | info_value | +-+-+ | col_name | CName | | data_type | string | | comment | NULL | | min | NULL | | max | NULL | | num_nulls | 0 | | distinct_count | 3 | | avg_col_len | 9 | | max_col_len | 14 | | histogram | NULL | +-+-+ > desc extended is case sensitive > -- > > Key: SPARK-29505 > URL: https://issues.apache.org/jira/browse/SPARK-29505 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > {code} > create table customer(id int, name String, *CName String*, address String, > city String, pin int, country String); > insert into customer values(1,'Alfred','Maria','Obere Str > 57','Berlin',12209,'Germany'); > insert into customer values(2,'Ana','trujilo','Adva de la','Maxico > D.F.',05021,'Maxico'); > insert into customer values(3,'Antonio','Antonio Moreno','Mataderos > 2312','Maxico D.F.',05023,'Maxico'); > analyze table customer compute statistics for columns cname; – *Success( > Though cname is not as CName)* > desc extended customer cname; – Failed > jdbc:hive2://10.18.19.208:23040/default> desc extended customer *cname;* > +-+-+ > | info_name | info_value | > +-+-+ > | col_name | cname | > | data_type | string | > | comment | NULL | > | min | NULL | > | max | NULL | > | num_nulls | NULL | > | distinct_count | NULL | > | avg_col_len | NULL | > | max_col_len | NULL | > | histogram | NULL | > +-+-- > {code} > > But > desc extended customer CName; – SUCCESS > {code} > 0: jdbc:hive2://10.18.19.208:23040/default> desc extended customer 
*CName;* > +-+-+ > | info_name | info_value | > +-+-+ > | col_name | CName | > | data_type | string | > | comment | NULL | > | min | NULL | > | max | NULL | > | num_nulls | 0 | > | distinct_count | 3 | > | avg_col_len | 9 | > | max_col_len | 14 | > | histogram | NULL | > +-+-+ > {code} >
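The asymmetry reported above (ANALYZE accepts "cname" while DESC EXTENDED only matched the stored spelling "CName") comes down to which resolver each command uses when looking up a column. Below is a minimal standalone sketch of the two resolution modes; the class and method names are made up for illustration, though Spark's analyzer does use resolver functions along these lines, governed by spark.sql.caseSensitive.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.BiPredicate;

// Minimal model of column-name resolution: the same lookup succeeds or
// fails depending on which resolver a command uses. Illustrative only.
public class ColumnResolution {
    static final BiPredicate<String, String> CASE_SENSITIVE = String::equals;
    static final BiPredicate<String, String> CASE_INSENSITIVE = String::equalsIgnoreCase;

    static Optional<String> findColumn(List<String> columns, String name,
                                       BiPredicate<String, String> resolver) {
        return columns.stream().filter(c -> resolver.test(c, name)).findFirst();
    }

    public static void main(String[] args) {
        List<String> schema = Arrays.asList("id", "name", "CName", "address");
        // ANALYZE-style lookup: case-insensitive, so "cname" resolves to "CName".
        System.out.println(findColumn(schema, "cname", CASE_INSENSITIVE)); // Optional[CName]
        // A case-sensitive lookup misses it, mirroring the failing DESC EXTENDED.
        System.out.println(findColumn(schema, "cname", CASE_SENSITIVE));   // Optional.empty
    }
}
```

The bug report amounts to the two commands disagreeing on which of these resolvers to apply.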
[jira] [Updated] (SPARK-29510) JobGroup ID is not set for the job submitted from Spark-SQL and Spark -Shell
[ https://issues.apache.org/jira/browse/SPARK-29510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29510: - Description: When user submit jobs from Spark-shell or SQL Job group id is not set. UI Screen shot attached. But from beeline job Group ID is set. Steps: {code:java} create table customer(id int, name String, CName String, address String, city String, pin int, country String); insert into customer values(1,'Alfred','Maria','Obere Str 57','Berlin',12209,'Germany'); insert into customer values(2,'Ana','trujilo','Adva de la','Maxico D.F.',05021,'Maxico'); insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 2312','Maxico D.F.',05023,'Maxico'); {code} was: When user submit jobs from Spark-shell or SQL Job group id is not set. UI Screen shot attached. But from beeline job Group ID is set. Steps: create table customer(id int, name String, CName String, address String, city String, pin int, country String); insert into customer values(1,'Alfred','Maria','Obere Str 57','Berlin',12209,'Germany'); insert into customer values(2,'Ana','trujilo','Adva de la','Maxico D.F.',05021,'Maxico'); insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 2312','Maxico D.F.',05023,'Maxico'); > JobGroup ID is not set for the job submitted from Spark-SQL and Spark -Shell > > > Key: SPARK-29510 > URL: https://issues.apache.org/jira/browse/SPARK-29510 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > Attachments: JobGroup1.png, JobGroup2.png, JobGroup3.png > > > When user submit jobs from Spark-shell or SQL Job group id is not set. UI > Screen shot attached. > But from beeline job Group ID is set. 
> Steps: > {code:java} > create table customer(id int, name String, CName String, address String, city > String, pin int, country String); > insert into customer values(1,'Alfred','Maria','Obere Str > 57','Berlin',12209,'Germany'); > insert into customer values(2,'Ana','trujilo','Adva de la','Maxico > D.F.',05021,'Maxico'); > insert into customer values(3,'Antonio','Antonio Moreno','Mataderos > 2312','Maxico D.F.',05023,'Maxico'); {code} >
[jira] [Commented] (SPARK-29534) Hanging tasks in DiskBlockObjectWriter.commitAndGet while calling native FileDispatcherImpl.size0
[ https://issues.apache.org/jira/browse/SPARK-29534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961001#comment-16961001 ] Hyukjin Kwon commented on SPARK-29534: -- Can you post the codes you ran as well? > Hanging tasks in DiskBlockObjectWriter.commitAndGet while calling native > FileDispatcherImpl.size0 > -- > > Key: SPARK-29534 > URL: https://issues.apache.org/jira/browse/SPARK-29534 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.2 >Reporter: Bogdan >Priority: Major > > Tasks are hanging in native _FileDispatcherImpl.size0_ with calling site from > _DiskBlockObjectWriter.commitAndGet_. Maybe enhance _DiskBlockObjectWriter_ > to handle this type of behaviour? > > The behaviour is consistent and happens on a daily basis. It has been > temporarily addressed by using speculative execution. However, for longer > tasks (1h+) the impact in runtime is significant. > > |sun.nio.ch.FileDispatcherImpl.size0(Native Method) > sun.nio.ch.FileDispatcherImpl.size(FileDispatcherImpl.java:88) > sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:264) => holding > Monitor(java.lang.Object@853892634}) > > org.apache.spark.storage.DiskBlockObjectWriter.commitAndGet(DiskBlockObjectWriter.scala:183) > > org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:204) > > org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:272) > org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65) > > org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:403) > > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:267) > > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:188) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) > 
org.apache.spark.scheduler.Task.run(Task.scala:121) > > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > java.lang.Thread.run(Thread.java:748)|
[jira] [Commented] (SPARK-29549) Union of DataSourceV2 datasources leads to duplicated results
[ https://issues.apache.org/jira/browse/SPARK-29549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960999#comment-16960999 ] Hyukjin Kwon commented on SPARK-29549: -- Hi [~erizocosmico]. Spark 2.3.x is EOL. Can you try it out in 2.4? > Union of DataSourceV2 datasources leads to duplicated results > - > > Key: SPARK-29549 > URL: https://issues.apache.org/jira/browse/SPARK-29549 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4 >Reporter: Miguel Molina >Priority: Major > > Hello! > I've discovered that when two DataSourceV2 data frames in a query of the > exact same shape are joined and there is an aggregation in the query, only > the first results are used. The rest get removed by the ReuseExchange rule > and reuse the results of the first data frame, leading to N times the first > data frame results. > > I've put together a repository with an example project where this can be > reproduced: [https://github.com/erizocosmico/spark-union-issue] > > Basically, doing this: > > {code:java} > val products = spark.sql("SELECT name, COUNT(*) as count FROM products GROUP > BY name") > val users = spark.sql("SELECT name, COUNT(*) as count FROM users GROUP BY > name") > products.union(users) > .select("name") > .show(truncate = false, numRows = 50){code} > > > Where products is: > {noformat} > +-+---+ > |name |id | > +-+---+ > |candy |1 | > |chocolate|2 | > |milk |3 | > |cinnamon |4 | > |pizza |5 | > |pineapple|6 | > +-+---+{noformat} > And users is: > {noformat} > +---+---+ > |name |id | > +---+---+ > |andy |1 | > |alice |2 | > |mike |3 | > |mariah |4 | > |eleanor|5 | > |matthew|6 | > +---+---+ {noformat} > > Results are incorrect: > {noformat} > +-+ > |name | > +-+ > |candy | > |pizza | > |chocolate| > |cinnamon | > |pineapple| > |milk | > |candy | > |pizza | > |chocolate| > |cinnamon | > |pineapple| > |milk | > +-+{noformat} > > This is the plan explained: > > {noformat} > == Parsed Logical Plan 
== > 'Project [unresolvedalias('name, None)] > +- AnalysisBarrier > +- Union > :- Aggregate [name#0], [name#0, count(1) AS count#8L] > : +- SubqueryAlias products > : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], > [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6])) > +- Aggregate [name#4], [name#4, count(1) AS count#12L] > +- SubqueryAlias users > +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], > [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6])) > == Analyzed Logical Plan == > name: string > Project [name#0] > +- Union > :- Aggregate [name#0], [name#0, count(1) AS count#8L] > : +- SubqueryAlias products > : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], > [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6])) > +- Aggregate [name#4], [name#4, count(1) AS count#12L] > +- SubqueryAlias users > +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], > [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6])) > == Optimized Logical Plan == > Union > :- Aggregate [name#0], [name#0] > : +- Project [name#0] > : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], > [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6])) > +- Aggregate [name#4], [name#4] > +- Project [name#4] > +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], > [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6])) > == Physical Plan == > Union > :- *(2) HashAggregate(keys=[name#0], functions=[], output=[name#0]) > : +- Exchange hashpartitioning(name#0, 200) > : +- *(1) HashAggregate(keys=[name#0], functions=[], output=[name#0]) > : +- *(1) Project [name#0] > : +- *(1) DataSourceV2Scan [name#0, id#1], DefaultReader(List([candy,1], > [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6])) > +- *(4) HashAggregate(keys=[name#4], functions=[], output=[name#4]) > +- ReusedExchange [name#4], Exchange hashpartitioning(name#0, 200) > 
{noformat} > > > In the physical plan, the first exchange is reused, but it shouldn't be > because both sources are not the same. > > {noformat} > == Physical Plan == > Union > :- *(2) HashAggregate(keys=[name#0], functions=[], output=[name#0]) > : +- Exchange hashpartitioning(name#0, 200) > : +- *(1) HashAggregate(keys=[name#0], functions=[], output=[name#0]) > : +- *(1) Project [name#0] > : +- *(1) DataSourceV2Scan [name#0, id#1], DefaultReader(List([candy,1], > [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6])) > +- *(4) HashAggregate(keys=[name#4], functions=[], output=[name#4]) > +- ReusedExchange [name#4], Exchange hashpartitioning(name#0, 200){noformat} >
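One way to picture the bug reported above: ReuseExchange replaces an exchange when its canonicalized plan compares equal to one already seen, so if the canonical form of a DataSourceV2 scan fails to capture which reader backs it, two different sources compare equal and the second exchange is wrongly swapped for a ReusedExchange. The sketch below is a deliberately simplified model of that failure mode (a reuse map keyed on schema alone), not Spark's actual canonicalization rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of exchange reuse: plans are deduplicated by a key.
// If the key is too coarse -- here, the schema alone, ignoring which
// reader backs the scan -- two scans over different data collapse into
// one, and the union returns the first scan's rows twice.
public class ReuseModel {
    static class Scan {
        final List<String> schema;
        final List<String> rows;
        Scan(List<String> schema, List<String> rows) { this.schema = schema; this.rows = rows; }
    }

    static List<String> unionWithReuse(List<Scan> scans) {
        Map<List<String>, Scan> seen = new HashMap<>(); // key = schema only (too coarse!)
        List<String> out = new ArrayList<>();
        for (Scan scan : scans) {
            Scan chosen = seen.computeIfAbsent(scan.schema, k -> scan);
            out.addAll(chosen.rows); // second scan "reuses" the first one's output
        }
        return out;
    }

    public static void main(String[] args) {
        Scan products = new Scan(Arrays.asList("name"), Arrays.asList("candy", "pizza", "milk"));
        Scan users    = new Scan(Arrays.asList("name"), Arrays.asList("andy", "alice", "mike"));
        // Both scans share a schema, so the users scan is wrongly replaced by
        // the products scan -- the duplicated result reported above.
        System.out.println(unionWithReuse(Arrays.asList(products, users)));
    }
}
```

The fix in Spark amounts to making the reuse key fine-grained enough to distinguish the two readers.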
[jira] [Commented] (SPARK-29561) Large Case Statement Code Generation OOM
[ https://issues.apache.org/jira/browse/SPARK-29561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960996#comment-16960996 ] Hyukjin Kwon commented on SPARK-29561: -- Seems like it was just a memory leak. Does it work when you increase the memory? > Large Case Statement Code Generation OOM > > > Key: SPARK-29561 > URL: https://issues.apache.org/jira/browse/SPARK-29561 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Michael Chen >Priority: Major > Attachments: apacheSparkCase.sql > > > Spark Configuration > spark.driver.memory = 1g > spark.master = "local" > spark.deploy.mode = "client" > Try to execute a case statement with 3000+ branches. Added sql statement as > attachment > Spark runs for a while before it OOM > {noformat} > java.lang.OutOfMemoryError: GC overhead limit exceeded > at > org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:182) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1320) > at > org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:178) > at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:73) > 19/10/22 16:19:54 ERROR FileFormatWriter: Aborting job null. 
> java.lang.OutOfMemoryError: GC overhead limit exceeded > at java.util.HashMap.newNode(HashMap.java:1750) > at java.util.HashMap.putVal(HashMap.java:631) > at java.util.HashMap.putMapEntries(HashMap.java:515) > at java.util.HashMap.putAll(HashMap.java:785) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3345) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:212) > at > org.codehaus.janino.UnitCompiler$8.visitLocalVariableDeclarationStatement(UnitCompiler.java:3230) > at > org.codehaus.janino.UnitCompiler$8.visitLocalVariableDeclarationStatement(UnitCompiler.java:3198) > at > org.codehaus.janino.Java$LocalVariableDeclarationStatement.accept(Java.java:3351) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3197) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3254) > at org.codehaus.janino.UnitCompiler.access$3900(UnitCompiler.java:212) > at org.codehaus.janino.UnitCompiler$8.visitBlock(UnitCompiler.java:3216) > at org.codehaus.janino.UnitCompiler$8.visitBlock(UnitCompiler.java:3198) > at org.codehaus.janino.Java$Block.accept(Java.java:2756) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3197) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3260) > at org.codehaus.janino.UnitCompiler.access$4000(UnitCompiler.java:212) > at > org.codehaus.janino.UnitCompiler$8.visitDoStatement(UnitCompiler.java:3217) > at > org.codehaus.janino.UnitCompiler$8.visitDoStatement(UnitCompiler.java:3198) > at org.codehaus.janino.Java$DoStatement.accept(Java.java:3304) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3197) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3186) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3009) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336) > at > 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1309) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:799) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:958) > at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:212) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:393) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:385) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1286) > 19/10/22 16:19:54 ERROR Utils: throw uncaught fatal error in thread Spark > Context Cleaner > java.lang.OutOfMemoryError: GC overhead limit exceeded > at > org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:182) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1320) > at > org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:178) > at > org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:73){noformat} > Generated code looks like > {noformat} > /* 029 */ private void project_doConsume(InternalRow scan_row, UTF8String >
[jira] [Commented] (SPARK-29573) Spark should work as PostgreSQL when using + Operator
[ https://issues.apache.org/jira/browse/SPARK-29573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960995#comment-16960995 ] Hyukjin Kwon commented on SPARK-29573: -- [~abhishek.akg] can you please use \{code\} ... \{code\} block? > Spark should work as PostgreSQL when using + Operator > - > > Key: SPARK-29573 > URL: https://issues.apache.org/jira/browse/SPARK-29573 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > Spark and PostgreSQL result is different when concatenating as below : > Spark : Giving NULL result > {code} > 0: jdbc:hive2://10.18.19.208:23040/default> select * from emp12; > +-+-+ > | id | name | > +-+-+ > | 20 | test | > | 10 | number | > +-+-+ > 2 rows selected (3.683 seconds) > 0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as > address from emp12; > +-+--+ > | ID | address | > +-+--+ > | 20 | NULL | > | 10 | NULL | > +-+--+ > 2 rows selected (0.649 seconds) > 0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as > address from emp12; > +-+--+ > | ID | address | > +-+--+ > | 20 | NULL | > | 10 | NULL | > +-+--+ > 2 rows selected (0.406 seconds) > 0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+','+name as > address from emp12; > +-+--+ > | ID | address | > +-+--+ > | 20 | NULL | > | 10 | NULL | > +-+--+ > {code} > > PostgreSQL: Saying throwing Error saying not supported > {code} > create table emp12(id int,name varchar(255)); > insert into emp12 values(10,'number'); > insert into emp12 values(20,'test'); > select id as ID, id+','+name as address from emp12; > output: invalid input syntax for integer: "," > create table emp12(id int,name varchar(255)); > insert into emp12 values(10,'number'); > insert into emp12 values(20,'test'); > select id as ID, id+name as address from emp12; > Output: 42883: operator does not exist: integer + character varying > {code} > > It should throw Error 
in Spark if it is not supported. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
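The divergence reported above can be sketched in a few lines of plain Python (this is a model of the behavior, not Spark itself, and the helper name is made up): Spark's non-ANSI `+` implicitly casts the string operand to a number, an unparseable string becomes NULL, and NULL propagates through the addition, whereas PostgreSQL rejects the expression outright.

```python
def lenient_plus(a, b):
    """Model Spark's lenient (non-ANSI) `+` on int + string: the string
    operand is cast to a number; an unparseable string becomes NULL (None),
    and NULL propagates through the arithmetic."""
    def to_num(v):
        if isinstance(v, (int, float)):
            return v
        try:
            return float(v)
        except (TypeError, ValueError):
            return None  # lenient cast of a non-numeric string yields NULL

    x, y = to_num(a), to_num(b)
    return None if x is None or y is None else x + y

print(lenient_plus(10, "number"))  # None -> rendered as NULL, as in the report
print(lenient_plus(10, "2.5"))     # 12.5
```

Presumably the PostgreSQL-like behavior the reporter asks for would mean raising an error instead of returning None in the unparseable case; Spark 3.0's ANSI mode (`spark.sql.ansi.enabled`) reportedly moves in that direction, though that is outside what this thread confirms.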
[jira] [Updated] (SPARK-29573) Spark should work as PostgreSQL when using + Operator
[ https://issues.apache.org/jira/browse/SPARK-29573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29573: - Description: Spark and PostgreSQL result is different when concatenating as below : Spark : Giving NULL result {code} 0: jdbc:hive2://10.18.19.208:23040/default> select * from emp12; +-+-+ | id | name | +-+-+ | 20 | test | | 10 | number | +-+-+ 2 rows selected (3.683 seconds) 0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as address from emp12; +-+--+ | ID | address | +-+--+ | 20 | NULL | | 10 | NULL | +-+--+ 2 rows selected (0.649 seconds) 0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as address from emp12; +-+--+ | ID | address | +-+--+ | 20 | NULL | | 10 | NULL | +-+--+ 2 rows selected (0.406 seconds) 0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+','+name as address from emp12; +-+--+ | ID | address | +-+--+ | 20 | NULL | | 10 | NULL | +-+--+ {code} PostgreSQL: Saying throwing Error saying not supported {code} create table emp12(id int,name varchar(255)); insert into emp12 values(10,'number'); insert into emp12 values(20,'test'); select id as ID, id+','+name as address from emp12; output: invalid input syntax for integer: "," create table emp12(id int,name varchar(255)); insert into emp12 values(10,'number'); insert into emp12 values(20,'test'); select id as ID, id+name as address from emp12; Output: 42883: operator does not exist: integer + character varying {code} It should throw Error in Spark if it is not supported. 
was: Spark and PostgreSQL result is different when concatenating as below : Spark : Giving NULL result 0: jdbc:hive2://10.18.19.208:23040/default> select * from emp12; +-+-+ | id | name | +-+-+ | 20 | test | | 10 | number | +-+-+ 2 rows selected (3.683 seconds) 0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as address from emp12; +-+--+ | ID | address | +-+--+ | 20 | NULL | | 10 | NULL | +-+--+ 2 rows selected (0.649 seconds) 0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as address from emp12; +-+--+ | ID | address | +-+--+ | 20 | NULL | | 10 | NULL | +-+--+ 2 rows selected (0.406 seconds) 0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+','+name as address from emp12; +-+--+ | ID | address | +-+--+ | 20 | NULL | | 10 | NULL | +-+--+ PostgreSQL: Saying throwing Error saying not supported create table emp12(id int,name varchar(255)); insert into emp12 values(10,'number'); insert into emp12 values(20,'test'); select id as ID, id+','+name as address from emp12; output: invalid input syntax for integer: "," create table emp12(id int,name varchar(255)); insert into emp12 values(10,'number'); insert into emp12 values(20,'test'); select id as ID, id+name as address from emp12; Output: 42883: operator does not exist: integer + character varying It should throw Error in Spark if it is not supported. 
> Spark should work as PostgreSQL when using + Operator > - > > Key: SPARK-29573 > URL: https://issues.apache.org/jira/browse/SPARK-29573 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > Spark and PostgreSQL result is different when concatenating as below : > Spark : Giving NULL result > {code} > 0: jdbc:hive2://10.18.19.208:23040/default> select * from emp12; > +-+-+ > | id | name | > +-+-+ > | 20 | test | > | 10 | number | > +-+-+ > 2 rows selected (3.683 seconds) > 0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as > address from emp12; > +-+--+ > | ID | address | > +-+--+ > | 20 | NULL | > | 10 | NULL | > +-+--+ > 2 rows selected (0.649 seconds) > 0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as > address from emp12; > +-+--+ > | ID | address | > +-+--+ > | 20 | NULL | > | 10 | NULL | > +-+--+ > 2 rows selected (0.406 seconds) > 0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+','+name as > address from emp12; > +-+--+ > | ID | address | > +-+--+ > | 20 | NULL | > | 10 | NULL | > +-+--+ > {code} > > PostgreSQL: Saying throwing Error saying not supported > {code} > create table emp12(id int,name varchar(255)); > insert into emp12 values(10,'number'); > insert into emp12 values(20,'test'); > select id as ID,
[jira] [Updated] (SPARK-29575) from_json can produce nulls for fields which are marked as non-nullable
[ https://issues.apache.org/jira/browse/SPARK-29575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29575: - Component/s: SQL > from_json can produce nulls for fields which are marked as non-nullable > --- > > Key: SPARK-29575 > URL: https://issues.apache.org/jira/browse/SPARK-29575 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.4 >Reporter: Victor Lopez >Priority: Major > > I believe this issue was resolved elsewhere > (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this > bug seems to still be there. > The issue appears when using {{from_json}} to parse a column in a Spark > dataframe. It seems like {{from_json}} ignores whether the schema provided > has any {{nullable:False}} property. > {code:java} > schema = T.StructType().add(T.StructField('id', T.LongType(), > nullable=False)).add(T.StructField('name', T.StringType(), nullable=False)) > data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': > 'jane'})}] > df = spark.read.json(sc.parallelize(data)) > df.withColumn("details", F.from_json("user", > schema)).select("details.*").show() > {code} >
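The enforcement the reporter expected from `nullable=False` can be sketched in plain Python (a hypothetical post-parse check, not Spark's `from_json`; the function name is made up): parse the JSON, then reject records whose required fields are missing or null instead of silently producing nulls.

```python
import json

def parse_with_required(payload, required):
    """Parse a JSON record and enforce non-nullable fields: any required
    key that is absent or null makes the record invalid, rather than
    silently surfacing as NULL the way from_json does in the report."""
    record = json.loads(payload)
    missing = [k for k in required if record.get(k) is None]
    if missing:
        raise ValueError("non-nullable fields missing: %s" % missing)
    return record

print(parse_with_required('{"name": "joe", "id": 1}', ["id", "name"]))
# second record from the report lacks "id", so it would raise ValueError:
# parse_with_required('{"name": "jane"}', ["id", "name"])
```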
[jira] [Commented] (SPARK-29590) Support hiding table in JDBC/ODBC server page in WebUI
[ https://issues.apache.org/jira/browse/SPARK-29590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960992#comment-16960992 ] Hyukjin Kwon commented on SPARK-29590: -- [~shahid] what does this JIRA mean? > Support hiding table in JDBC/ODBC server page in WebUI > -- > > Key: SPARK-29590 > URL: https://issues.apache.org/jira/browse/SPARK-29590 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.4.4, 3.0.0 >Reporter: shahid >Priority: Minor > > Support hiding table in JDBC/ODBC server page in WebUI
[jira] [Commented] (SPARK-29598) Support search option for statistics in JDBC/ODBC server page
[ https://issues.apache.org/jira/browse/SPARK-29598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960991#comment-16960991 ] Hyukjin Kwon commented on SPARK-29598: -- [~jobitmathew] can you please elaborate what this JIRA means? > Support search option for statistics in JDBC/ODBC server page > - > > Key: SPARK-29598 > URL: https://issues.apache.org/jira/browse/SPARK-29598 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.4.4, 3.0.0 >Reporter: jobit mathew >Priority: Minor > > Support search option for statistics in JDBC/ODBC server page
[jira] [Commented] (SPARK-29600) array_contains built in function is not backward compatible in 3.0
[ https://issues.apache.org/jira/browse/SPARK-29600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960988#comment-16960988 ] Hyukjin Kwon commented on SPARK-29600: -- Please go ahead but please just don't copy and paste. Narrow down the problem with analysis otherwise no one can investigate. > array_contains built in function is not backward compatible in 3.0 > -- > > Key: SPARK-29600 > URL: https://issues.apache.org/jira/browse/SPARK-29600 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); throws > Exception in 3.0 where as in 2.3.2 is working fine. > Spark 3.0 output: > 0: jdbc:hive2://10.18.19.208:23040/default> SELECT > array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); > Error: org.apache.spark.sql.AnalysisException: cannot resolve > 'array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1BD AS DECIMAL(13,3)), > CAST(0.2BD AS DECIMAL(13,3)), CAST(0.3BD AS DECIMAL(13,3)), CAST(0.5BD AS > DECIMAL(13,3)), CAST(0.02BD AS DECIMAL(13,3)), CAST(0.033BD AS > DECIMAL(13,3))), 0.2BD)' due to data type mismatch: Input to function > array_contains should have been array followed by a value with same element > type, but it's [array, decimal(1,1)].; line 1 pos 7; > 'Project [unresolvedalias(array_contains(array(cast(0 as decimal(13,3)), > cast(0.1 as decimal(13,3)), cast(0.2 as decimal(13,3)), cast(0.3 as > decimal(13,3)), cast(0.5 as decimal(13,3)), cast(0.02 as decimal(13,3)), > cast(0.033 as decimal(13,3))), 0.2), None)] > +- OneRowRelation (state=,code=0) > 0: jdbc:hive2://10.18.19.208:23040/default> ne 1 pos 7; > Error: org.apache.spark.sql.catalyst.parser.ParseException: > > mismatched input 'ne' expecting \{'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', > 'CLEAR', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', > 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 
'LOAD', > 'LOCK', 'MAP', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', > 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', > 'UNLOCK', 'USE', 'VALUES', 'WITH'}(line 1, pos 0) > == SQL == > ne 1 pos 7 > ^^^ (state=,code=0) > > Spark 2.3.2 output > 0: jdbc:hive2://10.18.18.214:23040/default> SELECT > array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); > +-+--+ > | array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1 AS DECIMAL(13,3)), > CAST(0.2 AS DECIMAL(13,3)), CAST(0.3 AS DECIMAL(13,3)), CAST(0.5 AS > DECIMAL(13,3)), CAST(0.02 AS DECIMAL(13,3)), CAST(0.033 AS DECIMAL(13,3))), > CAST(0.2 AS DECIMAL(13,3))) | > +-+--+ > | true | > +-+--+ > 1 row selected (0.18 seconds) > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
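The type-mismatch error above comes down to decimal scale: the array elements are cast to DECIMAL(13,3) while the bare literal `.2` is inferred as DECIMAL(1,1), and in 3.0 the differing element types keep `array_contains` from resolving. Python's `decimal` module exposes the same scale distinction (this is an illustration of the scale mismatch, not Spark's resolver):

```python
from decimal import Decimal

# The bare literal .2 carries scale 1, like DECIMAL(1,1); quantizing to
# three fractional digits models CAST(.2 AS DECIMAL(13,3)), the element
# type of the array in the failing query.
literal = Decimal("0.2")                       # scale 1
element = literal.quantize(Decimal("0.001"))   # scale 3

print(literal.as_tuple().exponent, element.as_tuple().exponent)  # -1 -3
print(literal == element)  # True: numerically equal, only the declared scale differs
```

Assuming the error message's hint is the intended remedy, an explicit cast in the query, e.g. `array_contains(array(...), CAST(.2 AS DECIMAL(13,3)))`, would presumably align the element types, though the thread does not confirm that workaround.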
[jira] [Commented] (SPARK-29601) JDBC ODBC Tab Statement column should be provided ellipsis for bigger SQL statement
[ https://issues.apache.org/jira/browse/SPARK-29601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960987#comment-16960987 ] Hyukjin Kwon commented on SPARK-29601: -- Please go ahead but [~abhishek.akg] can you please don't copy and paste the full query? It's impossible to reproduce and debug. > JDBC ODBC Tab Statement column should be provided ellipsis for bigger SQL > statement > --- > > Key: SPARK-29601 > URL: https://issues.apache.org/jira/browse/SPARK-29601 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > Statement column in JDBC/ODBC gives whole SQL statement and page size > increases. > Suppose user submit and TPCDS Queries, then Page it display whole Query under > statement and User Experience is not good. > Expected: > It should display the ... Ellipsis and on clicking the stmt. it should Expand > display the whole SQL Statement. > e.g. > SELECT * FROM (SELECT count(*) h8_30_to_9 FROM store_sales, > household_demographics, time_dim, store WHERE ss_sold_time_sk = > time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND > ss_store_sk = s_store_sk AND time_dim.t_hour = 8 AND time_dim.t_minute >= 30 > AND ((household_demographics.hd_dep_count = 4 AND > household_demographics.hd_vehicle_count <= 4 + 2) OR > (household_demographics.hd_dep_count = 2 AND > household_demographics.hd_vehicle_count <= 2 + 2) OR > (household_demographics.hd_dep_count = 0 AND > household_demographics.hd_vehicle_count <= 0 + 2)) AND store.s_store_name = > 'ese') s1, (SELECT count(*) h9_to_9_30 FROM store_sales, > household_demographics, time_dim, store WHERE ss_sold_time_sk = > time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND > ss_store_sk = s_store_sk AND time_dim.t_hour = 9 AND time_dim.t_minute < 30 > AND ((household_demographics.hd_dep_count = 4 AND > household_demographics.hd_vehicle_count <= 4 + 2) OR > 
(household_demographics.hd_dep_count = 2 AND > household_demographics.hd_vehicle_count <= 2 + 2) OR > (household_demographics.hd_dep_count = 0 AND > household_demographics.hd_vehicle_count <= 0 + 2)) AND store.s_store_name = > 'ese') s2, (SELECT count(*) h9_30_to_10 FROM store_sales, > household_demographics, time_dim, store WHERE ss_sold_time_sk = > time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND > ss_store_sk = s_store_sk AND time_dim.t_hour = 9 AND time_dim.t_minute >= 30 > AND ((household_demographics.hd_dep_count = 4 AND > household_demographics.hd_vehicle_count <= 4 + 2) OR > (household_demographics.hd_dep_count = 2 AND > household_demographics.hd_vehicle_count <= 2 + 2) OR > (household_demographics.hd_dep_count = 0 AND > household_demographics.hd_vehicle_count <= 0 + 2)) AND store.s_store_name = > 'ese') s3,(SELECT count(*) h10_to_10_30 FROM store_sales, > household_demographics, time_dim, store WHERE ss_sold_time_sk = > time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND > ss_store_sk = s_store_sk AND time_dim.t_hour = 10 AND time_dim.t_minute < 30 > AND ((household_demographics.hd_dep_count = 4 AND > household_demographics.hd_vehicle_count <= 4 + 2) OR > (household_demographics.hd_dep_count = 2 AND > household_demographics.hd_vehicle_count <= 2 + 2) OR > (household_demographics.hd_dep_count = 0 AND > household_demographics.hd_vehicle_count <= 0 + 2)) AND store.s_store_name = > 'ese') s4,(SELECT count(*) h10_30_to_11 FROM > store_sales,household_demographics, time_dim, store WHERE ss_sold_time_sk = > time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND > ss_store_sk = s_store_sk AND time_dim.t_hour = 10 AND time_dim.t_minute >= 30 > AND ((household_demographics.hd_dep_count = 4 AND > household_demographics.hd_vehicle_count <= 4 + 2) OR > (household_demographics.hd_dep_count = 2 AND > household_demographics.hd_vehicle_count <= 2 + 2) OR > (household_demographics.hd_dep_count = 0 AND > 
household_demographics.hd_vehicle_count <= 0 + 2)) AND store.s_store_name = > 'ese') s5,(SELECT count(*) h11_to_11_30 FROM store_sales, > household_demographics, time_dim, store WHERE ss_sold_time_sk = > time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND > ss_store_sk = s_store_sk AND time_dim.t_hour = 11 AND time_dim.t_minute < 30 > AND ((household_demographics.hd_dep_count = 4 AND > household_demographics.hd_vehicle_count <= 4 + 2) OR > (household_demographics.hd_dep_count = 2 AND > household_demographics.hd_vehicle_count <= 2 + 2) OR > (household_demographics.hd_dep_count = 0 AND > household_demographics.hd_vehicle_count <= 0 + 2)) AND store.s_store_name = > 'ese') s6,(SELECT count(*) h11_30_to_12 FROM
[jira] [Resolved] (SPARK-29602) How does the spark from_json json and dataframe transform ignore the case of the json key
[ https://issues.apache.org/jira/browse/SPARK-29602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29602. -- Resolution: Invalid Please ask questions to stackoverflow or mailing list. > How does the spark from_json json and dataframe transform ignore the case of > the json key > - > > Key: SPARK-29602 > URL: https://issues.apache.org/jira/browse/SPARK-29602 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: ruiliang >Priority: Major > Labels: spark-sql > Original Estimate: 12h > Remaining Estimate: 12h > > How does the spark from_json json and dataframe transform ignore the case of > the json key > code > {code:java} > def main(args: Array[String]): Unit = { > val spark = SparkSession.builder().master("local[*]"). > enableHiveSupport().getOrCreate() > //spark.sqlContext.setConf("spark.sql.caseSensitive", "false") > import spark.implicits._ > //hive table data Lower case automatically when saving > val hivetable = > > """{"deliverysystype":"dms","orderid":"B0001-N103-000-005882-RL3AI2RWCP","storeid":103,"timestamp":1571587522000,"":"dms"}""" > val hiveDF = Seq(hivetable).toDF("msg") > val rdd = hiveDF.rdd.map(_.getString(0)) > val jsonDataDF = spark.read.json(rdd.toDS()) > jsonDataDF.show(false) > > //++---++---+-+ > //||deliverysystype|orderid |storeid|timestamp > | > > //++---++---+-+ > //|dms |dms|B0001-N103-000-005882-RL3AI2RWCP|103 > |1571587522000| > > //++---++---+-+ > val jsonstr = > > """{"data":{"deliverySysType":"dms","orderId":"B0001-N103-000-005882-RL3AI2RWCP","storeId":103,"timestamp":1571587522000},"accessKey":"f9d069861dfb1678","actionName":"candao.rider.getDeliveryInfo","sign":"fa0239c75e065cf43d0a4040665578ba" > }""" > val jsonStrDF = Seq(jsonstr).toDF("msg") > //转换json数据列 action_nameactionName > jsonStrDF.show(false) > val structSeqSchme = StructType(Seq(StructField("data", jsonDataDF.schema, > true), > StructField("accessKey", StringType, true), 
// this should be accessKey > StructField("actionName", StringType, true), > StructField("columnNameOfCorruptRecord", StringType, true) > )) > // the hive column names are lower case while the json data keys mix upper and > lower case, so the values are not picked up > val mapOption = Map("allowBackslashEscapingAnyCharacter" -> "true", > "allowUnquotedControlChars" -> "true", "allowSingleQuotes" -> "true") > // I'm not doing anything here, but I don't know how to set a value, right? > val newDF = jsonStrDF.withColumn("data_col", from_json(col("msg"), > structSeqSchme, mapOption)) > newDF.show(false) > newDF.printSchema() > newDF.select($"data_col.accessKey", $"data_col.actionName", > $"data_col.data.*", $"data_col.columnNameOfCorruptRecord").show(false) > // Lowercase columns do not fetch data. How do you make it ignore lowercase > columns? deliverysystype,storeid -> null > > //++++---+---+---+-+-+ > //|accessKey |actionName > ||deliverysystype|orderid|storeid|timestamp|columnNameOfCorruptRecord| > > //++++---+---+---+-+-+ > //|f9d069861dfb1678|candao.rider.getDeliveryInfo|null|null |null > |null |1571587522000|null | > > //++++---+---+---+-+-+ > } > {code}
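The mismatch in the question above (a lower-cased hive-side schema versus camelCase keys in the incoming JSON) can be worked around by normalizing keys before the schema is applied. A minimal pure-Python sketch of that pre-processing step, assuming lower-casing all keys is acceptable (the helper name is made up):

```python
import json

def lower_keys(obj):
    """Recursively lower-case every dict key so that lookups against a
    lower-cased schema become case-insensitive; lists and scalars pass
    through unchanged."""
    if isinstance(obj, dict):
        return {k.lower(): lower_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [lower_keys(v) for v in obj]
    return obj

raw = '{"data": {"deliverySysType": "dms", "storeId": 103}}'
normalized = lower_keys(json.loads(raw))
print(normalized["data"]["deliverysystype"])  # dms
```

Whether `spark.sql.caseSensitive` (commented out in the reporter's code) affects `from_json` key matching is not settled in this thread; normalizing the payload itself sidesteps the question.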
[jira] [Resolved] (SPARK-29610) Keys with Null values are discarded when using to_json function
[ https://issues.apache.org/jira/browse/SPARK-29610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29610. -- Resolution: Duplicate > Keys with Null values are discarded when using to_json function > --- > > Key: SPARK-29610 > URL: https://issues.apache.org/jira/browse/SPARK-29610 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.4 >Reporter: Jonathan >Priority: Major > > When calling to_json on a Struct if a key has Null as a value then the key is > thrown away. > {code:java} > import pyspark > import pyspark.sql.functions as F > l = [("a", "foo"), ("b", None)] > df = spark.createDataFrame(l, ["id", "data"]) > ( > df.select(F.struct("*").alias("payload")) > .withColumn("payload", > F.to_json(F.col("payload")) > ).select("payload") > .show() > ){code} > Produces the following output: > {noformat} > ++ > | payload| > ++ > |{"id":"a","data":...| > | {"id":"b"}| > ++{noformat} > The `data` key in the second row has just been silently deleted.
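For contrast with the report above, standard JSON serialization keeps keys whose value is null, which is the output the reporter expected from `to_json`. A small sketch using Python's stdlib `json` on the same two rows:

```python
import json

# Same two records as the report; json.dumps emits "data": null for the
# second row instead of dropping the key the way to_json does.
rows = [{"id": "a", "data": "foo"}, {"id": "b", "data": None}]
payloads = [json.dumps(r) for r in rows]
print(payloads[1])  # {"id": "b", "data": null}
```

This makes the discrepancy easy to state: the key survives with an explicit `null` in standard JSON output, while Spark's `to_json` silently omits it.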
[jira] [Resolved] (SPARK-28158) Hive UDFs supports UDT type
[ https://issues.apache.org/jira/browse/SPARK-28158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28158. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 24961 [https://github.com/apache/spark/pull/24961] > Hive UDFs supports UDT type > --- > > Key: SPARK-28158 > URL: https://issues.apache.org/jira/browse/SPARK-28158 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Genmao Yu >Assignee: Genmao Yu >Priority: Minor > Fix For: 3.0.0 > >
[jira] [Assigned] (SPARK-28158) Hive UDFs supports UDT type
[ https://issues.apache.org/jira/browse/SPARK-28158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28158: Assignee: Genmao Yu > Hive UDFs supports UDT type > --- > > Key: SPARK-28158 > URL: https://issues.apache.org/jira/browse/SPARK-28158 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Genmao Yu >Assignee: Genmao Yu >Priority: Minor > >
[jira] [Updated] (SPARK-29528) Upgrade scala-maven-plugin to 4.2.4 for Scala 2.13.1
[ https://issues.apache.org/jira/browse/SPARK-29528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29528: - Fix Version/s: (was: 3.0.0) > Upgrade scala-maven-plugin to 4.2.4 for Scala 2.13.1 > > > Key: SPARK-29528 > URL: https://issues.apache.org/jira/browse/SPARK-29528 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > Scala 2.13.1 seems to break binary compatibility. > We need to upgrade `scala-maven-plugin` to bring the following fixes for > the latest Scala 2.13.1. > - https://github.com/davidB/scala-maven-plugin/issues/363 > - https://github.com/sbt/zinc/issues/698
[jira] [Commented] (SPARK-29528) Upgrade scala-maven-plugin to 4.2.4 for Scala 2.13.1
[ https://issues.apache.org/jira/browse/SPARK-29528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960958#comment-16960958 ] Hyukjin Kwon commented on SPARK-29528: -- Reverted at https://github.com/apache/spark/commit/a8d5134981ef176965101e96beba71a9979a554f > Upgrade scala-maven-plugin to 4.2.4 for Scala 2.13.1 > > > Key: SPARK-29528 > URL: https://issues.apache.org/jira/browse/SPARK-29528 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > Scala 2.13.1 seems to break binary compatibility. > We need to upgrade `scala-maven-plugin` to bring the following fixes for > the latest Scala 2.13.1. > - https://github.com/davidB/scala-maven-plugin/issues/363 > - https://github.com/sbt/zinc/issues/698
[jira] [Reopened] (SPARK-29528) Upgrade scala-maven-plugin to 4.2.4 for Scala 2.13.1
[ https://issues.apache.org/jira/browse/SPARK-29528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-29528: -- > Upgrade scala-maven-plugin to 4.2.4 for Scala 2.13.1 > > > Key: SPARK-29528 > URL: https://issues.apache.org/jira/browse/SPARK-29528 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > Scala 2.13.1 seems to break binary compatibility. > We need to upgrade `scala-maven-plugin` to bring the following fixes for > the latest Scala 2.13.1. > - https://github.com/davidB/scala-maven-plugin/issues/363 > - https://github.com/sbt/zinc/issues/698
[jira] [Updated] (SPARK-29619) Improve the exception message when reading the daemon port and add retry times.
[ https://issues.apache.org/jira/browse/SPARK-29619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-29619: --- Description: In production environment, my pyspark application occurs an exception and it's message as below: {code:java} 19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: No port number in pyspark.daemon's stdout at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:204) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335) at org.apache.spark.rdd.RDD.iterator(RDD.scala:286) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745){code} At first, I thought that many ports on the physical node were occupied by a large number of processes, but I found that only 671 ports were in use. {code:java} [yarn@r1115 ~]$ netstat -a | wc -l 671 {code} I checked the code of PythonWorkerFactory at line 204 and found: {code:java} daemon = pb.start() val in = new DataInputStream(daemon.getInputStream) try { daemonPort = in.readInt() } catch { case _: EOFException => throw new SparkException(s"No port number in $daemonModule's stdout") } {code} I added some code here: {code:java} logError(s"Meet EOFException, daemon is alive: ${daemon.isAlive()}") logError(s"Exit value: ${daemon.exitValue()}") {code} Then I reproduced the exception, and its message is shown below: {code:java} 19/10/28 16:15:03 ERROR PythonWorkerFactory: Meet EOFException, daemon is alive: false 19/10/28 16:15:03 ERROR PythonWorkerFactory: Exit value: 139 19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: No port number in pyspark.daemon's stdout at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:206) at
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) at
[jira] [Created] (SPARK-29619) Improve the exception message when reading the daemon port and add retry times.
jiaan.geng created SPARK-29619: -- Summary: Improve the exception message when reading the daemon port and add retry times. Key: SPARK-29619 URL: https://issues.apache.org/jira/browse/SPARK-29619 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: jiaan.geng In a production environment, my PySpark application throws an exception whose message is shown below: {code:java} 19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: No port number in pyspark.daemon's stdout at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:204) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335) at org.apache.spark.rdd.RDD.iterator(RDD.scala:286) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745){code} At first, I thought that many ports on the physical node were occupied by a large number of processes, but I found that only 671 ports were in use.
{code:java} [yarn@r1115 ~]$ netstat -a | wc -l 671 {code} I checked the code of PythonWorkerFactory at line 204 and found: {code:java} daemon = pb.start() val in = new DataInputStream(daemon.getInputStream) try { daemonPort = in.readInt() } catch { case _: EOFException => throw new SparkException(s"No port number in $daemonModule's stdout") } {code} I added some code here: {code:java} logError(s"Meet EOFException, daemon is alive: ${daemon.isAlive()}") logError(s"Exit value: ${daemon.exitValue()}") {code} Then I reproduced the exception, and its message is shown below: {code:java} 19/10/28 16:15:03 ERROR PythonWorkerFactory: Meet EOFException, daemon is alive: false 19/10/28 16:15:03 ERROR PythonWorkerFactory: Exit value: 139 19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: No port number in pyspark.daemon's stdout at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:206) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182) at
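The "add retry times" improvement SPARK-29619 proposes can be sketched generically: retry the read a bounded number of times and raise an error that says how many attempts were made. The following is a minimal Python sketch of that idea only; the function name, retry count, and 4-byte big-endian wire format are illustrative assumptions, not Spark's actual implementation:

```python
import struct

def read_daemon_port(stream, max_retries=3):
    """Illustrative sketch: try to read a 4-byte big-endian port number
    from a daemon's stdout, retrying a bounded number of times, and
    raise a descriptive error instead of a bare EOF failure."""
    for attempt in range(1, max_retries + 1):
        data = stream.read(4)
        if len(data) == 4:
            # The daemon wrote a complete port number; decode and return it.
            return struct.unpack(">i", data)[0]
        # Short read / EOF: the daemon may have crashed before writing.
    raise EOFError(
        f"No port number in daemon's stdout after {max_retries} attempts; "
        "check whether the daemon process exited prematurely"
    )
```

Reporting the daemon's liveness and exit value alongside such a message, as the reporter's added logError calls do, turns an opaque EOFException into an actionable diagnostic: on Linux, an exit value of 139 typically means the process was killed by SIGSEGV (128 + 11).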
[jira] [Created] (SPARK-29618) remove V1_BATCH_WRITE table capability
Wenchen Fan created SPARK-29618: --- Summary: remove V1_BATCH_WRITE table capability Key: SPARK-29618 URL: https://issues.apache.org/jira/browse/SPARK-29618 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29580) Add kerberos debug messages for Kafka secure tests
[ https://issues.apache.org/jira/browse/SPARK-29580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960869#comment-16960869 ] Gabor Somogyi commented on SPARK-29580: --- [~dongjoon] thanks for taking care, ping me if reproduced and I'll continue the analysis... > Add kerberos debug messages for Kafka secure tests > -- > > Key: SPARK-29580 > URL: https://issues.apache.org/jira/browse/SPARK-29580 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.0.0 > > > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112562/testReport/org.apache.spark.sql.kafka010/KafkaDelegationTokenSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ > {code} > sbt.ForkMain$ForkError: org.apache.kafka.common.KafkaException: Failed to > create new KafkaAdminClient > at > org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:407) > at > org.apache.kafka.clients.admin.AdminClient.create(AdminClient.java:55) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedKafkaServer(KafkaTestUtils.scala:227) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:249) > at > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) > at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) > at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: sbt.ForkMain$ForkError: org.apache.kafka.common.KafkaException: > javax.security.auth.login.LoginException: Server not found in Kerberos > database (7) - Server not found in Kerberos database > at > org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:160) > at > org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:146) > at > org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:67) > at > org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:99) > at > org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:382) > ... 16 more > Caused by: sbt.ForkMain$ForkError: javax.security.auth.login.LoginException: > Server not found in Kerberos database (7) - Server not found in Kerberos > database > at > com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:804) > at > com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755) > at > javax.security.auth.login.LoginContext.access$000(LoginContext.java:195) > at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682) > at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680) > at java.security.AccessController.doPrivileged(Native Method) > at > javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680) > at 
javax.security.auth.login.LoginContext.login(LoginContext.java:587) > at > org.apache.kafka.common.security.authenticator.AbstractLogin.login(AbstractLogin.java:60) > at > org.apache.kafka.common.security.kerberos.KerberosLogin.login(KerberosLogin.java:103) > at > org.apache.kafka.common.security.authenticator.LoginManager.(LoginManager.java:61) > at > org.apache.kafka.common.security.authenticator.LoginManager.acquireLoginManager(LoginManager.java:104) > at >