[jira] [Resolved] (SPARK-29612) ALTER TABLE (RECOVER PARTITIONS) should look up catalog/table like v2 commands

2019-10-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29612.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26269
[https://github.com/apache/spark/pull/26269]

> ALTER TABLE (RECOVER PARTITIONS) should look up catalog/table like v2 commands
> --
>
> Key: SPARK-29612
> URL: https://issues.apache.org/jira/browse/SPARK-29612
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>
> ALTER TABLE (RECOVER PARTITIONS) should look up catalog/table like v2 commands
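For context, a minimal sketch (Scala, with an illustrative table name not taken from this ticket) of the statement form whose table identifier should go through the same catalog/table lookup as other v2 commands:

{code}
// Hypothetical table name; this sub-task only changes how the identifier is resolved.
spark.sql("ALTER TABLE my_db.my_table RECOVER PARTITIONS")
{code}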






[jira] [Assigned] (SPARK-29612) ALTER TABLE (RECOVER PARTITIONS) should look up catalog/table like v2 commands

2019-10-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29612:
---

Assignee: Huaxin Gao

> ALTER TABLE (RECOVER PARTITIONS) should look up catalog/table like v2 commands
> --
>
> Key: SPARK-29612
> URL: https://issues.apache.org/jira/browse/SPARK-29612
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>
> ALTER TABLE (RECOVER PARTITIONS) should look up catalog/table like v2 commands






[jira] [Updated] (SPARK-29573) Spark should work as PostgreSQL when using + Operator

2019-10-28 Thread ABHISHEK KUMAR GUPTA (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ABHISHEK KUMAR GUPTA updated SPARK-29573:
-
Description: 
Spark and PostgreSQL return different results when the + operator is applied to an integer and a string column, as shown below:

{code}
Spark: returns NULL
0: jdbc:hive2://10.18.19.208:23040/default> select * from emp12;
+-----+---------+
| id  | name    |
+-----+---------+
| 20  | test    |
| 10  | number  |
+-----+---------+
2 rows selected (3.683 seconds)
0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as address from emp12;
+-----+----------+
| ID  | address  |
+-----+----------+
| 20  | NULL     |
| 10  | NULL     |
+-----+----------+
2 rows selected (0.649 seconds)
0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as address from emp12;
+-----+----------+
| ID  | address  |
+-----+----------+
| 20  | NULL     |
| 10  | NULL     |
+-----+----------+
2 rows selected (0.406 seconds)
0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+','+name as address from emp12;
+-----+----------+
| ID  | address  |
+-----+----------+
| 20  | NULL     |
| 10  | NULL     |
+-----+----------+

PostgreSQL: throws an error saying the operation is not supported
create table emp12(id int,name varchar(255));
insert into emp12 values(10,'number');
insert into emp12 values(20,'test');
select id as ID, id+','+name as address from emp12;

output: invalid input syntax for integer: ","

create table emp12(id int,name varchar(255));
insert into emp12 values(10,'number');
insert into emp12 values(20,'test');
select id as ID, id+name as address from emp12;

Output: 42883: operator does not exist: integer + character varying
{code}
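A minimal Scala sketch of the Spark side of this difference, assuming a spark-shell session (so implicits are in scope) and illustrative column names: under the default, non-ANSI type coercion the string operand of + is implicitly cast to a numeric type, the failed cast yields NULL, and so does the addition.

{code}
// Reproduces the NULL result above without a Thrift/beeline setup.
val df = Seq((10, "number"), (20, "test")).toDF("id", "name")
df.selectExpr("id", "id + name AS address").show()
// -> address is null for both rows instead of an error
{code}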




 

  was:
Spark and PostgreSQL result is different when concatenating as below :

Spark : Giving NULL result 

{code}
0: jdbc:hive2://10.18.19.208:23040/default> select * from emp12;
+-+-+
| id | name |
+-+-+
| 20 | test |
| 10 | number |
+-+-+
2 rows selected (3.683 seconds)

0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as address 
from emp12;
+-+--+
| ID | address |
+-+--+
| 20 | NULL |
| 10 | NULL |
+-+--+
2 rows selected (0.649 seconds)

0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as address 
from emp12;
+-+--+
| ID | address |
+-+--+
| 20 | NULL |
| 10 | NULL |
+-+--+
2 rows selected (0.406 seconds)

0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+','+name as 
address from emp12;
+-+--+
| ID | address |
+-+--+
| 20 | NULL |
| 10 | NULL |
+-+--+
{code}

 

PostgreSQL: Saying throwing Error saying not supported

{code}
create table emp12(id int,name varchar(255));
insert into emp12 values(10,'number');
insert into emp12 values(20,'test');
select id as ID, id+','+name as address from emp12;

output: invalid input syntax for integer: ","

create table emp12(id int,name varchar(255));
insert into emp12 values(10,'number');
insert into emp12 values(20,'test');
select id as ID, id+name as address from emp12;

Output: 42883: operator does not exist: integer + character varying
{code}

 

It should throw Error in Spark if it is not supported.

 


> Spark should work as PostgreSQL when using + Operator
> -
>
> Key: SPARK-29573
> URL: https://issues.apache.org/jira/browse/SPARK-29573
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> Spark and PostgreSQL result is different when concatenating as below :
> {code}
> Spark : Giving NULL result 
> 0: jdbc:hive2://10.18.19.208:23040/default> select * from emp12;
> +-+-+
> | id | name |
> +-+-+
> | 20 | test |
> | 10 | number |
> +-+-+
> 2 rows selected (3.683 seconds)
> 0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as 
> address from emp12;
> +-+--+
> | ID | address |
> +-+--+
> | 20 | NULL |
> | 10 | NULL |
> +-+--+
> 2 rows selected (0.649 seconds)
> 0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as 
> address from emp12;
> +-+--+
> | ID | address |
> +-+--+
> | 20 | NULL |
> | 10 | NULL |
> +-+--+
> 2 rows selected (0.406 seconds)
> 0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+','+name as 
> address from emp12;
> +-+--+
> | ID | address |
> +-+--+
> | 20 | NULL |
> | 10 | NULL |
> +-+--+
> PostgreSQL: Saying throwing Error saying not supported
> create table emp12(id int,name varchar(255));
> insert into emp12 values(10,'number');
> insert into emp12 values(20,'test');
> select id as ID, id+','+name as address from emp12;
> output: invalid input 

[jira] [Commented] (SPARK-24918) Executor Plugin API

2019-10-28 Thread Brandon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961672#comment-16961672
 ] 

Brandon commented on SPARK-24918:
-

[~nsheth] placing the plugin class inside a jar and passing it via `--jars` to 
spark-submit should be sufficient, right? It seems this is not enough to make the 
class visible to the executor. I have had to explicitly add the jar to 
`spark.executor.extraClassPath` for plugins to load correctly.
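For readers of this thread, a rough sketch of what such a plugin and its wiring look like; the class and jar names are placeholders, and the config key assumes the spark.executor.plugins setting added by this ticket:

{code}
// Hypothetical plugin; runs inside each executor JVM.
import org.apache.spark.ExecutorPlugin

class ProbePlugin extends ExecutorPlugin {
  // Called once when the executor starts.
  override def init(): Unit = println("ProbePlugin loaded")

  // Called when the executor shuts down.
  override def shutdown(): Unit = println("ProbePlugin shutting down")
}
{code}

Loading it would then look roughly like spark-submit --jars probe-plugin.jar --conf spark.executor.plugins=ProbePlugin --conf spark.executor.extraClassPath=probe-plugin.jar ..., which matches the observation above that --jars alone may not make the class visible when the plugin is instantiated.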

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Nihar Sheth
>Priority: Major
>  Labels: SPIP, memory-analysis
> Fix For: 2.4.0
>
>
> It would be nice if we could specify an arbitrary class to run within each 
> executor for debugging and instrumentation.  It's hard to do this currently 
> because:
> a) you have no idea when executors will come and go with DynamicAllocation, 
> so don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your spark 
> app itself to run a special task to "install" the plugin, which is often 
> tough in production cases when those maintaining regularly running 
> applications might not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a 
> debugging context to understand memory use, just by re-running an application 
> with extra command line arguments (as opposed to rebuilding spark).
> I think one tricky part here is just deciding the API, and how it's versioned. 
>  Does it just get created when the executor starts, and that's it?  Or does it 
> get more specific events, like task start, task end, etc?  Would we ever add 
> more events?  It should definitely be a {{DeveloperApi}}, so breaking 
> compatibility would be allowed ... but still should be avoided.  We could 
> create a base class that has no-op implementations, or explicitly version 
> everything.
> Note that this is not needed in the driver as we already have SparkListeners 
> (even if you don't care about the SparkListenerEvents and just want to 
> inspect objects in the JVM, it's still good enough).






[jira] [Updated] (SPARK-29601) JDBC ODBC Tab Statement column should be provided ellipsis for bigger SQL statement

2019-10-28 Thread ABHISHEK KUMAR GUPTA (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ABHISHEK KUMAR GUPTA updated SPARK-29601:
-
Description: 
The Statement column on the JDBC/ODBC tab shows the whole SQL statement, so the page size grows.

If a user submits TPC-DS queries, the page displays the entire query text under the Statement column, which is a poor user experience.

Expected:

It should display an ellipsis (...) and, on clicking the statement, expand to show the whole SQL statement.

 

  was:
Statement column in JDBC/ODBC gives whole SQL statement and page size increases.

Suppose user submit and TPCDS Queries, then Page it display whole Query under 
statement and User Experience is not good.

Expected:

It should display the ... Ellipsis and on clicking the stmt. it should Expand 
display the whole SQL Statement.

e.g.

SELECT * FROM (SELECT count(*) h8_30_to_9 FROM store_sales, 
household_demographics, time_dim, store WHERE ss_sold_time_sk = 
time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND 
ss_store_sk = s_store_sk AND time_dim.t_hour = 8 AND time_dim.t_minute >= 30 
AND ((household_demographics.hd_dep_count = 4 AND 
household_demographics.hd_vehicle_count <= 4 + 2) OR 
(household_demographics.hd_dep_count = 2 AND 
household_demographics.hd_vehicle_count <= 2 + 2) OR 
(household_demographics.hd_dep_count = 0 AND 
household_demographics.hd_vehicle_count <= 0 + 2)) AND store.s_store_name = 
'ese') s1, (SELECT count(*) h9_to_9_30 FROM store_sales, 
household_demographics, time_dim, store WHERE ss_sold_time_sk = 
time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND 
ss_store_sk = s_store_sk AND time_dim.t_hour = 9 AND time_dim.t_minute < 30 AND 
((household_demographics.hd_dep_count = 4 AND 
household_demographics.hd_vehicle_count <= 4 + 2) OR 
(household_demographics.hd_dep_count = 2 AND 
household_demographics.hd_vehicle_count <= 2 + 2) OR 
(household_demographics.hd_dep_count = 0 AND 
household_demographics.hd_vehicle_count <= 0 + 2)) AND store.s_store_name = 
'ese') s2, (SELECT count(*) h9_30_to_10 FROM store_sales, 
household_demographics, time_dim, store WHERE ss_sold_time_sk = 
time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND 
ss_store_sk = s_store_sk AND time_dim.t_hour = 9 AND time_dim.t_minute >= 30 
AND ((household_demographics.hd_dep_count = 4 AND 
household_demographics.hd_vehicle_count <= 4 + 2) OR 
(household_demographics.hd_dep_count = 2 AND 
household_demographics.hd_vehicle_count <= 2 + 2) OR 
(household_demographics.hd_dep_count = 0 AND 
household_demographics.hd_vehicle_count <= 0 + 2)) AND store.s_store_name = 
'ese') s3,(SELECT count(*) h10_to_10_30 FROM store_sales, 
household_demographics, time_dim, store WHERE ss_sold_time_sk = 
time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND 
ss_store_sk = s_store_sk AND time_dim.t_hour = 10 AND time_dim.t_minute < 30 
AND ((household_demographics.hd_dep_count = 4 AND 
household_demographics.hd_vehicle_count <= 4 + 2) OR 
(household_demographics.hd_dep_count = 2 AND 
household_demographics.hd_vehicle_count <= 2 + 2) OR 
(household_demographics.hd_dep_count = 0 AND 
household_demographics.hd_vehicle_count <= 0 + 2)) AND store.s_store_name = 
'ese') s4,(SELECT count(*) h10_30_to_11 FROM 
store_sales,household_demographics, time_dim, store WHERE ss_sold_time_sk = 
time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND 
ss_store_sk = s_store_sk AND time_dim.t_hour = 10 AND time_dim.t_minute >= 30 
AND ((household_demographics.hd_dep_count = 4 AND 
household_demographics.hd_vehicle_count <= 4 + 2) OR 
(household_demographics.hd_dep_count = 2 AND 
household_demographics.hd_vehicle_count <= 2 + 2) OR 
(household_demographics.hd_dep_count = 0 AND 
household_demographics.hd_vehicle_count <= 0 + 2)) AND store.s_store_name = 
'ese') s5,(SELECT count(*) h11_to_11_30 FROM store_sales, 
household_demographics, time_dim, store WHERE ss_sold_time_sk = 
time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND 
ss_store_sk = s_store_sk AND time_dim.t_hour = 11 AND time_dim.t_minute < 30 
AND ((household_demographics.hd_dep_count = 4 AND 
household_demographics.hd_vehicle_count <= 4 + 2) OR 
(household_demographics.hd_dep_count = 2 AND 
household_demographics.hd_vehicle_count <= 2 + 2) OR 
(household_demographics.hd_dep_count = 0 AND 
household_demographics.hd_vehicle_count <= 0 + 2)) AND store.s_store_name = 
'ese') s6,(SELECT count(*) h11_30_to_12 FROM store_sales, 
household_demographics, time_dim, store WHERE ss_sold_time_sk = 
time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND 
ss_store_sk = s_store_sk AND time_dim.t_hour = 11 AND time_dim.t_minute >= 30 
AND ((household_demographics.hd_dep_count = 4 AND 
household_demographics.hd_vehicle_count <= 4 + 2) OR 
(household_demographics.hd_dep_count = 2 AND 

[jira] [Updated] (SPARK-29600) array_contains built in function is not backward compatible in 3.0

2019-10-28 Thread ABHISHEK KUMAR GUPTA (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ABHISHEK KUMAR GUPTA updated SPARK-29600:
-
Description: 
SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); throws an 
exception in 3.0, whereas in 2.3.2 it works fine.

Spark 3.0 output:

0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2);
 Error: org.apache.spark.sql.AnalysisException: cannot resolve 
'array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1BD AS DECIMAL(13,3)), 
CAST(0.2BD AS DECIMAL(13,3)), CAST(0.3BD AS DECIMAL(13,3)), CAST(0.5BD AS 
DECIMAL(13,3)), CAST(0.02BD AS DECIMAL(13,3)), CAST(0.033BD AS DECIMAL(13,3))), 
0.2BD)' due to data type mismatch: Input to function array_contains should have 
been array followed by a value with same element type, but it's 
[array, decimal(1,1)].; line 1 pos 7;
 'Project [unresolvedalias(array_contains(array(cast(0 as decimal(13,3)), 
cast(0.1 as decimal(13,3)), cast(0.2 as decimal(13,3)), cast(0.3 as 
decimal(13,3)), cast(0.5 as decimal(13,3)), cast(0.02 as decimal(13,3)), 
cast(0.033 as decimal(13,3))), 0.2), None)]

Spark 2.3.2 output

0: jdbc:hive2://10.18.18.214:23040/default> SELECT 
array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2);
|array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1 AS DECIMAL(13,3)), 
CAST(0.2 AS DECIMAL(13,3)), CAST(0.3 AS DECIMAL(13,3)), CAST(0.5 AS 
DECIMAL(13,3)), CAST(0.02 AS DECIMAL(13,3)), CAST(0.033 AS DECIMAL(13,3))), 
CAST(0.2 AS DECIMAL(13,3)))|
|true|

1 row selected (0.18 seconds)
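A possible workaround on 3.0, sketched here as an assumption rather than something confirmed in this ticket, is to cast the search value to the array's element type so both arguments of array_contains agree:

{code}
// Hedged workaround sketch; the DECIMAL(13,3) width mirrors the casts shown in the error above.
spark.sql(
  "SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), CAST(0.2 AS DECIMAL(13,3)))"
).show()
{code}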

 

 

  was:
SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); throws 
Exception in 3.0 where as in 2.3.2 is working fine.

Spark 3.0 output:


0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2);
Error: org.apache.spark.sql.AnalysisException: cannot resolve 
'array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1BD AS DECIMAL(13,3)), 
CAST(0.2BD AS DECIMAL(13,3)), CAST(0.3BD AS DECIMAL(13,3)), CAST(0.5BD AS 
DECIMAL(13,3)), CAST(0.02BD AS DECIMAL(13,3)), CAST(0.033BD AS DECIMAL(13,3))), 
0.2BD)' due to data type mismatch: Input to function array_contains should have 
been array followed by a value with same element type, but it's 
[array, decimal(1,1)].; line 1 pos 7;
'Project [unresolvedalias(array_contains(array(cast(0 as decimal(13,3)), 
cast(0.1 as decimal(13,3)), cast(0.2 as decimal(13,3)), cast(0.3 as 
decimal(13,3)), cast(0.5 as decimal(13,3)), cast(0.02 as decimal(13,3)), 
cast(0.033 as decimal(13,3))), 0.2), None)]
+- OneRowRelation (state=,code=0)
0: jdbc:hive2://10.18.19.208:23040/default> ne 1 pos 7;
Error: org.apache.spark.sql.catalyst.parser.ParseException:

 


mismatched input 'ne' expecting \{'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 
'CLEAR', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 
'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 
'LOCK', 'MAP', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 
'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 
'UNLOCK', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)

== SQL ==
ne 1 pos 7
^^^ (state=,code=0)

 

Spark 2.3.2 output

0: jdbc:hive2://10.18.18.214:23040/default> SELECT 
array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2);
+-+--+
| array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1 AS DECIMAL(13,3)), 
CAST(0.2 AS DECIMAL(13,3)), CAST(0.3 AS DECIMAL(13,3)), CAST(0.5 AS 
DECIMAL(13,3)), CAST(0.02 AS DECIMAL(13,3)), CAST(0.033 AS DECIMAL(13,3))), 
CAST(0.2 AS DECIMAL(13,3))) |
+-+--+
| true |
+-+--+
1 row selected (0.18 seconds)

 

 


> array_contains built in function is not backward compatible in 3.0
> --
>
> Key: SPARK-29600
> URL: https://issues.apache.org/jira/browse/SPARK-29600
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); throws 
> 

[jira] [Commented] (SPARK-29477) Improve tooltip information for Streaming Tab

2019-10-28 Thread ABHISHEK KUMAR GUPTA (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961662#comment-16961662
 ] 

ABHISHEK KUMAR GUPTA commented on SPARK-29477:
--

[~hyukjin.kwon]: When a user submits a streaming job, a Streaming tab is displayed 
in the Web UI.

This tab shows streaming statistics as graphs.

Below them there are two tables, Active Batches and Completed Batches.

In the Active Batches table, tooltips can be provided for columns like Batch Time and Records.

The same applies to the columns of the Completed Batches table.

 

 

> Improve tooltip information for Streaming  Tab
> --
>
> Key: SPARK-29477
> URL: https://issues.apache.org/jira/browse/SPARK-29477
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> The Active Batches and Completed Batches tables can be revisited to add proper 
> tooltips for the Batch Time and Records columns.






[jira] [Commented] (SPARK-29222) Flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_parameter_convergence

2019-10-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961636#comment-16961636
 ] 

Hyukjin Kwon commented on SPARK-29222:
--

Can you make a PR to increase the timeout?

> Flaky test: 
> pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_parameter_convergence
> ---
>
> Key: SPARK-29222
> URL: https://issues.apache.org/jira/browse/SPARK-29222
> Project: Spark
>  Issue Type: Test
>  Components: MLlib, Tests
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111237/testReport/]
> {code:java}
> Error Message
> 7 != 10
> StacktraceTraceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 429, in test_parameter_convergence
> self._eventually(condition, catch_assertions=True)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 74, in _eventually
> raise lastValue
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 65, in _eventually
> lastValue = condition()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 425, in condition
> self.assertEqual(len(model_weights), len(batches))
> AssertionError: 7 != 10
>{code}
>  






[jira] [Created] (SPARK-29627) array_contains should allow column instances in PySpark

2019-10-28 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-29627:


 Summary: array_contains should allow column instances in PySpark
 Key: SPARK-29627
 URL: https://issues.apache.org/jira/browse/SPARK-29627
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


Scala API works well with column instances:

{code}
import org.apache.spark.sql.functions._
val df = Seq(Array("a", "b", "c"), Array.empty[String]).toDF("data")
df.select(array_contains($"data", lit("a"))).collect()
{code}

{code}
Array[org.apache.spark.sql.Row] = Array([true], [false])
{code}

However, it seems the PySpark one doesn't:

{code}
from pyspark.sql.functions import array_contains, lit
df = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data'])
df.select(array_contains(df.data, lit("a"))).show()
{code}

{code}
Traceback (most recent call last):
 File "", line 1, in 
 File "/.../spark/python/pyspark/sql/functions.py", line 1950, in array_contains
 return Column(sc._jvm.functions.array_contains(_to_java_column(col), value))
 File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 
1277, in __call__
 File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 
1241, in _build_args
 File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 
1228, in _get_args
 File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_collections.py", 
line 500, in convert
 File "/.../spark/python/pyspark/sql/column.py", line 344, in __iter__
 raise TypeError("Column is not iterable")
TypeError: Column is not iterable
{code}

We should allow column instances in the PySpark API as well.






[jira] [Created] (SPARK-29626) notEqual() should return true when the one is null, the other is not null

2019-10-28 Thread zhouhuazheng (Jira)
zhouhuazheng created SPARK-29626:


 Summary: notEqual() should return true when the one is null, the 
other is not null
 Key: SPARK-29626
 URL: https://issues.apache.org/jira/browse/SPARK-29626
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 2.4.4
Reporter: zhouhuazheng


When one value is null and the other is not null, we hope the notEqual() function returns true.

e.g.:

scala> df.show()
+------+-------+
|   age|   name|
+------+-------+
|  null|Michael|
|    30|   Andy|
|    19| Justin|
|    35|   null|
|    19| Justin|
|  null|   null|
|Justin| Justin|
|    19|     19|
+------+-------+


scala> df.filter(col("age").notEqual(col("name"))).show
+---+------+
|age|  name|
+---+------+
| 30|  Andy|
| 19|Justin|
| 19|Justin|
+---+------+


scala> df.filter(col("age").equalTo(col("name"))).show
+------+------+
|   age|  name|
+------+------+
|  null|  null|
|Justin|Justin|
|    19|    19|
+------+------+
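Until/unless the semantics of notEqual() change, a null-tolerant inequality can be written today with the null-safe equality operator (Column.eqNullSafe, i.e. <=>) negated; a minimal sketch assuming the df above:

{code}
import org.apache.spark.sql.functions.col

// Keeps rows where the two columns differ, treating null as a comparable value,
// so (null, Michael) and (35, null) are returned while (null, null) is not.
df.filter(!(col("age") <=> col("name"))).show()
{code}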






[jira] [Assigned] (SPARK-29566) Imputer should support single-column input/output

2019-10-28 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-29566:


Assignee: Huaxin Gao

> Imputer should support single-column input/output
> 
>
> Key: SPARK-29566
> URL: https://issues.apache.org/jira/browse/SPARK-29566
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Minor
>
> Imputer should support single-column input/output
> refer to https://issues.apache.org/jira/browse/SPARK-29565






[jira] [Resolved] (SPARK-29566) Imputer should support single-column input/output

2019-10-28 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-29566.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26247
[https://github.com/apache/spark/pull/26247]

> Imputer should support single-column input/output
> 
>
> Key: SPARK-29566
> URL: https://issues.apache.org/jira/browse/SPARK-29566
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>
> Imputer should support single-column input/output
> refer to https://issues.apache.org/jira/browse/SPARK-29565
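A minimal sketch of the single-column usage this adds, assuming the 3.0 API gains setInputCol/setOutputCol alongside the existing multi-column setters; the column names and the DataFrame df are illustrative:

{code}
import org.apache.spark.ml.feature.Imputer

// df is assumed to be a DataFrame with a numeric column "feature" containing missing values.
val imputer = new Imputer()
  .setInputCol("feature")
  .setOutputCol("feature_imputed")
  .setStrategy("median")

val imputed = imputer.fit(df).transform(df)
{code}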






[jira] [Resolved] (SPARK-29549) Union of DataSourceV2 datasources leads to duplicated results

2019-10-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29549.
--
Resolution: Cannot Reproduce

> Union of DataSourceV2 datasources leads to duplicated results
> -
>
> Key: SPARK-29549
> URL: https://issues.apache.org/jira/browse/SPARK-29549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4
>Reporter: Miguel Molina
>Priority: Major
>
> Hello!
> I've discovered that when two DataSourceV2 data frames in a query of the 
> exact same shape are joined and there is an aggregation in the query, only 
> the first results are used. The rest get removed by the ReuseExchange rule 
> and reuse the results of the first data frame, leading to N times the first 
> data frame results.
>  
> I've put together a repository with an example project where this can be 
> reproduced: [https://github.com/erizocosmico/spark-union-issue]
>  
> Basically, doing this:
>  
> {code:java}
> val products = spark.sql("SELECT name, COUNT(*) as count FROM products GROUP 
> BY name")
> val users = spark.sql("SELECT name, COUNT(*) as count FROM users GROUP BY 
> name")
> products.union(users)
>  .select("name")
>  .show(truncate = false, numRows = 50){code}
>  
>  
> Where products is:
> {noformat}
> +-+---+
> |name |id |
> +-+---+
> |candy |1 |
> |chocolate|2 |
> |milk |3 |
> |cinnamon |4 |
> |pizza |5 |
> |pineapple|6 |
> +-+---+{noformat}
> And users is:
> {noformat}
> +---+---+
> |name |id |
> +---+---+
> |andy |1 |
> |alice |2 |
> |mike |3 |
> |mariah |4 |
> |eleanor|5 |
> |matthew|6 |
> +---+---+ {noformat}
>  
> Results are incorrect:
> {noformat}
> +-+
> |name |
> +-+
> |candy |
> |pizza |
> |chocolate|
> |cinnamon |
> |pineapple|
> |milk |
> |candy |
> |pizza |
> |chocolate|
> |cinnamon |
> |pineapple|
> |milk |
> +-+{noformat}
>  
> This is the plan explained:
>  
> {noformat}
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('name, None)]
> +- AnalysisBarrier
>  +- Union
>  :- Aggregate [name#0], [name#0, count(1) AS count#8L]
>  : +- SubqueryAlias products
>  : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], 
> [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6]))
>  +- Aggregate [name#4], [name#4, count(1) AS count#12L]
>  +- SubqueryAlias users
>  +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], 
> [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6]))
> == Analyzed Logical Plan ==
> name: string
> Project [name#0]
> +- Union
>  :- Aggregate [name#0], [name#0, count(1) AS count#8L]
>  : +- SubqueryAlias products
>  : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], 
> [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6]))
>  +- Aggregate [name#4], [name#4, count(1) AS count#12L]
>  +- SubqueryAlias users
>  +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], 
> [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6]))
> == Optimized Logical Plan ==
> Union
> :- Aggregate [name#0], [name#0]
> : +- Project [name#0]
> : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], 
> [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6]))
> +- Aggregate [name#4], [name#4]
>  +- Project [name#4]
>  +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], 
> [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6]))
> == Physical Plan ==
> Union
> :- *(2) HashAggregate(keys=[name#0], functions=[], output=[name#0])
> : +- Exchange hashpartitioning(name#0, 200)
> : +- *(1) HashAggregate(keys=[name#0], functions=[], output=[name#0])
> : +- *(1) Project [name#0]
> : +- *(1) DataSourceV2Scan [name#0, id#1], DefaultReader(List([candy,1], 
> [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6]))
> +- *(4) HashAggregate(keys=[name#4], functions=[], output=[name#4])
>  +- ReusedExchange [name#4], Exchange hashpartitioning(name#0, 200)
> {noformat}
>  
>  
> In the physical plan, the first exchange is reused, but it shouldn't be 
> because both sources are not the same.
>  
> {noformat}
> == Physical Plan ==
> Union
> :- *(2) HashAggregate(keys=[name#0], functions=[], output=[name#0])
> : +- Exchange hashpartitioning(name#0, 200)
> : +- *(1) HashAggregate(keys=[name#0], functions=[], output=[name#0])
> : +- *(1) Project [name#0]
> : +- *(1) DataSourceV2Scan [name#0, id#1], DefaultReader(List([candy,1], 
> [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6]))
> +- *(4) HashAggregate(keys=[name#4], functions=[], output=[name#4])
>  +- ReusedExchange [name#4], Exchange hashpartitioning(name#0, 200){noformat}
>  
> This seems to be fixed in 2.4.x, but affects 2.3.x versions.



--

[jira] [Commented] (SPARK-29549) Union of DataSourceV2 datasources leads to duplicated results

2019-10-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961596#comment-16961596
 ] 

Hyukjin Kwon commented on SPARK-29549:
--

There will be no more releases in 2.3.x line.

> Union of DataSourceV2 datasources leads to duplicated results
> -
>
> Key: SPARK-29549
> URL: https://issues.apache.org/jira/browse/SPARK-29549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4
>Reporter: Miguel Molina
>Priority: Major
>
> Hello!
> I've discovered that when two DataSourceV2 data frames in a query of the 
> exact same shape are joined and there is an aggregation in the query, only 
> the first results are used. The rest get removed by the ReuseExchange rule 
> and reuse the results of the first data frame, leading to N times the first 
> data frame results.
>  
> I've put together a repository with an example project where this can be 
> reproduced: [https://github.com/erizocosmico/spark-union-issue]
>  
> Basically, doing this:
>  
> {code:java}
> val products = spark.sql("SELECT name, COUNT(*) as count FROM products GROUP 
> BY name")
> val users = spark.sql("SELECT name, COUNT(*) as count FROM users GROUP BY 
> name")
> products.union(users)
>  .select("name")
>  .show(truncate = false, numRows = 50){code}
>  
>  
> Where products is:
> {noformat}
> +-+---+
> |name |id |
> +-+---+
> |candy |1 |
> |chocolate|2 |
> |milk |3 |
> |cinnamon |4 |
> |pizza |5 |
> |pineapple|6 |
> +-+---+{noformat}
> And users is:
> {noformat}
> +---+---+
> |name |id |
> +---+---+
> |andy |1 |
> |alice |2 |
> |mike |3 |
> |mariah |4 |
> |eleanor|5 |
> |matthew|6 |
> +---+---+ {noformat}
>  
> Results are incorrect:
> {noformat}
> +-+
> |name |
> +-+
> |candy |
> |pizza |
> |chocolate|
> |cinnamon |
> |pineapple|
> |milk |
> |candy |
> |pizza |
> |chocolate|
> |cinnamon |
> |pineapple|
> |milk |
> +-+{noformat}
>  
> This is the plan explained:
>  
> {noformat}
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('name, None)]
> +- AnalysisBarrier
>  +- Union
>  :- Aggregate [name#0], [name#0, count(1) AS count#8L]
>  : +- SubqueryAlias products
>  : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], 
> [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6]))
>  +- Aggregate [name#4], [name#4, count(1) AS count#12L]
>  +- SubqueryAlias users
>  +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], 
> [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6]))
> == Analyzed Logical Plan ==
> name: string
> Project [name#0]
> +- Union
>  :- Aggregate [name#0], [name#0, count(1) AS count#8L]
>  : +- SubqueryAlias products
>  : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], 
> [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6]))
>  +- Aggregate [name#4], [name#4, count(1) AS count#12L]
>  +- SubqueryAlias users
>  +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], 
> [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6]))
> == Optimized Logical Plan ==
> Union
> :- Aggregate [name#0], [name#0]
> : +- Project [name#0]
> : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], 
> [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6]))
> +- Aggregate [name#4], [name#4]
>  +- Project [name#4]
>  +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], 
> [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6]))
> == Physical Plan ==
> Union
> :- *(2) HashAggregate(keys=[name#0], functions=[], output=[name#0])
> : +- Exchange hashpartitioning(name#0, 200)
> : +- *(1) HashAggregate(keys=[name#0], functions=[], output=[name#0])
> : +- *(1) Project [name#0]
> : +- *(1) DataSourceV2Scan [name#0, id#1], DefaultReader(List([candy,1], 
> [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6]))
> +- *(4) HashAggregate(keys=[name#4], functions=[], output=[name#4])
>  +- ReusedExchange [name#4], Exchange hashpartitioning(name#0, 200)
> {noformat}
>  
>  
> In the physical plan, the first exchange is reused, but it shouldn't be 
> because both sources are not the same.
>  
> {noformat}
> == Physical Plan ==
> Union
> :- *(2) HashAggregate(keys=[name#0], functions=[], output=[name#0])
> : +- Exchange hashpartitioning(name#0, 200)
> : +- *(1) HashAggregate(keys=[name#0], functions=[], output=[name#0])
> : +- *(1) Project [name#0]
> : +- *(1) DataSourceV2Scan [name#0, id#1], DefaultReader(List([candy,1], 
> [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6]))
> +- *(4) HashAggregate(keys=[name#4], functions=[], output=[name#4])
>  +- ReusedExchange [name#4], Exchange hashpartitioning(name#0, 200){noformat}
>  
> This seems to be 

[jira] [Updated] (SPARK-27842) Inconsistent results of Statistics.corr() and PearsonCorrelation.computeCorrelationMatrix()

2019-10-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27842:
-
Affects Version/s: 2.4.3

> Inconsistent results of Statistics.corr() and 
> PearsonCorrelation.computeCorrelationMatrix()
> ---
>
> Key: SPARK-27842
> URL: https://issues.apache.org/jira/browse/SPARK-27842
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core, Windows
>Affects Versions: 2.3.1, 2.4.3
>Reporter: Peter Nijem
>Priority: Major
> Attachments: vectorList.txt
>
>
> Hi,
> I am working with Spark Java API in local mode (1 node, 8 cores). Spark 
> version as follows in my pom.xml:
> *MLLib*
> _spark-mllib_2.11_
>  _2.3.1_
> *Core*
> _spark-core_2.11_
>  _2.3.1_
> I am experiencing inconsistent results of correlation when starting my Spark 
> application with 8 cores vs 1/2/3 cores.
> I've created a Main class which reads a list of vectors from a file: 240 
> vectors, each of length 226. 
> As you can see, I am firstly initializing Spark with local[*], running 
> Correlation, saving the Matrix and stopping Spark. Then, I do the same but 
> for local[3].
> I am expecting to get the same matrices on both runs. But this is not the 
> case. The input file is attached.
> I tried to compute the correlation using 
> PearsonCorrelation.computeCorrelationMatrix() but I faced the same issue here 
> as well.
>  
> In my work, I am dependent on the resulting correlation matrix. Thus, I am 
> experiencing bad results in my application due to the inconsistent results I 
> am getting. As a workaround, I am working now with local[1].
>  
>  
> {code:java}
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.apache.spark.mllib.linalg.DenseVector;
> import org.apache.spark.mllib.linalg.Matrix;
> import org.apache.spark.mllib.linalg.Vector;
> import org.apache.spark.mllib.stat.Statistics;
> import org.apache.spark.rdd.RDD;
> import org.junit.Assert;
> import java.io.BufferedReader;
> import java.io.FileReader;
> import java.io.IOException;
> import java.math.RoundingMode;
> import java.text.DecimalFormat;
> import java.util.ArrayList;
> import java.util.Arrays;
> import java.util.List;
> import java.util.stream.Collectors;
> public class TestSparkCorr {
> private static JavaSparkContext ctx;
> public static void main(String[] args) {
> List<List<Double>> doublesLists = readInputFile();
> List<Vector> resultVectors = getVectorsList(doublesLists);
> //===
> initSpark("*");
> RDD<Vector> RDD_vector = ctx.parallelize(resultVectors).rdd();
> Matrix m = Statistics.corr(RDD_vector, "pearson");
> stopSpark();
> //===
> initSpark("3");
> RDD<Vector> RDD_vector_3 = ctx.parallelize(resultVectors).rdd();
> Matrix m3 = Statistics.corr(RDD_vector_3, "pearson");
> stopSpark();
> //===
> Assert.assertEquals(m3, m);
> }
> private static List<Vector> getVectorsList(List<List<Double>> doublesLists) {
> List<Vector> resultVectors = new ArrayList<>(doublesLists.size());
> for (List<Double> vector : doublesLists) {
> double[] x = new double[vector.size()];
> for(int i = 0; i < x.length; i++){
> x[i] = vector.get(i);
> }
> resultVectors.add(new DenseVector(x));
> }
> return resultVectors;
> }
> private static List<List<Double>> readInputFile() {
> List<List<Double>> doublesLists = new ArrayList<>();
> try (BufferedReader reader = new BufferedReader(new FileReader(
> ".//output//vectorList.txt"))) {
> String line = reader.readLine();
> while (line != null) {
> String[] splitLine = line.substring(1, line.length() - 2).split(",");
> List<Double> doubles = Arrays.stream(splitLine).map(x -> 
> Double.valueOf(x.trim())).collect(Collectors.toList());
> doublesLists.add(doubles);
> // read next line
> line = reader.readLine();
> }
> } catch (IOException e) {
> e.printStackTrace();
> }
> return doublesLists;
> }
> private static void initSpark(String coresNum) {
> final SparkConf sparkConf = new SparkConf().setAppName("Span");
> sparkConf.setMaster(String.format("local[%s]", coresNum));
> sparkConf.set("spark.ui.enabled", "false");
> ctx = new JavaSparkContext(sparkConf);
> }
> private static void stopSpark() {
> ctx.stop();
> if(ctx.sc().isStopped()){
> ctx = null;
> }
> }
> }
> {code}
>  
>  
>  






[jira] [Commented] (SPARK-29624) Jenkins fails with "Python versions prior to 2.7 are not supported."

2019-10-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961595#comment-16961595
 ] 

Hyukjin Kwon commented on SPARK-29624:
--

Awesome [~shaneknapp]!

> Jenkins fails with "Python versions prior to 2.7 are not supported."
> 
>
> Key: SPARK-29624
> URL: https://issues.apache.org/jira/browse/SPARK-29624
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Shane Knapp
>Priority: Blocker
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112777/console
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112779/console
> {code}
> ...
> + ./dev/run-tests-jenkins
> Python versions prior to 2.7 are not supported.
> Build step 'Execute shell' marked build as failure
> {code}






[jira] [Updated] (SPARK-29625) Spark Structure Streaming Kafka Wrong Reset Offset twice

2019-10-28 Thread Sandish Kumar HN (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandish Kumar HN updated SPARK-29625:
-
Description: 
Spark Structured Streaming's Kafka source resets offsets twice: once with the right 
offsets and a second time with very old offsets.

[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO 
Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-151 to offset 0.

[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO 
Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-118 to offset 0.

[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO 
Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-85 to offset 0.
 **

*[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
INFO Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-52 to offset 122677634.*

[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO 
Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-19 to offset 0.
 **

*[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
INFO Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-52 to offset 120504922.*

[2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO 
ContextCleaner: Cleaned accumulator 810



This is causing a data loss issue.



 

{{[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
ERROR StreamExecution: Query [id = d62ca9e4-6650-454f-8691-a3d576d1e4ba, runId 
= 3946389f-222b-495c-9ab2-832c0422cbbb] terminated with error
[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - 
java.lang.IllegalStateException: Partition topic-52's offset was changed from 
122677598 to 120504922, some data may have been missed.
[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - Some data may have 
been lost because they are not available in Kafka any more; either the
[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -  data was aged out by 
Kafka or the topic may have been deleted before all the data in the
[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -  topic was processed. 
If you don't want your streaming query to fail on such cases, set the
[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -  source option 
"failOnDataLoss" to "false".
[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - 
[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at 
org.apache.spark.sql.kafka010.KafkaSource.org$apache$spark$sql$kafka010$KafkaSource$$reportDataLoss(KafkaSource.scala:329)
[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at 
org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:283)
[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at 
org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:281)
[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at 
scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at 
scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at 
scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at 
scala.collection.AbstractTraversable.filter(Traversable.scala:104)
[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at 
org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:281)
[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$7.apply(StreamExecution.scala:614)
[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at 
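The option named in the error text is set when building the Kafka source; a hedged sketch with placeholder broker and topic values, noting that it only suppresses the failure and does not address the double offset reset itself:

{code}
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
  .option("subscribe", "topic")                      // placeholder
  .option("failOnDataLoss", "false")                 // the option mentioned in the error
  .load()
{code}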

[jira] [Updated] (SPARK-29625) Spark Structure Streaming Kafka Wrong Reset Offset twice

2019-10-28 Thread Sandish Kumar HN (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandish Kumar HN updated SPARK-29625:
-
Summary: Spark Structure Streaming Kafka Wrong Reset Offset twice  (was: 
Spark Stucture Streaming Kafka Reset Offset twice)

> Spark Structure Streaming Kafka Wrong Reset Offset twice
> 
>
> Key: SPARK-29625
> URL: https://issues.apache.org/jira/browse/SPARK-29625
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Sandish Kumar HN
>Priority: Major
>
> Spark Structure Streaming Kafka Reset Offset twice, once with right offsets 
> and second time with very old offsets 
> [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
> INFO Fetcher: [Consumer clientId=consumer-1, 
> groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
>  Resetting offset for partition topic-151 to offset 0.
> [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
> INFO Fetcher: [Consumer clientId=consumer-1, 
> groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
>  Resetting offset for partition topic-118 to offset 0.
> [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
> INFO Fetcher: [Consumer clientId=consumer-1, 
> groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
>  Resetting offset for partition topic-85 to offset 0.
> **
> *[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
> INFO Fetcher: [Consumer clientId=consumer-1, 
> groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
>  Resetting offset for partition topic-52 to offset 122677634.*
> [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
> INFO Fetcher: [Consumer clientId=consumer-1, 
> groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
>  Resetting offset for partition topic-19 to offset 0.
> **
> *[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
> INFO Fetcher: [Consumer clientId=consumer-1, 
> groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
>  Resetting offset for partition topic-52 to offset 120504922.*
> [2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
> INFO ContextCleaner: Cleaned accumulator 810






[jira] [Created] (SPARK-29625) Spark Stucture Streaming Kafka Reset Offset twice

2019-10-28 Thread Sandish Kumar HN (Jira)
Sandish Kumar HN created SPARK-29625:


 Summary: Spark Stucture Streaming Kafka Reset Offset twice
 Key: SPARK-29625
 URL: https://issues.apache.org/jira/browse/SPARK-29625
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: Sandish Kumar HN


Spark Structured Streaming's Kafka source resets offsets twice: once with the right 
offsets and a second time with very old offsets.



[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO 
Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-151 to offset 0.


[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO 
Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-118 to offset 0.

[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO 
Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-85 to offset 0.
**

*[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
INFO Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-52 to offset 122677634.*

[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO 
Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-19 to offset 0.
**

*[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
INFO Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-52 to offset 120504922.*

[2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO 
ContextCleaner: Cleaned accumulator 810






[jira] [Updated] (SPARK-29625) Spark Structure Streaming Kafka Wrong Reset Offset twice

2019-10-28 Thread Sandish Kumar HN (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandish Kumar HN updated SPARK-29625:
-
Description: 
Spark Structured Streaming's Kafka source resets offsets twice: once with the right 
offsets and a second time with very old offsets.

[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO 
Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-151 to offset 0.

[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO 
Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-118 to offset 0.

[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO 
Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-85 to offset 0.
 **

*[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
INFO Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-52 to offset 122677634.*

[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO 
Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-19 to offset 0.
 **

*[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
INFO Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-52 to offset 120504922.*

[2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO 
ContextCleaner: Cleaned accumulator 810

  was:
Spark Stucture Streaming Kafka Reset Offset twice, once with right offsets and 
second time with very old offsets 



[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO 
Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-151 to offset 0.


[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO 
Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-118 to offset 0.

[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO 
Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-85 to offset 0.
**

*[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
INFO Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-52 to offset 122677634.*

[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO 
Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-19 to offset 0.
**

*[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
INFO Fetcher: [Consumer clientId=consumer-1, 
groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
 Resetting offset for partition topic-52 to offset 120504922.*

[2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO 
ContextCleaner: Cleaned accumulator 810


> Spark Structure Streaming Kafka Wrong Reset Offset twice
> 
>
> Key: SPARK-29625
> URL: https://issues.apache.org/jira/browse/SPARK-29625
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Sandish Kumar HN
>Priority: Major
>
> Spark Structure Streaming Kafka Reset Offset twice, once with right offsets 
> and second time with very old offsets 
> [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
> INFO Fetcher: [Consumer clientId=consumer-1, 
> groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
>  Resetting offset for partition topic-151 to offset 0.
> [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 
> INFO Fetcher: [Consumer clientId=consumer-1, 
> groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0]
>  Resetting offset for partition topic-118 to 

[jira] [Assigned] (SPARK-29609) DataSourceV2: Support DROP NAMESPACE

2019-10-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29609:
-

Assignee: Terry Kim

> DataSourceV2: Support DROP NAMESPACE
> 
>
> Key: SPARK-29609
> URL: https://issues.apache.org/jira/browse/SPARK-29609
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
>
> DROP NAMESPACE needs to support v2 catalogs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29609) DataSourceV2: Support DROP NAMESPACE

2019-10-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29609.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26262
[https://github.com/apache/spark/pull/26262]

> DataSourceV2: Support DROP NAMESPACE
> 
>
> Key: SPARK-29609
> URL: https://issues.apache.org/jira/browse/SPARK-29609
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.0.0
>
>
> DROP NAMESPACE needs to support v2 catalogs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28841) Spark cannot read a relative path containing ":"

2019-10-28 Thread Shixiong Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961412#comment-16961412
 ] 

Shixiong Zhu commented on SPARK-28841:
--

[~srowen] Yep. ":" is not a valid char in an HDFS path, but it is valid for some 
other file systems, such as the local file system or S3AFileSystem, and they allow 
the user to create such files. I pointed out the code that hits this issue in 
HDFS-14762.
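
For illustration, a minimal sketch of that mismatch (assuming a PySpark shell with a 
`spark` session on a Linux box; the directory name is made up): the local filesystem 
accepts ':' in a file name without complaint, while the same string fails as soon as 
Spark turns it into a Hadoop Path.

{code:python}
import os

# The local filesystem happily creates a name containing ':' ...
os.makedirs("/tmp/colon_demo", exist_ok=True)
with open("/tmp/colon_demo/test:test", "w") as f:
    f.write("dummy")

# ... but passing Spark a relative name containing ':' fails while the string is
# parsed into a Hadoop Path, before any data is read: roughly speaking, everything
# in front of the first ':' is interpreted as a URI scheme.
spark.read.parquet("test:test")
# java.lang.IllegalArgumentException: java.net.URISyntaxException:
#   Relative path in absolute URI: test:test
{code}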

> Spark cannot read a relative path containing ":"
> 
>
> Key: SPARK-28841
> URL: https://issues.apache.org/jira/browse/SPARK-28841
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>
> Reproducer:
> {code}
> spark.read.parquet("test:test")
> {code}
> Error:
> {code}
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: test:test
> {code}
> This is actually a Hadoop issue since the error is thrown from "new 
> Path("test:test")". I'm creating this ticket to see if we can work around 
> this issue in Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28841) Spark cannot read a relative path containing ":"

2019-10-28 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961408#comment-16961408
 ] 

Sean R. Owen commented on SPARK-28841:
--

Oh I see, OK. So is this mostly a Hadoop-side issue?

> Spark cannot read a relative path containing ":"
> 
>
> Key: SPARK-28841
> URL: https://issues.apache.org/jira/browse/SPARK-28841
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>
> Reproducer:
> {code}
> spark.read.parquet("test:test")
> {code}
> Error:
> {code}
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: test:test
> {code}
> This is actually a Hadoop issue since the error is thrown from "new 
> Path("test:test")". I'm creating this ticket to see if we can work around 
> this issue in Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27592) Set the bucketed data source table SerDe correctly

2019-10-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961404#comment-16961404
 ] 

Dongjoon Hyun edited comment on SPARK-27592 at 10/28/19 7:44 PM:
-

Ping [~Patnaik] since you asked about this on SPARK-29234. Originally, this was 
registered as an improvement PR, so we didn't backport it to the older 
branches. However, given the situation, I'm also fine with [~yumwang] 
backporting these since he is the author of the change.

BTW, please don't expect backports to EOL branches like `branch-2.3`.
 - [https://spark.apache.org/versioning-policy.html]
{quote}Feature release branches will, generally, be maintained with bug fix 
releases for a period of 18 months. For example, branch 2.3.x is no longer 
considered maintained as of September 2019, 18 months after the release of 
2.3.0 in February 2018. No more 2.3.x releases should be expected after that 
point, even for bug fixes.
{quote}


was (Author: dongjoon):
Ping [~Patnaik] since you asked about this on SPARK-29234. Originally, this is 
registered as an improvement PR, so we don't backport this to the older 
branches. However, given the situation, I'm also fine for [~yumwang] to 
backporting these since he is the author of this.

BTW, please don't expect a backporting to EOL branches like `branch-2.3`.
 - [https://spark.apache.org/versioning-policy.html]
{quote}Feature release branches will, generally, be maintained with bug fix 
releases for a period of 18 months. For example, branch 2.3.x is no longer 
considered maintained as of September 2019, 18 months after the release of 
2.3.0 in February 2018. No more 2.3.x releases should be expected after that 
point, even for bug fixes.
{quote}

> Set the bucketed data source table SerDe correctly
> --
>
> Key: SPARK-27592
> URL: https://issues.apache.org/jira/browse/SPARK-27592
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> We hint Hive to use an incorrect 
> InputFormat (org.apache.hadoop.mapred.SequenceFileInputFormat) to read Spark's 
> Parquet data source bucketed table:
> {noformat}
> spark-sql> CREATE TABLE t (c1 INT, c2 INT) USING parquet CLUSTERED BY (c1) 
> SORTED BY (c1) INTO 2 BUCKETS;
>  2019-04-29 17:52:05 WARN HiveExternalCatalog:66 - Persisting bucketed data 
> source table `default`.`t` into Hive metastore in Spark SQL specific format, 
> which is NOT compatible with Hive.
>  spark-sql> DESC EXTENDED t;
>  c1 int NULL
>  c2 int NULL
>  # Detailed Table Information
>  Database default
>  Table t
>  Owner yumwang
>  Created Time Mon Apr 29 17:52:05 CST 2019
>  Last Access Thu Jan 01 08:00:00 CST 1970
>  Created By Spark 2.4.0
>  Type MANAGED
>  Provider parquet
>  Num Buckets 2
>  Bucket Columns [`c1`]
>  Sort Columns [`c1`]
>  Table Properties [transient_lastDdlTime=1556531525]
>  Location [file:/user/hive/warehouse/t|file:///user/hive/warehouse/t]
>  Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>  InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
>  OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  Storage Properties [serialization.format=1]
> {noformat}
>  We can see incompatible information when creating the table:
> {noformat}
> WARN HiveExternalCatalog:66 - Persisting bucketed data source table 
> `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT 
> compatible with Hive.
> {noformat}
>  But downstream engines don't know about this incompatibility. I'd like to write the write 
> information of this table to the metadata so that each engine can decide 
> compatibility itself.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27592) Set the bucketed data source table SerDe correctly

2019-10-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961404#comment-16961404
 ] 

Dongjoon Hyun commented on SPARK-27592:
---

Ping [~Patnaik] since you asked about this on SPARK-29234. Originally, this is 
registered as an improvement PR, so we don't backport it to the older 
branches. However, given the situation, I'm also fine with [~yumwang] 
backporting these since he is the author of the change.

BTW, please don't expect backports to EOL branches like `branch-2.3`.
 - [https://spark.apache.org/versioning-policy.html]
{quote}Feature release branches will, generally, be maintained with bug fix 
releases for a period of 18 months. For example, branch 2.3.x is no longer 
considered maintained as of September 2019, 18 months after the release of 
2.3.0 in February 2018. No more 2.3.x releases should be expected after that 
point, even for bug fixes.
{quote}

> Set the bucketed data source table SerDe correctly
> --
>
> Key: SPARK-27592
> URL: https://issues.apache.org/jira/browse/SPARK-27592
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> We hint Hive to use an incorrect 
> InputFormat (org.apache.hadoop.mapred.SequenceFileInputFormat) to read Spark's 
> Parquet data source bucketed table:
> {noformat}
> spark-sql> CREATE TABLE t (c1 INT, c2 INT) USING parquet CLUSTERED BY (c1) 
> SORTED BY (c1) INTO 2 BUCKETS;
>  2019-04-29 17:52:05 WARN HiveExternalCatalog:66 - Persisting bucketed data 
> source table `default`.`t` into Hive metastore in Spark SQL specific format, 
> which is NOT compatible with Hive.
>  spark-sql> DESC EXTENDED t;
>  c1 int NULL
>  c2 int NULL
>  # Detailed Table Information
>  Database default
>  Table t
>  Owner yumwang
>  Created Time Mon Apr 29 17:52:05 CST 2019
>  Last Access Thu Jan 01 08:00:00 CST 1970
>  Created By Spark 2.4.0
>  Type MANAGED
>  Provider parquet
>  Num Buckets 2
>  Bucket Columns [`c1`]
>  Sort Columns [`c1`]
>  Table Properties [transient_lastDdlTime=1556531525]
>  Location [file:/user/hive/warehouse/t|file:///user/hive/warehouse/t]
>  Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>  InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
>  OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  Storage Properties [serialization.format=1]
> {noformat}
>  We can see incompatible information when creating the table:
> {noformat}
> WARN HiveExternalCatalog:66 - Persisting bucketed data source table 
> `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT 
> compatible with Hive.
> {noformat}
>  But downstream engines don't know about this incompatibility. I'd like to write the write 
> information of this table to the metadata so that each engine can decide 
> compatibility itself.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28841) Spark cannot read a relative path containing ":"

2019-10-28 Thread Shixiong Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961402#comment-16961402
 ] 

Shixiong Zhu commented on SPARK-28841:
--

[~srowen] "?" is to trigger glob pattern codes. It matches any chars. That's 
not a typo. The code I post here is trying to read files that match the pattern 
"/tmp/test/foo?bar". But if "/tmp/test/foo:bar" exists, it will fail because of 
HDFS-14762.
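
A minimal sketch of that scenario (assuming a PySpark session named `spark` and a 
local filesystem; the read call from the earlier snippet is not shown in this thread, 
so `spark.read.text` stands in for it):

{code:python}
import os

os.makedirs("/tmp/test", exist_ok=True)
with open("/tmp/test/fooxbar", "w") as f:   # the file the glob is actually aimed at
    f.write("a")
with open("/tmp/test/foo:bar", "w") as f:   # also matched, since '?' matches any single char
    f.write("b")

# The '?' turns the path into a glob pattern; expanding it has to parse
# "/tmp/test/foo:bar", and its ':' then breaks Hadoop's globbing (HDFS-14762).
spark.read.text("/tmp/test/foo?bar").show()
{code}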

> Spark cannot read a relative path containing ":"
> 
>
> Key: SPARK-28841
> URL: https://issues.apache.org/jira/browse/SPARK-28841
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>
> Reproducer:
> {code}
> spark.read.parquet("test:test")
> {code}
> Error:
> {code}
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: test:test
> {code}
> This is actually a Hadoop issue since the error is thrown from "new 
> Path("test:test")". I'm creating this ticket to see if we can work around 
> this issue in Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format

2019-10-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961395#comment-16961395
 ] 

Dongjoon Hyun commented on SPARK-29234:
---

[~Patnaik]. As you see, this is closed as `Duplicated`.  You had better ping us 
on SPARK-27592. :)

> bucketed table created by Spark SQL DataFrame is in SequenceFile format
> ---
>
> Key: SPARK-29234
> URL: https://issues.apache.org/jira/browse/SPARK-29234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Major
>
> When we create a bucketed table as follows, its input and output formats are 
> displayed as SequenceFile. But physically the files are 
> created in HDFS in the format specified by the user, e.g. 
> orc, parquet, etc.
> df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")
> in Hive, DESCRIBE FORMATTED OrdersExample;
> describe formatted ordersExample;
> OK
> # col_name  data_type   comment
> col array   from deserializer
> # Storage Information
> SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
> OutputFormat:   
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> Querying the same table in Hive is giving error.
> select * from OrdersExample;
> OK
> Failed with exception java.io.IOException:java.io.IOException: 
> hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
>  not a SequenceFile



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29624) Jenkins fails with "Python versions prior to 2.7 are not supported."

2019-10-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29624:
-

Assignee: Shane Knapp

> Jenkins fails with "Python versions prior to 2.7 are not supported."
> 
>
> Key: SPARK-29624
> URL: https://issues.apache.org/jira/browse/SPARK-29624
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Shane Knapp
>Priority: Blocker
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112777/console
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112779/console
> {code}
> ...
> + ./dev/run-tests-jenkins
> Python versions prior to 2.7 are not supported.
> Build step 'Execute shell' marked build as failure
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29624) Jenkins fails with "Python versions prior to 2.7 are not supported."

2019-10-28 Thread Shane Knapp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shane Knapp resolved SPARK-29624.
-
Resolution: Fixed

> Jenkins fails with "Python versions prior to 2.7 are not supported."
> 
>
> Key: SPARK-29624
> URL: https://issues.apache.org/jira/browse/SPARK-29624
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112777/console
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112779/console
> {code}
> ...
> + ./dev/run-tests-jenkins
> Python versions prior to 2.7 are not supported.
> Build step 'Execute shell' marked build as failure
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29624) Jenkins fails with "Python versions prior to 2.7 are not supported."

2019-10-28 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961377#comment-16961377
 ] 

Shane Knapp commented on SPARK-29624:
-

Alright, a pull request build has successfully made it past this check:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112790/

Resolving now.

> Jenkins fails with "Python versions prior to 2.7 are not supported."
> 
>
> Key: SPARK-29624
> URL: https://issues.apache.org/jira/browse/SPARK-29624
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112777/console
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112779/console
> {code}
> ...
> + ./dev/run-tests-jenkins
> Python versions prior to 2.7 are not supported.
> Build step 'Execute shell' marked build as failure
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29624) Jenkins fails with "Python versions prior to 2.7 are not supported."

2019-10-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961369#comment-16961369
 ] 

Dongjoon Hyun commented on SPARK-29624:
---

Thank you so much!

> Jenkins fails with "Python versions prior to 2.7 are not supported."
> 
>
> Key: SPARK-29624
> URL: https://issues.apache.org/jira/browse/SPARK-29624
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112777/console
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112779/console
> {code}
> ...
> + ./dev/run-tests-jenkins
> Python versions prior to 2.7 are not supported.
> Build step 'Execute shell' marked build as failure
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29509) Deduplicate code blocks in Kafka data source

2019-10-28 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin resolved SPARK-29509.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26158
[https://github.com/apache/spark/pull/26158]

> Deduplicate code blocks in Kafka data source
> 
>
> Key: SPARK-29509
> URL: https://issues.apache.org/jira/browse/SPARK-29509
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
> Fix For: 3.0.0
>
>
> There are a bunch of methods in the Kafka data source which have repeated lines within a 
> method - especially ones tied to the number of fields in the writer schema, so 
> once we add a new field the amount of redundant code grows. This issue 
> tracks the effort to deduplicate them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29509) Deduplicate code blocks in Kafka data source

2019-10-28 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin reassigned SPARK-29509:
--

Assignee: Jungtaek Lim

> Deduplicate code blocks in Kafka data source
> 
>
> Key: SPARK-29509
> URL: https://issues.apache.org/jira/browse/SPARK-29509
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
>
> There are a bunch of methods in the Kafka data source which have repeated lines within a 
> method - especially ones tied to the number of fields in the writer schema, so 
> once we add a new field the amount of redundant code grows. This issue 
> tracks the effort to deduplicate them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29624) Jenkins fails with "Python versions prior to 2.7 are not supported."

2019-10-28 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961320#comment-16961320
 ] 

Shane Knapp commented on SPARK-29624:
-

Alright, I triggered a job and checked the console, and the restart seemed to 
fix the PATH variable.

Both of the builds above failed on amp-jenkins-worker-03, so I'll keep an eye 
on that worker.
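
For context, the message in the failed builds comes from a version guard at the top of 
the test scripts; it is roughly the following kind of check (an illustrative sketch, 
not the exact Spark code), which is why a worker whose PATH resolves to an ancient 
interpreter fails immediately:

{code:python}
import sys

# Bail out early if the interpreter picked up from PATH is too old.
if sys.version_info < (2, 7):
    print("Python versions prior to 2.7 are not supported.")
    sys.exit(-1)
{code}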

> Jenkins fails with "Python versions prior to 2.7 are not supported."
> 
>
> Key: SPARK-29624
> URL: https://issues.apache.org/jira/browse/SPARK-29624
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112777/console
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112779/console
> {code}
> ...
> + ./dev/run-tests-jenkins
> Python versions prior to 2.7 are not supported.
> Build step 'Execute shell' marked build as failure
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29624) Jenkins fails with "Python versions prior to 2.7 are not supported."

2019-10-28 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961290#comment-16961290
 ] 

Shane Knapp commented on SPARK-29624:
-

Nothing changed... the PATH env vars for each worker got borked during the 
downtime.

I'll need to restart Jenkins, and will send a note to the list about this.

> Jenkins fails with "Python versions prior to 2.7 are not supported."
> 
>
> Key: SPARK-29624
> URL: https://issues.apache.org/jira/browse/SPARK-29624
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112777/console
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112779/console
> {code}
> ...
> + ./dev/run-tests-jenkins
> Python versions prior to 2.7 are not supported.
> Build step 'Execute shell' marked build as failure
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29624) Jenkins fails with "Python versions prior to 2.7 are not supported."

2019-10-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961287#comment-16961287
 ] 

Dongjoon Hyun commented on SPARK-29624:
---

Hi, [~shaneknapp].
The Jenkins server seems to have changed. Could you take a look at this issue?
cc [~cloud_fan] and [~hyukjin.kwon]

> Jenkins fails with "Python versions prior to 2.7 are not supported."
> 
>
> Key: SPARK-29624
> URL: https://issues.apache.org/jira/browse/SPARK-29624
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112777/console
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112779/console
> {code}
> ...
> + ./dev/run-tests-jenkins
> Python versions prior to 2.7 are not supported.
> Build step 'Execute shell' marked build as failure
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29624) Jenkins fails with "Python versions prior to 2.7 are not supported."

2019-10-28 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-29624:
-

 Summary: Jenkins fails with "Python versions prior to 2.7 are not 
supported."
 Key: SPARK-29624
 URL: https://issues.apache.org/jira/browse/SPARK-29624
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


- 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112777/console
- 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112779/console

{code}
...
+ ./dev/run-tests-jenkins
Python versions prior to 2.7 are not supported.
Build step 'Execute shell' marked build as failure
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29521) LOAD DATA INTO TABLE should look up catalog/table like v2 commands

2019-10-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29521.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26178
[https://github.com/apache/spark/pull/26178]

> LOAD DATA INTO TABLE should look up catalog/table like v2 commands
> --
>
> Key: SPARK-29521
> URL: https://issues.apache.org/jira/browse/SPARK-29521
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> LOAD DATA INTO TABLE should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22000) org.codehaus.commons.compiler.CompileException: toString method is not declared

2019-10-28 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961171#comment-16961171
 ] 

Sean R. Owen commented on SPARK-22000:
--

As indicated, it's fixed in 3.0, not 2.4.x. The change is substantial, so it is 
hard to back-port, but I'd review a change that includes the three linked PRs 
above in 2.4, if it can be made to work.

> org.codehaus.commons.compiler.CompileException: toString method is not 
> declared
> ---
>
> Key: SPARK-22000
> URL: https://issues.apache.org/jira/browse/SPARK-22000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: testcase.zip
>
>
> The error message says that toString is not declared on "value13", which is of 
> "long" type in the generated code.
> I think value13 should be of Long type.
> ==error message
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 70, Column 32: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 70, Column 32: A method named "toString" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> /* 033 */   private void apply1_2(InternalRow i) {
> /* 034 */
> /* 035 */
> /* 036 */ boolean isNull11 = i.isNullAt(1);
> /* 037 */ UTF8String value11 = isNull11 ? null : (i.getUTF8String(1));
> /* 038 */ boolean isNull10 = true;
> /* 039 */ java.lang.String value10 = null;
> /* 040 */ if (!isNull11) {
> /* 041 */
> /* 042 */   isNull10 = false;
> /* 043 */   if (!isNull10) {
> /* 044 */
> /* 045 */ Object funcResult4 = null;
> /* 046 */ funcResult4 = value11.toString();
> /* 047 */
> /* 048 */ if (funcResult4 != null) {
> /* 049 */   value10 = (java.lang.String) funcResult4;
> /* 050 */ } else {
> /* 051 */   isNull10 = true;
> /* 052 */ }
> /* 053 */
> /* 054 */
> /* 055 */   }
> /* 056 */ }
> /* 057 */ javaBean.setApp(value10);
> /* 058 */
> /* 059 */
> /* 060 */ boolean isNull13 = i.isNullAt(12);
> /* 061 */ long value13 = isNull13 ? -1L : (i.getLong(12));
> /* 062 */ boolean isNull12 = true;
> /* 063 */ java.lang.String value12 = null;
> /* 064 */ if (!isNull13) {
> /* 065 */
> /* 066 */   isNull12 = false;
> /* 067 */   if (!isNull12) {
> /* 068 */
> /* 069 */ Object funcResult5 = null;
> /* 070 */ funcResult5 = value13.toString();
> /* 071 */
> /* 072 */ if (funcResult5 != null) {
> /* 073 */   value12 = (java.lang.String) funcResult5;
> /* 074 */ } else {
> /* 075 */   isNull12 = true;
> /* 076 */ }
> /* 077 */
> /* 078 */
> /* 079 */   }
> /* 080 */ }
> /* 081 */ javaBean.setReasonCode(value12);
> /* 082 */
> /* 083 */   }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22000) org.codehaus.commons.compiler.CompileException: toString method is not declared

2019-10-28 Thread Shyam (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961159#comment-16961159
 ] 

Shyam commented on SPARK-22000:
---

[~srowen] I am using Spark 2.4.1 and getting the same error. For more 
details please check 
[https://stackoverflow.com/questions/58593215/inserting-into-cassandra-table-from-spark-dataframe-results-in-org-codehaus-comm].
How can this be fixed?

> org.codehaus.commons.compiler.CompileException: toString method is not 
> declared
> ---
>
> Key: SPARK-22000
> URL: https://issues.apache.org/jira/browse/SPARK-22000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: testcase.zip
>
>
> The error message says that toString is not declared on "value13", which is of 
> "long" type in the generated code.
> I think value13 should be of Long type.
> ==error message
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 70, Column 32: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 70, Column 32: A method named "toString" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> /* 033 */   private void apply1_2(InternalRow i) {
> /* 034 */
> /* 035 */
> /* 036 */ boolean isNull11 = i.isNullAt(1);
> /* 037 */ UTF8String value11 = isNull11 ? null : (i.getUTF8String(1));
> /* 038 */ boolean isNull10 = true;
> /* 039 */ java.lang.String value10 = null;
> /* 040 */ if (!isNull11) {
> /* 041 */
> /* 042 */   isNull10 = false;
> /* 043 */   if (!isNull10) {
> /* 044 */
> /* 045 */ Object funcResult4 = null;
> /* 046 */ funcResult4 = value11.toString();
> /* 047 */
> /* 048 */ if (funcResult4 != null) {
> /* 049 */   value10 = (java.lang.String) funcResult4;
> /* 050 */ } else {
> /* 051 */   isNull10 = true;
> /* 052 */ }
> /* 053 */
> /* 054 */
> /* 055 */   }
> /* 056 */ }
> /* 057 */ javaBean.setApp(value10);
> /* 058 */
> /* 059 */
> /* 060 */ boolean isNull13 = i.isNullAt(12);
> /* 061 */ long value13 = isNull13 ? -1L : (i.getLong(12));
> /* 062 */ boolean isNull12 = true;
> /* 063 */ java.lang.String value12 = null;
> /* 064 */ if (!isNull13) {
> /* 065 */
> /* 066 */   isNull12 = false;
> /* 067 */   if (!isNull12) {
> /* 068 */
> /* 069 */ Object funcResult5 = null;
> /* 070 */ funcResult5 = value13.toString();
> /* 071 */
> /* 072 */ if (funcResult5 != null) {
> /* 073 */   value12 = (java.lang.String) funcResult5;
> /* 074 */ } else {
> /* 075 */   isNull12 = true;
> /* 076 */ }
> /* 077 */
> /* 078 */
> /* 079 */   }
> /* 080 */ }
> /* 081 */ javaBean.setReasonCode(value12);
> /* 082 */
> /* 083 */   }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29623) do not allow multiple unit TO unit statements in interval literal syntax

2019-10-28 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-29623:
---

 Summary: do not allow multiple unit TO unit statements in interval 
literal syntax
 Key: SPARK-29623
 URL: https://issues.apache.org/jira/browse/SPARK-29623
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29622) do not allow leading 'interval' in the interval string format

2019-10-28 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-29622:
---

 Summary: do not allow leading 'interval' in the interval string 
format
 Key: SPARK-29622
 URL: https://issues.apache.org/jira/browse/SPARK-29622
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29599) Support pagination for session table in JDBC/ODBC Tab

2019-10-28 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29599.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26253
[https://github.com/apache/spark/pull/26253]

> Support pagination for session table in JDBC/ODBC Tab 
> --
>
> Key: SPARK-29599
> URL: https://issues.apache.org/jira/browse/SPARK-29599
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
> Fix For: 3.0.0
>
>
> Support pagination for session table in JDBC/ODBC Session page 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29599) Support pagination for session table in JDBC/ODBC Tab

2019-10-28 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-29599:


Assignee: angerszhu

> Support pagination for session table in JDBC/ODBC Tab 
> --
>
> Key: SPARK-29599
> URL: https://issues.apache.org/jira/browse/SPARK-29599
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
>
> Support pagination for session table in JDBC/ODBC Session page 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation

2019-10-28 Thread Suchintak Patnaik (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suchintak Patnaik updated SPARK-29621:
--
Labels: PySpark SparkSQL  (was: )

> Querying internal corrupt record column should not be allowed in filter 
> operation
> -
>
> Key: SPARK-29621
> URL: https://issues.apache.org/jira/browse/SPARK-29621
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Major
>  Labels: PySpark, SparkSQL
>
> As per 
> *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126)*,
> _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when 
> the referenced columns only include the internal corrupt record column"_
> But querying only the internal corrupt record column is still allowed in the 
> case of a *filter* operation.
> from pyspark.sql.types import *
> schema = 
> StructType([StructField("_corrupt_record",StringType(),False),StructField("Name",StringType(),False),StructField("Colour",StringType(),True),StructField("Price",IntegerType(),True),StructField("Quantity",IntegerType(),True)])
> df = spark.read.csv("fruit.csv",schema=schema,mode="PERMISSIVE")
> df.filter(df._corrupt_record.isNotNull()).show()   # Allowed



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation

2019-10-28 Thread Suchintak Patnaik (Jira)
Suchintak Patnaik created SPARK-29621:
-

 Summary: Querying internal corrupt record column should not be 
allowed in filter operation
 Key: SPARK-29621
 URL: https://issues.apache.org/jira/browse/SPARK-29621
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Suchintak Patnaik


As per 
*https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126)*,
_"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the 
referenced columns only include the internal corrupt record column"_

But querying only the internal corrupt record column is still allowed in the 
case of a *filter* operation.

from pyspark.sql.types import *
schema = 
StructType([StructField("_corrupt_record",StringType(),False),StructField("Name",StringType(),False),StructField("Colour",StringType(),True),StructField("Price",IntegerType(),True),StructField("Quantity",IntegerType(),True)])

df = spark.read.csv("fruit.csv",schema=schema,mode="PERMISSIVE")

df.filter(df._corrupt_record.isNotNull()).show()   # Allowed
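
For contrast, a short sketch of the behaviour the quoted rule is meant to enforce, 
using the same `df` (the exact exception message may vary by version):

{code:python}
# Disallowed: the query references only the internal corrupt record column, so
# Spark 2.3+ raises an AnalysisException suggesting to cache or save the parsed
# results first.
df.select(df._corrupt_record).show()

# Allowed: another column is referenced alongside the corrupt record column.
df.select("Name", "_corrupt_record").show()
{code}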



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29549) Union of DataSourceV2 datasources leads to duplicated results

2019-10-28 Thread Miguel Molina (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961043#comment-16961043
 ] 

Miguel Molina commented on SPARK-29549:
---

Hi [~hyukjin.kwon]. In Spark 2.4 it works correctly; this only affects 2.3.x.
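
For anyone stuck on 2.3.x, one possible mitigation (untested here, and it assumes the 
internal `spark.sql.exchange.reuse` flag is present and settable in that line) is to 
disable exchange reuse around the affected query, since it is the ReuseExchange rule 
that substitutes the first exchange's results. Written as PySpark for illustration, 
with `products` and `users` standing in for the data frames from the description:

{code:python}
# Hypothetical workaround sketch for 2.3.x only; exchange reuse is a useful
# optimization in general, so re-enable it once the affected query is done.
spark.conf.set("spark.sql.exchange.reuse", "false")
products.union(users).select("name").show(50, truncate=False)
spark.conf.set("spark.sql.exchange.reuse", "true")
{code}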

> Union of DataSourceV2 datasources leads to duplicated results
> -
>
> Key: SPARK-29549
> URL: https://issues.apache.org/jira/browse/SPARK-29549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4
>Reporter: Miguel Molina
>Priority: Major
>
> Hello!
> I've discovered that when two DataSourceV2 data frames of the 
> exact same shape are combined in a query (here via a union) and there is an 
> aggregation, only the first data frame's results are used. The rest are replaced 
> by the ReuseExchange rule, which reuses the results of the first data frame, 
> leading to N copies of the first data frame's results.
>  
> I've put together a repository with an example project where this can be 
> reproduced: [https://github.com/erizocosmico/spark-union-issue]
>  
> Basically, doing this:
>  
> {code:java}
> val products = spark.sql("SELECT name, COUNT(*) as count FROM products GROUP 
> BY name")
> val users = spark.sql("SELECT name, COUNT(*) as count FROM users GROUP BY 
> name")
> products.union(users)
>  .select("name")
>  .show(truncate = false, numRows = 50){code}
>  
>  
> Where products is:
> {noformat}
> +---------+---+
> |name     |id |
> +---------+---+
> |candy    |1  |
> |chocolate|2  |
> |milk     |3  |
> |cinnamon |4  |
> |pizza    |5  |
> |pineapple|6  |
> +---------+---+{noformat}
> And users is:
> {noformat}
> +-------+---+
> |name   |id |
> +-------+---+
> |andy   |1  |
> |alice  |2  |
> |mike   |3  |
> |mariah |4  |
> |eleanor|5  |
> |matthew|6  |
> +-------+---+ {noformat}
>  
> Results are incorrect:
> {noformat}
> +---------+
> |name     |
> +---------+
> |candy    |
> |pizza    |
> |chocolate|
> |cinnamon |
> |pineapple|
> |milk     |
> |candy    |
> |pizza    |
> |chocolate|
> |cinnamon |
> |pineapple|
> |milk     |
> +---------+{noformat}
>  
> This is the plan explained:
>  
> {noformat}
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('name, None)]
> +- AnalysisBarrier
>  +- Union
>  :- Aggregate [name#0], [name#0, count(1) AS count#8L]
>  : +- SubqueryAlias products
>  : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], 
> [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6]))
>  +- Aggregate [name#4], [name#4, count(1) AS count#12L]
>  +- SubqueryAlias users
>  +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], 
> [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6]))
> == Analyzed Logical Plan ==
> name: string
> Project [name#0]
> +- Union
>  :- Aggregate [name#0], [name#0, count(1) AS count#8L]
>  : +- SubqueryAlias products
>  : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], 
> [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6]))
>  +- Aggregate [name#4], [name#4, count(1) AS count#12L]
>  +- SubqueryAlias users
>  +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], 
> [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6]))
> == Optimized Logical Plan ==
> Union
> :- Aggregate [name#0], [name#0]
> : +- Project [name#0]
> : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], 
> [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6]))
> +- Aggregate [name#4], [name#4]
>  +- Project [name#4]
>  +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], 
> [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6]))
> == Physical Plan ==
> Union
> :- *(2) HashAggregate(keys=[name#0], functions=[], output=[name#0])
> : +- Exchange hashpartitioning(name#0, 200)
> : +- *(1) HashAggregate(keys=[name#0], functions=[], output=[name#0])
> : +- *(1) Project [name#0]
> : +- *(1) DataSourceV2Scan [name#0, id#1], DefaultReader(List([candy,1], 
> [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6]))
> +- *(4) HashAggregate(keys=[name#4], functions=[], output=[name#4])
>  +- ReusedExchange [name#4], Exchange hashpartitioning(name#0, 200)
> {noformat}
>  
>  
> In the physical plan, the first exchange is reused, but it shouldn't be 
> because both sources are not the same.
>  
> {noformat}
> == Physical Plan ==
> Union
> :- *(2) HashAggregate(keys=[name#0], functions=[], output=[name#0])
> : +- Exchange hashpartitioning(name#0, 200)
> : +- *(1) HashAggregate(keys=[name#0], functions=[], output=[name#0])
> : +- *(1) Project [name#0]
> : +- *(1) DataSourceV2Scan [name#0, id#1], DefaultReader(List([candy,1], 
> [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6]))
> +- *(4) HashAggregate(keys=[name#4], functions=[], output=[name#4])
>  +- ReusedExchange [name#4], Exchange hashpartitioning(name#0, 

[jira] [Updated] (SPARK-29620) UnsafeKVExternalSorterSuite failure on bigendian system

2019-10-28 Thread salamani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

salamani updated SPARK-29620:
-
Description: 
spark/sql/core# ../../build/mvn -Dtest=none 
-DwildcardSuites=org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite test



UnsafeKVExternalSorterSuite:
12:24:24.305 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable
- kv sorting key schema [] and value schema [] *** FAILED ***
 java.lang.AssertionError: sizeInBytes (4) should be a multiple of 8
 at 
org.apache.spark.sql.catalyst.expressions.UnsafeRow.pointTo(UnsafeRow.java:168)
 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter$KVSorterIterator.next(UnsafeKVExternalSorter.java:297)
 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite.org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter(UnsafeKVExternalSorterSuite.scala:145)
 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply$mcV$sp(UnsafeKVExternalSorterSuite.scala:86)
 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86)
 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86)
 at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
 at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
 at org.scalatest.Transformer.apply(Transformer.scala:22)
 at org.scalatest.Transformer.apply(Transformer.scala:20)
 ...
- kv sorting key schema [int] and value schema [] *** FAILED ***
 java.lang.AssertionError: sizeInBytes (20) should be a multiple of 8
 at 
org.apache.spark.sql.catalyst.expressions.UnsafeRow.pointTo(UnsafeRow.java:168)
 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter$KVSorterIterator.next(UnsafeKVExternalSorter.java:297)
 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite.org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter(UnsafeKVExternalSorterSuite.scala:145)
 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply$mcV$sp(UnsafeKVExternalSorterSuite.scala:86)
 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86)
 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86)
 at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
 at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
 at org.scalatest.Transformer.apply(Transformer.scala:22)
 at org.scalatest.Transformer.apply(Transformer.scala:20)
 ...
- kv sorting key schema [] and value schema [int] *** FAILED ***
 java.lang.AssertionError: sizeInBytes (20) should be a multiple of 8
 at 
org.apache.spark.sql.catalyst.expressions.UnsafeRow.pointTo(UnsafeRow.java:168)
 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter$KVSorterIterator.next(UnsafeKVExternalSorter.java:297)
 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite.org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter(UnsafeKVExternalSorterSuite.scala:145)
 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply$mcV$sp(UnsafeKVExternalSorterSuite.scala:86)
 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86)
 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86)
 at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
 at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
 at org.scalatest.Transformer.apply(Transformer.scala:22)
 at org.scalatest.Transformer.apply(Transformer.scala:20)
 ...
- kv sorting key schema [int] and value schema 
[float,float,double,string,float] *** FAILED ***
 java.lang.AssertionError: sizeInBytes (2732) should be a multiple of 8
 at 
org.apache.spark.sql.catalyst.expressions.UnsafeRow.pointTo(UnsafeRow.java:168)
 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter$KVSorterIterator.next(UnsafeKVExternalSorter.java:297)
 at 

[jira] [Created] (SPARK-29620) UnsafeKVExternalSorterSuite failure on bigendian system

2019-10-28 Thread salamani (Jira)
salamani created SPARK-29620:


 Summary: UnsafeKVExternalSorterSuite failure on bigendian system
 Key: SPARK-29620
 URL: https://issues.apache.org/jira/browse/SPARK-29620
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 2.4.4
Reporter: salamani






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format

2019-10-28 Thread Suchintak Patnaik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961028#comment-16961028
 ] 

Suchintak Patnaik commented on SPARK-29234:
---

[~dongjoon]
[~yumwang]

Have the PRs been backported to Spark versions 2.3 and 2.4?

> bucketed table created by Spark SQL DataFrame is in SequenceFile format
> ---
>
> Key: SPARK-29234
> URL: https://issues.apache.org/jira/browse/SPARK-29234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Major
>
> When we create a bucketed table as follows, its input and output formats are 
> displayed as SequenceFile. But physically the files are 
> created in HDFS in the format specified by the user, e.g. 
> orc, parquet, etc.
> df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")
> in Hive, DESCRIBE FORMATTED OrdersExample;
> describe formatted ordersExample;
> OK
> # col_name  data_type   comment
> col array   from deserializer
> # Storage Information
> SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
> OutputFormat:   
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> Querying the same table in Hive is giving error.
> select * from OrdersExample;
> OK
> Failed with exception java.io.IOException:java.io.IOException: 
> hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
>  not a SequenceFile



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29459) Spark takes lot of time to initialize when the network is not connected to internet.

2019-10-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29459.
--
Resolution: Incomplete

1.5.2 is too old and an EOL release. Can you try a higher version and see if it 
works?

> Spark takes lot of time to initialize when the network is not connected to 
> internet.
> 
>
> Key: SPARK-29459
> URL: https://issues.apache.org/jira/browse/SPARK-29459
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 1.5.2
>Reporter: Raman
>Priority: Major
>
> Hi, I have faced two problems using Spark ML when my network is not 
> connected to the internet (airplane mode). All these issues occurred in a 
> Windows 10 and Java 8 environment.
>  # Spark by default searches for the internal IP for various operations 
> (like hosting the Spark web UI). There was a problem initializing Spark, and I 
> fixed it by setting "spark.driver.host" to "localhost".
>  # It takes around 120 seconds to initialize Spark. Is there 
> any way to reduce this time, e.g. a timeout-related parameter/config that 
> would resolve this?
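
A minimal sketch of the `spark.driver.host` workaround from point 1 above, written 
against the current SparkSession API rather than the 1.5.2-era SparkConf/SparkContext, 
so treat it as illustrative only:

{code:python}
from pyspark.sql import SparkSession

# Pin the driver host so Spark does not try to resolve an external hostname/IP
# while the machine is offline.
spark = (SparkSession.builder
         .appName("offline-demo")
         .master("local[*]")
         .config("spark.driver.host", "localhost")
         .getOrCreate())
{code}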



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29476) Add tooltip information for Thread Dump links and Thread details table columns in Executors Tab

2019-10-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961007#comment-16961007
 ] 

Hyukjin Kwon commented on SPARK-29476:
--

What does this Jira mean?

> Add tooltip information for Thread Dump links and Thread details table 
> columns in Executors Tab
> ---
>
> Key: SPARK-29476
> URL: https://issues.apache.org/jira/browse/SPARK-29476
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29477) Improve tooltip information for Streaming Tab

2019-10-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961005#comment-16961005
 ] 

Hyukjin Kwon commented on SPARK-29477:
--

Please describe what each Jira means clearly, [~abhishek.akg]

> Improve tooltip information for Streaming  Tab
> --
>
> Key: SPARK-29477
> URL: https://issues.apache.org/jira/browse/SPARK-29477
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> The Active Batches and Completed Batches tables can be revisited to provide proper 
> tooltips for the Batch Time and Record columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29590) Support hiding table in JDBC/ODBC server page in WebUI

2019-10-28 Thread shahid (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961006#comment-16961006
 ] 

shahid commented on SPARK-29590:


Hi [~hyukjin.kwon], this JIRA is similar to 
https://issues.apache.org/jira/browse/SPARK-25575. This one is for supporting it in 
the JDBC/ODBC server page.

> Support hiding table in JDBC/ODBC server page in WebUI
> --
>
> Key: SPARK-29590
> URL: https://issues.apache.org/jira/browse/SPARK-29590
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.4.4, 3.0.0
>Reporter: shahid
>Priority: Minor
>
> Support hiding table in JDBC/ODBC server page in WebUI



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29505) desc extended is case sensitive

2019-10-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29505:
-
Description: 
{code}
create table customer(id int, name String, *CName String*, address String, city 
String, pin int, country String);
insert into customer values(1,'Alfred','Maria','Obere Str 
57','Berlin',12209,'Germany');
insert into customer values(2,'Ana','trujilo','Adva de la','Maxico 
D.F.',05021,'Maxico');
insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 
2312','Maxico D.F.',05023,'Maxico');

analyze table customer compute statistics for columns cname; – *Success( Though 
cname is not as CName)*

desc extended customer cname; – Failed

jdbc:hive2://10.18.19.208:23040/default> desc extended customer *cname;*
+-+-+
| info_name | info_value |
+-+-+
| col_name | cname |
| data_type | string |
| comment | NULL |
| min | NULL |
| max | NULL |
| num_nulls | NULL |
| distinct_count | NULL |
| avg_col_len | NULL |
| max_col_len | NULL |
| histogram | NULL |
+-+--
{code}
 

But 

{code}
desc extended customer CName; – SUCCESS

0: jdbc:hive2://10.18.19.208:23040/default> desc extended customer *CName;*
+-+-+
| info_name | info_value |
+-+-+
| col_name | CName |
| data_type | string |
| comment | NULL |
| min | NULL |
| max | NULL |
| num_nulls | 0 |
| distinct_count | 3 |
| avg_col_len | 9 |
| max_col_len | 14 |
| histogram | NULL |
+-+-+
 {code}

 

  was:
{code}
create table customer(id int, name String, *CName String*, address String, city 
String, pin int, country String);
insert into customer values(1,'Alfred','Maria','Obere Str 
57','Berlin',12209,'Germany');
insert into customer values(2,'Ana','trujilo','Adva de la','Maxico 
D.F.',05021,'Maxico');
insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 
2312','Maxico D.F.',05023,'Maxico');

analyze table customer compute statistics for columns cname; – *Success( Though 
cname is not as CName)*

desc extended customer cname; – Failed

jdbc:hive2://10.18.19.208:23040/default> desc extended customer *cname;*
+-+-+
| info_name | info_value |
+-+-+
| col_name | cname |
| data_type | string |
| comment | NULL |
| min | NULL |
| max | NULL |
| num_nulls | NULL |
| distinct_count | NULL |
| avg_col_len | NULL |
| max_col_len | NULL |
| histogram | NULL |
+-+--
{code}
 

But 

desc extended customer CName; – SUCCESS

{code}
0: jdbc:hive2://10.18.19.208:23040/default> desc extended customer *CName;*
+-+-+
| info_name | info_value |
+-+-+
| col_name | CName |
| data_type | string |
| comment | NULL |
| min | NULL |
| max | NULL |
| num_nulls | 0 |
| distinct_count | 3 |
| avg_col_len | 9 |
| max_col_len | 14 |
| histogram | NULL |
+-+-+
 {code}

 


> desc extended   is case sensitive
> --
>
> Key: SPARK-29505
> URL: https://issues.apache.org/jira/browse/SPARK-29505
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> {code}
> create table customer(id int, name String, *CName String*, address String, 
> city String, pin int, country String);
> insert into customer values(1,'Alfred','Maria','Obere Str 
> 57','Berlin',12209,'Germany');
> insert into customer values(2,'Ana','trujilo','Adva de la','Maxico 
> D.F.',05021,'Maxico');
> insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 
> 2312','Maxico D.F.',05023,'Maxico');
> analyze table customer compute statistics for columns cname; – *Success( 
> Though cname is not as CName)*
> desc extended customer cname; – Failed
> jdbc:hive2://10.18.19.208:23040/default> desc extended customer *cname;*
> +-+-+
> | info_name | info_value |
> +-+-+
> | col_name | cname |
> | data_type | string |
> | comment | NULL |
> | min | NULL |
> | max | NULL |
> | num_nulls | NULL |
> | distinct_count | NULL |
> | avg_col_len | NULL |
> | max_col_len | NULL |
> | histogram | NULL |
> +-+--
> {code}
>  
> But 
> {code}
> desc extended customer CName; – SUCCESS
> 0: jdbc:hive2://10.18.19.208:23040/default> desc extended customer *CName;*
> +-+-+
> | info_name | info_value |
> +-+-+
> | col_name | CName |
> | data_type | string |
> | comment | NULL |
> | min | NULL |
> | max | NULL |
> | num_nulls | 0 |
> | distinct_count | 3 |
> | avg_col_len | 9 |
> | max_col_len | 14 |
> | histogram | NULL |
> 

[jira] [Updated] (SPARK-29505) desc extended is case sensitive

2019-10-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29505:
-
Description: 
{code}
create table customer(id int, name String, *CName String*, address String, city 
String, pin int, country String);
insert into customer values(1,'Alfred','Maria','Obere Str 
57','Berlin',12209,'Germany');
insert into customer values(2,'Ana','trujilo','Adva de la','Maxico 
D.F.',05021,'Maxico');
insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 
2312','Maxico D.F.',05023,'Maxico');

analyze table customer compute statistics for columns cname; – *Success( Though 
cname is not as CName)*

desc extended customer cname; – Failed

jdbc:hive2://10.18.19.208:23040/default> desc extended customer *cname;*
+-+-+
| info_name | info_value |
+-+-+
| col_name | cname |
| data_type | string |
| comment | NULL |
| min | NULL |
| max | NULL |
| num_nulls | NULL |
| distinct_count | NULL |
| avg_col_len | NULL |
| max_col_len | NULL |
| histogram | NULL |
+-+--
{code}
 

But 

desc extended customer CName; – SUCCESS

{code}
0: jdbc:hive2://10.18.19.208:23040/default> desc extended customer *CName;*
+-+-+
| info_name | info_value |
+-+-+
| col_name | CName |
| data_type | string |
| comment | NULL |
| min | NULL |
| max | NULL |
| num_nulls | 0 |
| distinct_count | 3 |
| avg_col_len | 9 |
| max_col_len | 14 |
| histogram | NULL |
+-+-+
 {code}

 

  was:
create table customer(id int, name String, *CName String*, address String, city 
String, pin int, country String);
insert into customer values(1,'Alfred','Maria','Obere Str 
57','Berlin',12209,'Germany');
insert into customer values(2,'Ana','trujilo','Adva de la','Maxico 
D.F.',05021,'Maxico');
insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 
2312','Maxico D.F.',05023,'Maxico');

analyze table customer compute statistics for columns cname; – *Success( Though 
cname is not as CName)*

desc extended customer cname; – Failed

jdbc:hive2://10.18.19.208:23040/default> desc extended customer *cname;*
+-+-+
| info_name | info_value |
+-+-+
| col_name | cname |
| data_type | string |
| comment | NULL |
| min | NULL |
| max | NULL |
| num_nulls | NULL |
| distinct_count | NULL |
| avg_col_len | NULL |
| max_col_len | NULL |
| histogram | NULL |
+-+--

 

But 

desc extended customer CName; – SUCCESS

0: jdbc:hive2://10.18.19.208:23040/default> desc extended customer *CName;*
+-+-+
| info_name | info_value |
+-+-+
| col_name | CName |
| data_type | string |
| comment | NULL |
| min | NULL |
| max | NULL |
| num_nulls | 0 |
| distinct_count | 3 |
| avg_col_len | 9 |
| max_col_len | 14 |
| histogram | NULL |
+-+-+

 

 

 


> desc extended   is case sensitive
> --
>
> Key: SPARK-29505
> URL: https://issues.apache.org/jira/browse/SPARK-29505
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> {code}
> create table customer(id int, name String, *CName String*, address String, 
> city String, pin int, country String);
> insert into customer values(1,'Alfred','Maria','Obere Str 
> 57','Berlin',12209,'Germany');
> insert into customer values(2,'Ana','trujilo','Adva de la','Maxico 
> D.F.',05021,'Maxico');
> insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 
> 2312','Maxico D.F.',05023,'Maxico');
> analyze table customer compute statistics for columns cname; – *Success( 
> Though cname is not as CName)*
> desc extended customer cname; – Failed
> jdbc:hive2://10.18.19.208:23040/default> desc extended customer *cname;*
> +-+-+
> | info_name | info_value |
> +-+-+
> | col_name | cname |
> | data_type | string |
> | comment | NULL |
> | min | NULL |
> | max | NULL |
> | num_nulls | NULL |
> | distinct_count | NULL |
> | avg_col_len | NULL |
> | max_col_len | NULL |
> | histogram | NULL |
> +-+--
> {code}
>  
> But 
> desc extended customer CName; – SUCCESS
> {code}
> 0: jdbc:hive2://10.18.19.208:23040/default> desc extended customer *CName;*
> +-+-+
> | info_name | info_value |
> +-+-+
> | col_name | CName |
> | data_type | string |
> | comment | NULL |
> | min | NULL |
> | max | NULL |
> | num_nulls | 0 |
> | distinct_count | 3 |
> | avg_col_len | 9 |
> | max_col_len | 14 |
> | histogram | NULL |
> +-+-+
>  {code}
>  

[jira] [Updated] (SPARK-29510) JobGroup ID is not set for the job submitted from Spark-SQL and Spark -Shell

2019-10-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29510:
-
Description: 
When a user submits jobs from spark-shell or Spark SQL, the job group ID is not 
set (UI screenshot attached).

But from Beeline the job group ID is set.

Steps:


{code:java}
create table customer(id int, name String, CName String, address String, city 
String, pin int, country String);
insert into customer values(1,'Alfred','Maria','Obere Str 
57','Berlin',12209,'Germany');
insert into customer values(2,'Ana','trujilo','Adva de la','Maxico 
D.F.',05021,'Maxico');
insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 
2312','Maxico D.F.',05023,'Maxico'); {code}
 

  was:
When user submit jobs from Spark-shell or SQL Job group id is not set. UI 
Screen shot attached.

But from beeline job Group ID is set.

Steps:

create table customer(id int, name String, CName String, address String, city 
String, pin int, country String);
insert into customer values(1,'Alfred','Maria','Obere Str 
57','Berlin',12209,'Germany');
insert into customer values(2,'Ana','trujilo','Adva de la','Maxico 
D.F.',05021,'Maxico');
insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 
2312','Maxico D.F.',05023,'Maxico');


> JobGroup ID is not set for the job submitted from Spark-SQL and Spark -Shell
> 
>
> Key: SPARK-29510
> URL: https://issues.apache.org/jira/browse/SPARK-29510
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: JobGroup1.png, JobGroup2.png, JobGroup3.png
>
>
> When a user submits jobs from spark-shell or Spark SQL, the job group ID is not 
> set (UI screenshot attached).
> But from Beeline the job group ID is set.
> Steps:
> {code:java}
> create table customer(id int, name String, CName String, address String, city 
> String, pin int, country String);
> insert into customer values(1,'Alfred','Maria','Obere Str 
> 57','Berlin',12209,'Germany');
> insert into customer values(2,'Ana','trujilo','Adva de la','Maxico 
> D.F.',05021,'Maxico');
> insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 
> 2312','Maxico D.F.',05023,'Maxico'); {code}
>  
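As a stop-gap until the shells set it automatically, a caller can tag the jobs with a group explicitly before running the SQL above; a minimal sketch for spark-shell (the group id and description strings are illustrative only):

{code:scala}
// SparkContext.setJobGroup tags every job submitted from this thread,
// so the group id shows up in the web UI's job list.
sc.setJobGroup("manual-sql-group", "insert into customer", interruptOnCancel = false)
spark.sql("insert into customer values(1,'Alfred','Maria','Obere Str 57','Berlin',12209,'Germany')")
sc.clearJobGroup() // stop tagging subsequent jobs
{code}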



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29534) Hanging tasks in DiskBlockObjectWriter.commitAndGet while calling native FileDispatcherImpl.size0

2019-10-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961001#comment-16961001
 ] 

Hyukjin Kwon commented on SPARK-29534:
--

Can you post the codes you ran as well?

>  Hanging tasks in DiskBlockObjectWriter.commitAndGet while calling native 
> FileDispatcherImpl.size0
> --
>
> Key: SPARK-29534
> URL: https://issues.apache.org/jira/browse/SPARK-29534
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.2
>Reporter: Bogdan
>Priority: Major
>
> Tasks are hanging in native _FileDispatcherImpl.size0_ with calling site from 
> _DiskBlockObjectWriter.commitAndGet_. Maybe enhance _DiskBlockObjectWriter_ 
> to handle this type of behaviour? 
>  
> The behaviour is consistent and happens on a daily basis. It has been 
> temporarily addressed by using speculative execution. However, for longer 
> tasks (1h+) the impact on runtime is significant.
>  
> |sun.nio.ch.FileDispatcherImpl.size0(Native Method)
>  sun.nio.ch.FileDispatcherImpl.size(FileDispatcherImpl.java:88)
>  sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:264) => holding 
> Monitor(java.lang.Object@853892634})
>  
> org.apache.spark.storage.DiskBlockObjectWriter.commitAndGet(DiskBlockObjectWriter.scala:183)
>  
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:204)
>  
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:272)
>  org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)
>  
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:403)
>  
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:267)
>  
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:188)
>  org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>  org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
>  org.apache.spark.scheduler.Task.run(Task.scala:121)
>  
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>  org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>  
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  java.lang.Thread.run(Thread.java:748)|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29549) Union of DataSourceV2 datasources leads to duplicated results

2019-10-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960999#comment-16960999
 ] 

Hyukjin Kwon commented on SPARK-29549:
--

Hi [~erizocosmico]. Spark 2.3.x is EOL. Can you try it out in 2.4?

> Union of DataSourceV2 datasources leads to duplicated results
> -
>
> Key: SPARK-29549
> URL: https://issues.apache.org/jira/browse/SPARK-29549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4
>Reporter: Miguel Molina
>Priority: Major
>
> Hello!
> I've discovered that when two DataSourceV2 data frames in a query of the 
> exact same shape are joined and there is an aggregation in the query, only 
> the first results are used. The rest get removed by the ReuseExchange rule 
> and reuse the results of the first data frame, leading to N times the first 
> data frame results.
>  
> I've put together a repository with an example project where this can be 
> reproduced: [https://github.com/erizocosmico/spark-union-issue]
>  
> Basically, doing this:
>  
> {code:java}
> val products = spark.sql("SELECT name, COUNT(*) as count FROM products GROUP 
> BY name")
> val users = spark.sql("SELECT name, COUNT(*) as count FROM users GROUP BY 
> name")
> products.union(users)
>  .select("name")
>  .show(truncate = false, numRows = 50){code}
>  
>  
> Where products is:
> {noformat}
> +-+---+
> |name |id |
> +-+---+
> |candy |1 |
> |chocolate|2 |
> |milk |3 |
> |cinnamon |4 |
> |pizza |5 |
> |pineapple|6 |
> +-+---+{noformat}
> And users is:
> {noformat}
> +---+---+
> |name |id |
> +---+---+
> |andy |1 |
> |alice |2 |
> |mike |3 |
> |mariah |4 |
> |eleanor|5 |
> |matthew|6 |
> +---+---+ {noformat}
>  
> Results are incorrect:
> {noformat}
> +-+
> |name |
> +-+
> |candy |
> |pizza |
> |chocolate|
> |cinnamon |
> |pineapple|
> |milk |
> |candy |
> |pizza |
> |chocolate|
> |cinnamon |
> |pineapple|
> |milk |
> +-+{noformat}
>  
> This is the plan explained:
>  
> {noformat}
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('name, None)]
> +- AnalysisBarrier
>  +- Union
>  :- Aggregate [name#0], [name#0, count(1) AS count#8L]
>  : +- SubqueryAlias products
>  : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], 
> [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6]))
>  +- Aggregate [name#4], [name#4, count(1) AS count#12L]
>  +- SubqueryAlias users
>  +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], 
> [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6]))
> == Analyzed Logical Plan ==
> name: string
> Project [name#0]
> +- Union
>  :- Aggregate [name#0], [name#0, count(1) AS count#8L]
>  : +- SubqueryAlias products
>  : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], 
> [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6]))
>  +- Aggregate [name#4], [name#4, count(1) AS count#12L]
>  +- SubqueryAlias users
>  +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], 
> [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6]))
> == Optimized Logical Plan ==
> Union
> :- Aggregate [name#0], [name#0]
> : +- Project [name#0]
> : +- DataSourceV2Relation [name#0, id#1], DefaultReader(List([candy,1], 
> [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6]))
> +- Aggregate [name#4], [name#4]
>  +- Project [name#4]
>  +- DataSourceV2Relation [name#4, id#5], DefaultReader(List([andy,1], 
> [alice,2], [mike,3], [mariah,4], [eleanor,5], [matthew,6]))
> == Physical Plan ==
> Union
> :- *(2) HashAggregate(keys=[name#0], functions=[], output=[name#0])
> : +- Exchange hashpartitioning(name#0, 200)
> : +- *(1) HashAggregate(keys=[name#0], functions=[], output=[name#0])
> : +- *(1) Project [name#0]
> : +- *(1) DataSourceV2Scan [name#0, id#1], DefaultReader(List([candy,1], 
> [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6]))
> +- *(4) HashAggregate(keys=[name#4], functions=[], output=[name#4])
>  +- ReusedExchange [name#4], Exchange hashpartitioning(name#0, 200)
> {noformat}
>  
>  
> In the physical plan, the first exchange is reused, but it shouldn't be 
> because both sources are not the same.
>  
> {noformat}
> == Physical Plan ==
> Union
> :- *(2) HashAggregate(keys=[name#0], functions=[], output=[name#0])
> : +- Exchange hashpartitioning(name#0, 200)
> : +- *(1) HashAggregate(keys=[name#0], functions=[], output=[name#0])
> : +- *(1) Project [name#0]
> : +- *(1) DataSourceV2Scan [name#0, id#1], DefaultReader(List([candy,1], 
> [chocolate,2], [milk,3], [cinnamon,4], [pizza,5], [pineapple,6]))
> +- *(4) HashAggregate(keys=[name#4], functions=[], output=[name#4])
>  +- ReusedExchange [name#4], Exchange hashpartitioning(name#0, 200){noformat}
>  
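One way to confirm that the ReuseExchange rule is the culprit is to disable exchange reuse and re-run the query; a hedged sketch (spark.sql.exchange.reuse is an internal Spark SQL conf, so treat this as a diagnostic aid rather than a fix):

{code:scala}
// Disable reuse of exchanges so each DataSourceV2 scan keeps its own shuffle.
spark.conf.set("spark.sql.exchange.reuse", "false")

val products = spark.sql("SELECT name, COUNT(*) as count FROM products GROUP BY name")
val users = spark.sql("SELECT name, COUNT(*) as count FROM users GROUP BY name")

// With reuse disabled, the union should show the user names as well,
// instead of the product names twice.
products.union(users).select("name").show(50, truncate = false)
{code}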

[jira] [Commented] (SPARK-29561) Large Case Statement Code Generation OOM

2019-10-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960996#comment-16960996
 ] 

Hyukjin Kwon commented on SPARK-29561:
--

Seems like it simply ran out of memory. Does it work when you increase the 
memory?

> Large Case Statement Code Generation OOM
> 
>
> Key: SPARK-29561
> URL: https://issues.apache.org/jira/browse/SPARK-29561
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Michael Chen
>Priority: Major
> Attachments: apacheSparkCase.sql
>
>
> Spark Configuration
> spark.driver.memory = 1g
>  spark.master = "local"
>  spark.deploy.mode = "client"
> Try to execute a case statement with 3000+ branches. Added sql statement as 
> attachment
>  Spark runs for a while before it OOM
> {noformat}
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>   at 
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:182)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1320)
>   at 
> org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:178)
>   at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:73)
> 19/10/22 16:19:54 ERROR FileFormatWriter: Aborting job null.
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>   at java.util.HashMap.newNode(HashMap.java:1750)
>   at java.util.HashMap.putVal(HashMap.java:631)
>   at java.util.HashMap.putMapEntries(HashMap.java:515)
>   at java.util.HashMap.putAll(HashMap.java:785)
>   at 
> org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3345)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:212)
>   at 
> org.codehaus.janino.UnitCompiler$8.visitLocalVariableDeclarationStatement(UnitCompiler.java:3230)
>   at 
> org.codehaus.janino.UnitCompiler$8.visitLocalVariableDeclarationStatement(UnitCompiler.java:3198)
>   at 
> org.codehaus.janino.Java$LocalVariableDeclarationStatement.accept(Java.java:3351)
>   at 
> org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3197)
>   at 
> org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3254)
>   at org.codehaus.janino.UnitCompiler.access$3900(UnitCompiler.java:212)
>   at org.codehaus.janino.UnitCompiler$8.visitBlock(UnitCompiler.java:3216)
>   at org.codehaus.janino.UnitCompiler$8.visitBlock(UnitCompiler.java:3198)
>   at org.codehaus.janino.Java$Block.accept(Java.java:2756)
>   at 
> org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3197)
>   at 
> org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3260)
>   at org.codehaus.janino.UnitCompiler.access$4000(UnitCompiler.java:212)
>   at 
> org.codehaus.janino.UnitCompiler$8.visitDoStatement(UnitCompiler.java:3217)
>   at 
> org.codehaus.janino.UnitCompiler$8.visitDoStatement(UnitCompiler.java:3198)
>   at org.codehaus.janino.Java$DoStatement.accept(Java.java:3304)
>   at 
> org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3197)
>   at 
> org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3186)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3009)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:799)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:958)
>   at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:212)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:393)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:385)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1286)
> 19/10/22 16:19:54 ERROR Utils: throw uncaught fatal error in thread Spark 
> Context Cleaner
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>   at 
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:182)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1320)
>   at 
> org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:178)
>   at 
> org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:73){noformat}
>  Generated code looks like
> {noformat}
> /* 029 */   private void project_doConsume(InternalRow scan_row, UTF8String 
> 
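For what it's worth, a hedged sketch of a session-level mitigation to try alongside more driver memory (spark.sql.codegen.wholeStage is a standard Spark SQL setting; whether it avoids this particular OOM is untested):

{code:scala}
// spark.driver.memory has to be set before the driver JVM starts
// (e.g. spark-submit --driver-memory 4g), so it is not changed here.
// Falling back from whole-stage codegen avoids generating one huge Java
// method for the 3000+ branch CASE expression.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
{code}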

[jira] [Commented] (SPARK-29573) Spark should work as PostgreSQL when using + Operator

2019-10-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960995#comment-16960995
 ] 

Hyukjin Kwon commented on SPARK-29573:
--

[~abhishek.akg] can you please use a \{code\} ... \{code\} block?

> Spark should work as PostgreSQL when using + Operator
> -
>
> Key: SPARK-29573
> URL: https://issues.apache.org/jira/browse/SPARK-29573
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> Spark and PostgreSQL results differ when concatenating, as below:
> Spark: Gives a NULL result
> {code}
> 0: jdbc:hive2://10.18.19.208:23040/default> select * from emp12;
> +-+-+
> | id | name |
> +-+-+
> | 20 | test |
> | 10 | number |
> +-+-+
> 2 rows selected (3.683 seconds)
> 0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as 
> address from emp12;
> +-+--+
> | ID | address |
> +-+--+
> | 20 | NULL |
> | 10 | NULL |
> +-+--+
> 2 rows selected (0.649 seconds)
> 0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as 
> address from emp12;
> +-+--+
> | ID | address |
> +-+--+
> | 20 | NULL |
> | 10 | NULL |
> +-+--+
> 2 rows selected (0.406 seconds)
> 0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+','+name as 
> address from emp12;
> +-+--+
> | ID | address |
> +-+--+
> | 20 | NULL |
> | 10 | NULL |
> +-+--+
> {code}
>  
> PostgreSQL: Throws an error saying the operation is not supported
> {code}
> create table emp12(id int,name varchar(255));
> insert into emp12 values(10,'number');
> insert into emp12 values(20,'test');
> select id as ID, id+','+name as address from emp12;
> output: invalid input syntax for integer: ","
> create table emp12(id int,name varchar(255));
> insert into emp12 values(10,'number');
> insert into emp12 values(20,'test');
> select id as ID, id+name as address from emp12;
> Output: 42883: operator does not exist: integer + character varying
> {code}
>  
> It should throw Error in Spark if it is not supported.
>  
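For comparison, a small sketch of the concatenation forms Spark SQL already defines for mixed types (concat and the || operator cast the int to string instead of returning NULL); this only illustrates current alternatives, not the behaviour change proposed above:

{code:scala}
// concat() and || both accept the int column and cast it to string,
// e.g. producing "20,test" and "10,number" for the rows above.
spark.sql("select id, concat(id, ',', name) as address from emp12").show()
spark.sql("select id, id || ',' || name as address from emp12").show()
{code}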



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29573) Spark should work as PostgreSQL when using + Operator

2019-10-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29573:
-
Description: 
Spark and PostgreSQL results differ when concatenating, as below:

Spark: Gives a NULL result 

{code}
0: jdbc:hive2://10.18.19.208:23040/default> select * from emp12;
+-+-+
| id | name |
+-+-+
| 20 | test |
| 10 | number |
+-+-+
2 rows selected (3.683 seconds)

0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as address 
from emp12;
+-+--+
| ID | address |
+-+--+
| 20 | NULL |
| 10 | NULL |
+-+--+
2 rows selected (0.649 seconds)

0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as address 
from emp12;
+-+--+
| ID | address |
+-+--+
| 20 | NULL |
| 10 | NULL |
+-+--+
2 rows selected (0.406 seconds)

0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+','+name as 
address from emp12;
+-+--+
| ID | address |
+-+--+
| 20 | NULL |
| 10 | NULL |
+-+--+
{code}

 

PostgreSQL: Throws an error saying the operation is not supported

{code}
create table emp12(id int,name varchar(255));
insert into emp12 values(10,'number');
insert into emp12 values(20,'test');
select id as ID, id+','+name as address from emp12;

output: invalid input syntax for integer: ","

create table emp12(id int,name varchar(255));
insert into emp12 values(10,'number');
insert into emp12 values(20,'test');
select id as ID, id+name as address from emp12;

Output: 42883: operator does not exist: integer + character varying
{code}

 

It should throw Error in Spark if it is not supported.

 

  was:
Spark and PostgreSQL result is different when concatenating as below :

Spark : Giving NULL result 

0: jdbc:hive2://10.18.19.208:23040/default> select * from emp12;
+-+-+
| id | name |
+-+-+
| 20 | test |
| 10 | number |
+-+-+
2 rows selected (3.683 seconds)
0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as address 
from emp12;
+-+--+
| ID | address |
+-+--+
| 20 | NULL |
| 10 | NULL |
+-+--+
2 rows selected (0.649 seconds)
0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as address 
from emp12;
+-+--+
| ID | address |
+-+--+
| 20 | NULL |
| 10 | NULL |
+-+--+
2 rows selected (0.406 seconds)
0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+','+name as 
address from emp12;
+-+--+
| ID | address |
+-+--+
| 20 | NULL |
| 10 | NULL |
+-+--+

 

PostgreSQL: Saying throwing Error saying not supported

create table emp12(id int,name varchar(255));
insert into emp12 values(10,'number');
insert into emp12 values(20,'test');
select id as ID, id+','+name as address from emp12;

output: invalid input syntax for integer: ","

create table emp12(id int,name varchar(255));
insert into emp12 values(10,'number');
insert into emp12 values(20,'test');
select id as ID, id+name as address from emp12;

Output: 42883: operator does not exist: integer + character varying

 

It should throw Error in Spark if it is not supported.

 


> Spark should work as PostgreSQL when using + Operator
> -
>
> Key: SPARK-29573
> URL: https://issues.apache.org/jira/browse/SPARK-29573
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> Spark and PostgreSQL results differ when concatenating, as below:
> Spark: Gives a NULL result
> {code}
> 0: jdbc:hive2://10.18.19.208:23040/default> select * from emp12;
> +-+-+
> | id | name |
> +-+-+
> | 20 | test |
> | 10 | number |
> +-+-+
> 2 rows selected (3.683 seconds)
> 0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as 
> address from emp12;
> +-+--+
> | ID | address |
> +-+--+
> | 20 | NULL |
> | 10 | NULL |
> +-+--+
> 2 rows selected (0.649 seconds)
> 0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+name as 
> address from emp12;
> +-+--+
> | ID | address |
> +-+--+
> | 20 | NULL |
> | 10 | NULL |
> +-+--+
> 2 rows selected (0.406 seconds)
> 0: jdbc:hive2://10.18.19.208:23040/default> select id as ID, id+','+name as 
> address from emp12;
> +-+--+
> | ID | address |
> +-+--+
> | 20 | NULL |
> | 10 | NULL |
> +-+--+
> {code}
>  
> PostgreSQL: Throws an error saying the operation is not supported
> {code}
> create table emp12(id int,name varchar(255));
> insert into emp12 values(10,'number');
> insert into emp12 values(20,'test');
> select id as ID, 

[jira] [Updated] (SPARK-29575) from_json can produce nulls for fields which are marked as non-nullable

2019-10-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29575:
-
Component/s: SQL

> from_json can produce nulls for fields which are marked as non-nullable
> ---
>
> Key: SPARK-29575
> URL: https://issues.apache.org/jira/browse/SPARK-29575
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Victor Lopez
>Priority: Major
>
> I believe this issue was resolved elsewhere 
> (https://issues.apache.org/jira/browse/SPARK-23173), though for Pyspark this 
> bug seems to still be there.
> The issue appears when using {{from_json}} to parse a column in a Spark 
> dataframe. It seems like {{from_json}} ignores whether the schema provided 
> has any {{nullable:False}} property.
> {code:java}
> schema = T.StructType().add(T.StructField('id', T.LongType(), 
> nullable=False)).add(T.StructField('name', T.StringType(), nullable=False))
> data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': 
> 'jane'})}]
> df = spark.read.json(sc.parallelize(data))
> df.withColumn("details", F.from_json("user", 
> schema)).select("details.*").show()
> {code}
>  
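Until the nullability is honoured, a hedged Scala sketch of how a caller can at least detect the violation after parsing (df, the schema, and the column names mirror the PySpark snippet above; this is a workaround, not the proposed fix):

{code:scala}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("id", LongType, nullable = false)
  .add("name", StringType, nullable = false)

val parsed = df.withColumn("details", from_json(col("user"), schema))

// from_json hands the struct back with nullable fields, so enforce the
// constraint manually by flagging rows where a required field is null.
val violations = parsed.filter(col("details.id").isNull || col("details.name").isNull)
violations.show()
{code}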



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29590) Support hiding table in JDBC/ODBC server page in WebUI

2019-10-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960992#comment-16960992
 ] 

Hyukjin Kwon commented on SPARK-29590:
--

[~shahid] what does this JIRA mean?

> Support hiding table in JDBC/ODBC server page in WebUI
> --
>
> Key: SPARK-29590
> URL: https://issues.apache.org/jira/browse/SPARK-29590
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.4.4, 3.0.0
>Reporter: shahid
>Priority: Minor
>
> Support hiding table in JDBC/ODBC server page in WebUI



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29598) Support search option for statistics in JDBC/ODBC server page

2019-10-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960991#comment-16960991
 ] 

Hyukjin Kwon commented on SPARK-29598:
--

[~jobitmathew] can you please elaborate what this JIRA means?

> Support search option for statistics in JDBC/ODBC server page
> -
>
> Key: SPARK-29598
> URL: https://issues.apache.org/jira/browse/SPARK-29598
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.4.4, 3.0.0
>Reporter: jobit mathew
>Priority: Minor
>
> Support search option for statistics in JDBC/ODBC server page



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29600) array_contains built in function is not backward compatible in 3.0

2019-10-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960988#comment-16960988
 ] 

Hyukjin Kwon commented on SPARK-29600:
--

Please go ahead, but please don't just copy and paste. Narrow down the problem 
with analysis; otherwise no one can investigate.

> array_contains built in function is not backward compatible in 3.0
> --
>
> Key: SPARK-29600
> URL: https://issues.apache.org/jira/browse/SPARK-29600
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); throws an 
> exception in 3.0, whereas in 2.3.2 it works fine.
> Spark 3.0 output:
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
> array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2);
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 
> 'array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1BD AS DECIMAL(13,3)), 
> CAST(0.2BD AS DECIMAL(13,3)), CAST(0.3BD AS DECIMAL(13,3)), CAST(0.5BD AS 
> DECIMAL(13,3)), CAST(0.02BD AS DECIMAL(13,3)), CAST(0.033BD AS 
> DECIMAL(13,3))), 0.2BD)' due to data type mismatch: Input to function 
> array_contains should have been array followed by a value with same element 
> type, but it's [array, decimal(1,1)].; line 1 pos 7;
> 'Project [unresolvedalias(array_contains(array(cast(0 as decimal(13,3)), 
> cast(0.1 as decimal(13,3)), cast(0.2 as decimal(13,3)), cast(0.3 as 
> decimal(13,3)), cast(0.5 as decimal(13,3)), cast(0.02 as decimal(13,3)), 
> cast(0.033 as decimal(13,3))), 0.2), None)]
> +- OneRowRelation (state=,code=0)
> 0: jdbc:hive2://10.18.19.208:23040/default> ne 1 pos 7;
> Error: org.apache.spark.sql.catalyst.parser.ParseException:
>  
> mismatched input 'ne' expecting \{'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 
> 'CLEAR', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 
> 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 
> 'LOCK', 'MAP', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 
> 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 
> 'UNLOCK', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)
> == SQL ==
> ne 1 pos 7
> ^^^ (state=,code=0)
>  
> Spark 2.3.2 output
> 0: jdbc:hive2://10.18.18.214:23040/default> SELECT 
> array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2);
> +-+--+
> | array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1 AS DECIMAL(13,3)), 
> CAST(0.2 AS DECIMAL(13,3)), CAST(0.3 AS DECIMAL(13,3)), CAST(0.5 AS 
> DECIMAL(13,3)), CAST(0.02 AS DECIMAL(13,3)), CAST(0.033 AS DECIMAL(13,3))), 
> CAST(0.2 AS DECIMAL(13,3))) |
> +-+--+
> | true |
> +-+--+
> 1 row selected (0.18 seconds)
>  
>  
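As a hedged workaround sketch while the compatibility question is open: casting the searched value to the array's element type should satisfy the stricter 3.0 analyzer (untested against the exact build in this report):

{code:scala}
// Make both sides decimal(13,3) explicitly so array_contains resolves.
spark.sql(
  "SELECT array_contains(array(0, 0.1, 0.2, 0.3, 0.5, 0.02, 0.033), CAST(0.2 AS DECIMAL(13,3)))"
).show()
{code}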



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29601) JDBC ODBC Tab Statement column should be provided ellipsis for bigger SQL statement

2019-10-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960987#comment-16960987
 ] 

Hyukjin Kwon commented on SPARK-29601:
--

Please go ahead, but [~abhishek.akg], can you please not copy and paste the 
full query? It's impossible to reproduce and debug.

> JDBC ODBC Tab Statement column should be provided ellipsis for bigger SQL 
> statement
> ---
>
> Key: SPARK-29601
> URL: https://issues.apache.org/jira/browse/SPARK-29601
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> The Statement column in the JDBC/ODBC page shows the whole SQL statement, so the 
> page size increases.
> Suppose a user submits TPCDS queries; the page then displays the whole query under 
> Statement, and the user experience is not good.
> Expected:
> It should display an ellipsis (...), and on clicking the statement it should expand 
> and display the whole SQL statement.
> e.g.
> SELECT * FROM (SELECT count(*) h8_30_to_9 FROM store_sales, 
> household_demographics, time_dim, store WHERE ss_sold_time_sk = 
> time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND 
> ss_store_sk = s_store_sk AND time_dim.t_hour = 8 AND time_dim.t_minute >= 30 
> AND ((household_demographics.hd_dep_count = 4 AND 
> household_demographics.hd_vehicle_count <= 4 + 2) OR 
> (household_demographics.hd_dep_count = 2 AND 
> household_demographics.hd_vehicle_count <= 2 + 2) OR 
> (household_demographics.hd_dep_count = 0 AND 
> household_demographics.hd_vehicle_count <= 0 + 2)) AND store.s_store_name = 
> 'ese') s1, (SELECT count(*) h9_to_9_30 FROM store_sales, 
> household_demographics, time_dim, store WHERE ss_sold_time_sk = 
> time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND 
> ss_store_sk = s_store_sk AND time_dim.t_hour = 9 AND time_dim.t_minute < 30 
> AND ((household_demographics.hd_dep_count = 4 AND 
> household_demographics.hd_vehicle_count <= 4 + 2) OR 
> (household_demographics.hd_dep_count = 2 AND 
> household_demographics.hd_vehicle_count <= 2 + 2) OR 
> (household_demographics.hd_dep_count = 0 AND 
> household_demographics.hd_vehicle_count <= 0 + 2)) AND store.s_store_name = 
> 'ese') s2, (SELECT count(*) h9_30_to_10 FROM store_sales, 
> household_demographics, time_dim, store WHERE ss_sold_time_sk = 
> time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND 
> ss_store_sk = s_store_sk AND time_dim.t_hour = 9 AND time_dim.t_minute >= 30 
> AND ((household_demographics.hd_dep_count = 4 AND 
> household_demographics.hd_vehicle_count <= 4 + 2) OR 
> (household_demographics.hd_dep_count = 2 AND 
> household_demographics.hd_vehicle_count <= 2 + 2) OR 
> (household_demographics.hd_dep_count = 0 AND 
> household_demographics.hd_vehicle_count <= 0 + 2)) AND store.s_store_name = 
> 'ese') s3,(SELECT count(*) h10_to_10_30 FROM store_sales, 
> household_demographics, time_dim, store WHERE ss_sold_time_sk = 
> time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND 
> ss_store_sk = s_store_sk AND time_dim.t_hour = 10 AND time_dim.t_minute < 30 
> AND ((household_demographics.hd_dep_count = 4 AND 
> household_demographics.hd_vehicle_count <= 4 + 2) OR 
> (household_demographics.hd_dep_count = 2 AND 
> household_demographics.hd_vehicle_count <= 2 + 2) OR 
> (household_demographics.hd_dep_count = 0 AND 
> household_demographics.hd_vehicle_count <= 0 + 2)) AND store.s_store_name = 
> 'ese') s4,(SELECT count(*) h10_30_to_11 FROM 
> store_sales,household_demographics, time_dim, store WHERE ss_sold_time_sk = 
> time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND 
> ss_store_sk = s_store_sk AND time_dim.t_hour = 10 AND time_dim.t_minute >= 30 
> AND ((household_demographics.hd_dep_count = 4 AND 
> household_demographics.hd_vehicle_count <= 4 + 2) OR 
> (household_demographics.hd_dep_count = 2 AND 
> household_demographics.hd_vehicle_count <= 2 + 2) OR 
> (household_demographics.hd_dep_count = 0 AND 
> household_demographics.hd_vehicle_count <= 0 + 2)) AND store.s_store_name = 
> 'ese') s5,(SELECT count(*) h11_to_11_30 FROM store_sales, 
> household_demographics, time_dim, store WHERE ss_sold_time_sk = 
> time_dim.t_time_sk AND ss_hdemo_sk = household_demographics.hd_demo_sk AND 
> ss_store_sk = s_store_sk AND time_dim.t_hour = 11 AND time_dim.t_minute < 30 
> AND ((household_demographics.hd_dep_count = 4 AND 
> household_demographics.hd_vehicle_count <= 4 + 2) OR 
> (household_demographics.hd_dep_count = 2 AND 
> household_demographics.hd_vehicle_count <= 2 + 2) OR 
> (household_demographics.hd_dep_count = 0 AND 
> household_demographics.hd_vehicle_count <= 0 + 2)) AND store.s_store_name = 
> 'ese') s6,(SELECT count(*) h11_30_to_12 FROM 

[jira] [Resolved] (SPARK-29602) How does the spark from_json json and dataframe transform ignore the case of the json key

2019-10-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29602.
--
Resolution: Invalid

Please ask questions on Stack Overflow or the mailing list.

> How does the spark from_json json and dataframe transform ignore the case of 
> the json key
> -
>
> Key: SPARK-29602
> URL: https://issues.apache.org/jira/browse/SPARK-29602
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: ruiliang
>Priority: Major
>  Labels: spark-sql
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> How does the spark from_json json and dataframe transform ignore the case of 
> the json key
> code 
> {code:java}
> def main(args: Array[String]): Unit = {
>   val spark = SparkSession.builder().master("local[*]").
> enableHiveSupport().getOrCreate()
>   //spark.sqlContext.setConf("spark.sql.caseSensitive", "false")
>   import spark.implicits._
>   // hive table data is lower-cased automatically when saving
>   val hivetable =
> 
> """{"deliverysystype":"dms","orderid":"B0001-N103-000-005882-RL3AI2RWCP","storeid":103,"timestamp":1571587522000,"":"dms"}"""
>   val hiveDF = Seq(hivetable).toDF("msg")
>   val rdd = hiveDF.rdd.map(_.getString(0))
>   val jsonDataDF = spark.read.json(rdd.toDS())
>   jsonDataDF.show(false)
>   
> //++---++---+-+
>   //||deliverysystype|orderid |storeid|timestamp  
>   |
>   
> //++---++---+-+
>   //|dms |dms|B0001-N103-000-005882-RL3AI2RWCP|103
> |1571587522000|
>   
> //++---++---+-+
>   val jsonstr =
>   
> """{"data":{"deliverySysType":"dms","orderId":"B0001-N103-000-005882-RL3AI2RWCP","storeId":103,"timestamp":1571587522000},"accessKey":"f9d069861dfb1678","actionName":"candao.rider.getDeliveryInfo","sign":"fa0239c75e065cf43d0a4040665578ba"
>  }"""
>   val jsonStrDF = Seq(jsonstr).toDF("msg")
>   // convert the JSON data column: action_name -> actionName
>   jsonStrDF.show(false)
>   val structSeqSchme = StructType(Seq(StructField("data", jsonDataDF.schema, 
> true),
> StructField("accessKey", StringType, true), //这里应该 accessKey
> StructField("actionName", StringType, true),
> StructField("columnNameOfCorruptRecord", StringType, true)
>   ))
>   // hive column names are lower case while the JSON data keys are mixed case, 
> so the values are not picked up
>   val mapOption = Map("allowBackslashEscapingAnyCharacter" -> "true", 
> "allowUnquotedControlChars" -> "true", "allowSingleQuotes" -> "true")
>   //I'm not doing anything here, but I don't know how to set a value, right?
>   val newDF = jsonStrDF.withColumn("data_col", from_json(col("msg"), 
> structSeqSchme, mapOption))
>   newDF.show(false)
>   newDF.printSchema()
>   newDF.select($"data_col.accessKey", $"data_col.actionName", 
> $"data_col.data.*", $"data_col.columnNameOfCorruptRecord").show(false)
>   //Lowercase columns do not fetch data. How do you make it ignore lowercase 
> columns?  deliverysystype,storeid-> null
>   
> //++++---+---+---+-+-+
>   //|accessKey   |actionName  
> ||deliverysystype|orderid|storeid|timestamp|columnNameOfCorruptRecord|
>   
> //++++---+---+---+-+-+
>   //|f9d069861dfb1678|candao.rider.getDeliveryInfo|null|null   |null  
>  |null   |1571587522000|null |
>   
> //++++---+---+---+-+-+
> }
> {code}
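One hedged workaround for the question above: since the lower-cased Hive column names are not found among the JSON keys, declare the from_json schema with the keys exactly as they appear in the payload and rename afterwards if needed (the field names below follow the jsonstr sample; this is a sketch, not the only option):

{code:scala}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Match the JSON's own casing (deliverySysType, orderId, storeId) instead of
// the lower-cased Hive column names, so the values are actually picked up.
val dataSchema = new StructType()
  .add("deliverySysType", StringType)
  .add("orderId", StringType)
  .add("storeId", LongType)
  .add("timestamp", LongType)

val schema = new StructType()
  .add("data", dataSchema)
  .add("accessKey", StringType)
  .add("actionName", StringType)

val parsed = jsonStrDF.withColumn("data_col", from_json(col("msg"), schema))
parsed.select(col("data_col.accessKey"), col("data_col.data.deliverySysType"),
  col("data_col.data.storeId")).show(false)
{code}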



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29610) Keys with Null values are discarded when using to_json function

2019-10-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29610.
--
Resolution: Duplicate

> Keys with Null values are discarded when using to_json function
> ---
>
> Key: SPARK-29610
> URL: https://issues.apache.org/jira/browse/SPARK-29610
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.4
>Reporter: Jonathan
>Priority: Major
>
> When calling to_json on a Struct if a key has Null as a value then the key is 
> thrown away.
> {code:java}
> import pyspark
> import pyspark.sql.functions as F
> l = [("a", "foo"), ("b", None)]
> df = spark.createDataFrame(l, ["id", "data"])
> (
>   df.select(F.struct("*").alias("payload"))
> .withColumn("payload", 
>   F.to_json(F.col("payload"))
> ).select("payload")
> .show()
> ){code}
> Produces the following output:
> {noformat}
> ++
> | payload|
> ++
> |{"id":"a","data":...|
> |  {"id":"b"}|
> ++{noformat}
> The `data` key in the second row has just been silently deleted.
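For reference, a hedged Scala sketch of the requested behaviour, assuming the ignoreNullFields JSON option that newer Spark releases expose for to_json (it is not available in the 2.4.4 build this report targets, so treat the option name as something to verify):

{code:scala}
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(("a", Some("foo")), ("b", None)).toDF("id", "data")

// ignoreNullFields = false asks the JSON generator to keep null-valued keys
// instead of silently dropping them.
df.select(to_json(struct(col("id"), col("data")), Map("ignoreNullFields" -> "false")).alias("payload"))
  .show(truncate = false)
{code}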



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28158) Hive UDFs supports UDT type

2019-10-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28158.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 24961
[https://github.com/apache/spark/pull/24961]

> Hive UDFs supports UDT type
> ---
>
> Key: SPARK-28158
> URL: https://issues.apache.org/jira/browse/SPARK-28158
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Genmao Yu
>Assignee: Genmao Yu
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28158) Hive UDFs supports UDT type

2019-10-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28158:


Assignee: Genmao Yu

> Hive UDFs supports UDT type
> ---
>
> Key: SPARK-28158
> URL: https://issues.apache.org/jira/browse/SPARK-28158
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Genmao Yu
>Assignee: Genmao Yu
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29528) Upgrade scala-maven-plugin to 4.2.4 for Scala 2.13.1

2019-10-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29528:
-
Fix Version/s: (was: 3.0.0)

> Upgrade scala-maven-plugin to 4.2.4 for Scala 2.13.1
> 
>
> Key: SPARK-29528
> URL: https://issues.apache.org/jira/browse/SPARK-29528
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Scala 2.13.1 seems to break the binary compatibility.
> We need to upgrade `scala-maven-plugin` to bring the following fixes for 
> the latest Scala 2.13.1. 
> - https://github.com/davidB/scala-maven-plugin/issues/363
> - https://github.com/sbt/zinc/issues/698



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29528) Upgrade scala-maven-plugin to 4.2.4 for Scala 2.13.1

2019-10-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960958#comment-16960958
 ] 

Hyukjin Kwon commented on SPARK-29528:
--

Reverted at 
https://github.com/apache/spark/commit/a8d5134981ef176965101e96beba71a9979a554f

> Upgrade scala-maven-plugin to 4.2.4 for Scala 2.13.1
> 
>
> Key: SPARK-29528
> URL: https://issues.apache.org/jira/browse/SPARK-29528
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Scala 2.13.1 seems to break the binary compatibility.
> We need to upgrade `scala-maven-plugin` to bring the following fixes for 
> the latest Scala 2.13.1. 
> - https://github.com/davidB/scala-maven-plugin/issues/363
> - https://github.com/sbt/zinc/issues/698



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-29528) Upgrade scala-maven-plugin to 4.2.4 for Scala 2.13.1

2019-10-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-29528:
--

> Upgrade scala-maven-plugin to 4.2.4 for Scala 2.13.1
> 
>
> Key: SPARK-29528
> URL: https://issues.apache.org/jira/browse/SPARK-29528
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> Scala 2.13.1 seems to break the binary compatibility.
> We need to upgrade `scala-maven-plugin` to bring the following fixes for 
> the latest Scala 2.13.1. 
> - https://github.com/davidB/scala-maven-plugin/issues/363
> - https://github.com/sbt/zinc/issues/698



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29619) Improve the exception message when reading the daemon port and add retry times.

2019-10-28 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-29619:
---
Description: 
In a production environment, my pyspark application hit an exception, and its 
message is as below:
{code:java}
19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
 at 
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:204)
 at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122)
 at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95)
 at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
 at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
 at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
 at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
 at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
 at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
 at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
 at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
 at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:121)
 at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745){code}
 

At first, I thought many ports on the physical node were occupied by a large number 
of processes.

But I found the total number of ports in use was only 671.

 
{code:java}
[yarn@r1115 ~]$ netstat -a | wc -l
671
{code}
I checked the code of PythonWorkerFactory at line 204 and found:
{code:java}
daemon = pb.start()
val in = new DataInputStream(daemon.getInputStream)
 try {
 daemonPort = in.readInt()
 } catch {
 case _: EOFException =>
 throw new SparkException(s"No port number in $daemonModule's stdout")
 }
{code}
I added some code here:
{code:java}
logError("Meet EOFException, daemon is alive: ${daemon.isAlive()}")
logError("Exit value: ${daemon.exitValue()}")
{code}
Then I reproduced the exception, and its message is as below:
{code:java}
19/10/28 16:15:03 ERROR PythonWorkerFactory: Meet EOFException, daemon is 
alive: false
19/10/28 16:15:03 ERROR PythonWorkerFactory: Exit value: 139
19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
 at 
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:206)
 at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122)
 at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95)
 at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
 at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
 at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
 at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
 at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
 at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
 at 

[jira] [Created] (SPARK-29619) Improve the exception message when reading the daemon port and add retry times.

2019-10-28 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-29619:
--

 Summary: Improve the exception message when reading the daemon 
port and add retry times.
 Key: SPARK-29619
 URL: https://issues.apache.org/jira/browse/SPARK-29619
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: jiaan.geng


In a production environment, my PySpark application throws an exception whose 
message is as below:

 
{code:java}
19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
 at 
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:204)
 at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122)
 at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95)
 at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
 at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
 at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
 at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
 at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
 at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
 at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
 at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
 at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:121)
 at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745){code}
 

 

At first, I thought that many ports on the physical node were occupied by a large number 
of processes.

But I found that the total number of ports in use is only 671.

{code:java}
[yarn@r1115 ~]$ netstat -a | wc -l
671
{code}

I checked the code of PythonWorkerFactory at line 204 and found:

 
{code:java}
daemon = pb.start()
val in = new DataInputStream(daemon.getInputStream)
try {
  daemonPort = in.readInt()
} catch {
  case _: EOFException =>
    throw new SparkException(s"No port number in $daemonModule's stdout")
}
{code}
 

I added some code here:

 
{code:java}
logError("Meet EOFException, daemon is alive: ${daemon.isAlive()}")
logError("Exit value: ${daemon.exitValue()}")
{code}
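
With this instrumentation the reproduced run below reports {{daemon is alive: false}} and {{Exit value: 139}}; on Linux the JDK typically reports 128 + signal for a signal-terminated child, so 139 = 128 + 11 suggests the daemon was killed by SIGSEGV, and the missing port number is a symptom of the Python daemon crashing rather than of port exhaustion. A rough sketch of the improvement this ticket asks for (bounded retries plus the exit status in the exception message) could look like the following; the helper name, retry policy and message format are illustrative only and are not the actual Spark change:
{code:java}
import java.io.{DataInputStream, EOFException}

// Hypothetical sketch only: readDaemonPortWithRetry and its behaviour are
// illustrative; startProcess() stands in for pb.start() in the code quoted above.
object DaemonPortRetrySketch {
  def readDaemonPortWithRetry(startProcess: () => Process,
                              daemonModule: String,
                              maxRetries: Int = 3): (Process, Int) = {
    var lastError: Throwable = null
    for (attempt <- 1 to maxRetries) {
      val daemon = startProcess()
      val in = new DataInputStream(daemon.getInputStream)
      try {
        return (daemon, in.readInt()) // success: the daemon handle plus its port
      } catch {
        case eof: EOFException =>
          // The daemon died (or wrote nothing) before printing its port:
          // record why, so the final exception carries the exit status.
          val status =
            if (daemon.isAlive()) "daemon is alive but wrote no port"
            else s"daemon exited with code ${daemon.exitValue()}"
          lastError = new RuntimeException(
            s"No port number in $daemonModule's stdout (attempt $attempt/$maxRetries, $status)", eof)
      }
    }
    throw lastError
  }
}
{code}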
 

Then I reproduced the exception, and its message is as below:
{code:java}
19/10/28 16:15:03 ERROR PythonWorkerFactory: Meet EOFException, daemon is 
alive: false
19/10/28 16:15:03 ERROR PythonWorkerFactory: Exit value: 139
19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
 at 
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:206)
 at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122)
 at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95)
 at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
 at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
 at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
 at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
 at 

[jira] [Created] (SPARK-29618) remove V1_BATCH_WRITE table capability

2019-10-28 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-29618:
---

 Summary: remove V1_BATCH_WRITE table capability
 Key: SPARK-29618
 URL: https://issues.apache.org/jira/browse/SPARK-29618
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29580) Add kerberos debug messages for Kafka secure tests

2019-10-28 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960869#comment-16960869
 ] 

Gabor Somogyi commented on SPARK-29580:
---

[~dongjoon] thanks for taking care of this; ping me if it gets reproduced and I'll continue the 
analysis...
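
(For anyone following along: the "kerberos debug messages" in the title are usually turned on via standard JDK system properties. Which exact switches the change for this ticket uses is not shown here, so the snippet below is only a general illustration of how such debugging is typically enabled in a test JVM, set before the Kafka client performs its SASL/GSSAPI login.)
{code:java}
// General illustration, not necessarily what this ticket's patch does:
// standard JDK switches for Kerberos/JAAS debugging output.
System.setProperty("sun.security.krb5.debug", "true")     // KDC exchanges, ticket details
System.setProperty("sun.security.jgss.debug", "true")     // GSS-API layer
System.setProperty("java.security.debug", "logincontext") // JAAS login flow
{code}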

> Add kerberos debug messages for Kafka secure tests
> --
>
> Key: SPARK-29580
> URL: https://issues.apache.org/jira/browse/SPARK-29580
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.0.0
>
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/112562/testReport/org.apache.spark.sql.kafka010/KafkaDelegationTokenSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
> {code}
> sbt.ForkMain$ForkError: org.apache.kafka.common.KafkaException: Failed to 
> create new KafkaAdminClient
>   at 
> org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:407)
>   at 
> org.apache.kafka.clients.admin.AdminClient.create(AdminClient.java:55)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedKafkaServer(KafkaTestUtils.scala:227)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:249)
>   at 
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: sbt.ForkMain$ForkError: org.apache.kafka.common.KafkaException: 
> javax.security.auth.login.LoginException: Server not found in Kerberos 
> database (7) - Server not found in Kerberos database
>   at 
> org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:160)
>   at 
> org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:146)
>   at 
> org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:67)
>   at 
> org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:99)
>   at 
> org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:382)
>   ... 16 more
> Caused by: sbt.ForkMain$ForkError: javax.security.auth.login.LoginException: 
> Server not found in Kerberos database (7) - Server not found in Kerberos 
> database
>   at 
> com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:804)
>   at 
> com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
>   at 
> javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
>   at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
>   at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 
> javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
>   at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
>   at 
> org.apache.kafka.common.security.authenticator.AbstractLogin.login(AbstractLogin.java:60)
>   at 
> org.apache.kafka.common.security.kerberos.KerberosLogin.login(KerberosLogin.java:103)
>   at 
> org.apache.kafka.common.security.authenticator.LoginManager.(LoginManager.java:61)
>   at 
> org.apache.kafka.common.security.authenticator.LoginManager.acquireLoginManager(LoginManager.java:104)
>   at 
>