[jira] [Assigned] (SPARK-27653) Add max_by() / min_by() SQL aggregate functions

2019-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27653:


Assignee: (was: Apache Spark)

> Add max_by() / min_by() SQL aggregate functions
> ---
>
> Key: SPARK-27653
> URL: https://issues.apache.org/jira/browse/SPARK-27653
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Major
>
> It would be useful if Spark SQL supported the {{max_by()}} SQL aggregate 
> function. Quoting from the [Presto 
> docs|https://prestodb.github.io/docs/current/functions/aggregate.html#max_by]:
> {quote}max_by(x, y) → [same as x]
>  Returns the value of x associated with the maximum value of y over all input 
> values.
> {quote}
> {{min_by}} works similarly.
> Technically I can emulate this behavior using window functions but the 
> resulting syntax is much more verbose and non-intuitive compared to 
> {{max_by}} / {{min_by}}.
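> For illustration, a window-function emulation of {{max_by(x, y)}} per group might look 
> like the sketch below (assuming a DataFrame {{df}} with columns {{key}}, {{x}}, {{y}}; 
> the names are illustrative only):
> {code:scala}
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions.{col, row_number}
>
> // For each key, keep the value of x on the row with the largest y --
> // the value that max_by(x, y) would return directly.
> val w = Window.partitionBy(col("key")).orderBy(col("y").desc)
> val emulated = df
>   .withColumn("rn", row_number().over(w))
>   .where(col("rn") === 1)
>   .select(col("key"), col("x").as("max_by_x_y"))
> {code}
> With a native {{max_by}}, the same result would collapse to a single aggregate call.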






[jira] [Assigned] (SPARK-27653) Add max_by() / min_by() SQL aggregate functions

2019-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27653:


Assignee: Apache Spark

> Add max_by() / min_by() SQL aggregate functions
> ---
>
> Key: SPARK-27653
> URL: https://issues.apache.org/jira/browse/SPARK-27653
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Major
>
> It would be useful if Spark SQL supported the {{max_by()}} SQL aggregate 
> function. Quoting from the [Presto 
> docs|https://prestodb.github.io/docs/current/functions/aggregate.html#max_by]:
> {quote}max_by(x, y) → [same as x]
>  Returns the value of x associated with the maximum value of y over all input 
> values.
> {quote}
> {{min_by}} works similarly.
> Technically I can emulate this behavior using window functions but the 
> resulting syntax is much more verbose and non-intuitive compared to 
> {{max_by}} / {{min_by}}.






[jira] [Assigned] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames or Arrow batches

2019-05-08 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-26412:
-

Assignee: Weichen Xu

> Allow Pandas UDF to take an iterator of pd.DataFrames or Arrow batches
> --
>
> Key: SPARK-26412
> URL: https://issues.apache.org/jira/browse/SPARK-26412
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
>
> Pandas UDFs are the ideal connection between PySpark and DL model inference 
> workloads. However, the user needs to load the model file first to make 
> predictions. It is common to see models of size ~100MB or bigger. If the 
> Pandas UDF execution is limited to batch scope, the user needs to repeatedly 
> load the same model for every batch in the same Python worker process, which 
> is inefficient. I created this JIRA to discuss possible solutions.
> Essentially we need to support "start()" and "finish()" besides "apply". We 
> can either provide those interfaces or simply provide users with an iterator 
> of batches as pd.DataFrames or Arrow tables and let user code handle it.
> Another benefit: with an iterator interface and Python's asyncio, users can 
> flexibly implement data pipelining.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]






[jira] [Assigned] (SPARK-27658) Catalog API to load functions

2019-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27658:


Assignee: (was: Apache Spark)

> Catalog API to load functions
> -
>
> Key: SPARK-27658
> URL: https://issues.apache.org/jira/browse/SPARK-27658
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ryan Blue
>Priority: Major
>
> SPARK-24252 added an API that catalog plugins can implement to expose table 
> operations. Catalogs should also be able to provide function implementations 
> to Spark.
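> A purely hypothetical sketch of what such a mixin could look like (the names 
> {{FunctionCatalog}}, {{UnboundFunction}} and {{loadFunction}} are illustrative 
> assumptions, not an existing Spark API):
> {code:scala}
> // Illustrative only: a v2 catalog plugin that can also serve functions might
> // expose something along these lines, mirroring the table-catalog API added
> // in SPARK-24252.
> trait UnboundFunction {
>   def name: String
> }
>
> trait FunctionCatalog {
>   // Resolve a function by namespace and name, failing if it is unknown.
>   def loadFunction(namespace: Array[String], name: String): UnboundFunction
>
>   // Enumerate the function names available in a namespace.
>   def listFunctions(namespace: Array[String]): Array[String]
> }
> {code}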






[jira] [Assigned] (SPARK-27658) Catalog API to load functions

2019-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27658:


Assignee: Apache Spark

> Catalog API to load functions
> -
>
> Key: SPARK-27658
> URL: https://issues.apache.org/jira/browse/SPARK-27658
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ryan Blue
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-24252 added an API that catalog plugins can implement to expose table 
> operations. Catalogs should also be able to provide function implementations 
> to Spark.






[jira] [Assigned] (SPARK-27661) Add SupportsNamespaces interface for v2 catalogs

2019-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27661:


Assignee: (was: Apache Spark)

> Add SupportsNamespaces interface for v2 catalogs
> 
>
> Key: SPARK-27661
> URL: https://issues.apache.org/jira/browse/SPARK-27661
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ryan Blue
>Priority: Major
>
> Some catalogs support namespace operations, like creating or dropping 
> namespaces. The v2 API should have a way to expose these operations to Spark.
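> A hypothetical sketch of such an interface (names and signatures are assumptions 
> for illustration, not the final API):
> {code:scala}
> // Illustrative only: a namespace-aware mixin for v2 catalogs might expose
> // operations along these lines.
> trait SupportsNamespaces {
>   def listNamespaces(): Array[Array[String]]
>   def namespaceExists(namespace: Array[String]): Boolean
>   def createNamespace(namespace: Array[String], properties: Map[String, String]): Unit
>   def dropNamespace(namespace: Array[String]): Unit
> }
> {code}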






[jira] [Assigned] (SPARK-27661) Add SupportsNamespaces interface for v2 catalogs

2019-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27661:


Assignee: Apache Spark

> Add SupportsNamespaces interface for v2 catalogs
> 
>
> Key: SPARK-27661
> URL: https://issues.apache.org/jira/browse/SPARK-27661
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ryan Blue
>Assignee: Apache Spark
>Priority: Major
>
> Some catalogs support namespace operations, like creating or dropping 
> namespaces. The v2 API should have a way to expose these operations to Spark.






[jira] [Commented] (SPARK-27663) Task accomplished incompletely but marked as success

2019-05-08 Thread Fan Yunbo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836055#comment-16836055
 ] 

Fan Yunbo commented on SPARK-27663:
---

[~jerryshao], [~srowen]

Could you please take a look?

> Task accomplished incompletely but marked as success
> 
>
> Key: SPARK-27663
> URL: https://issues.apache.org/jira/browse/SPARK-27663
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Fan Yunbo
>Priority: Major
> Attachments: image-2019-05-09-11-10-04-602.png, incomplte-task-0.png, 
> incomplte-task-1.png, incomplte-task-2.png, reran-0.png, reran-1.png
>
>
> It happens when running SQL queries with Spark SQL.
> The task was accomplished only partially but marked as success, since there
> were no exceptions and no failed or killed tasks.
> When I checked the query result, it was missing about 4000 records.
> The history web UI shows that the task input size is 23.5 MB, but the executor
> log shows the split size is 326992763 bytes, about 300 MB.
> This task finished in 1 second, while the other tasks took about 15 seconds.
>  






[jira] [Comment Edited] (SPARK-27663) Task accomplished incompletely but marked as success

2019-05-08 Thread Fan Yunbo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836044#comment-16836044
 ] 

Fan Yunbo edited comment on SPARK-27663 at 5/9/19 3:21 AM:
---

The incomplete task's id is 17.0 in stage 98517.0.

!incomplte-task-1.png!

The input size is 23.5 MB, and the task finished in 1 s: !incomplte-task-2.png!

The log shows the input split size is about 300 MB:
{code:java}
Input split: 
hdfs://cqocdc/user/hive/warehouse/dw_user_useage_privilege_dt_mmdd/month_id=201904/day_id=20190422/17_0.snappy:0+326992763{code}
{code:java}
19/04/23 12:09:18 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 
6835988
19/04/23 12:09:18 INFO executor.Executor: Running task 17.0 in stage 98517.0 
(TID 6835988)
19/04/23 12:09:18 INFO broadcast.TorrentBroadcast: Started reading broadcast 
variable 173456
19/04/23 12:09:18 INFO memory.MemoryStore: Block broadcast_173456_piece0 stored 
as bytes in memory (estimated size 13.4 KB, free 15.2 GB)
19/04/23 12:09:18 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
173456 took 4 ms
19/04/23 12:09:18 INFO memory.MemoryStore: Block broadcast_173456 stored as 
values in memory (estimated size 30.3 KB, free 15.2 GB)
19/04/23 12:09:18 INFO rdd.HadoopRDD: Input split: 
hdfs://cqocdc/user/hive/warehouse/dw_user_useage_privilege_dt_mmdd/month_id=201904/day_id=20190422/17_0.snappy:0+326992763
19/04/23 12:09:18 INFO broadcast.TorrentBroadcast: Started reading broadcast 
variable 173452
19/04/23 12:09:18 INFO memory.MemoryStore: Block broadcast_173452_piece0 stored 
as bytes in memory (estimated size 30.8 KB, free 15.2 GB)
19/04/23 12:09:18 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
173452 took 3 ms
19/04/23 12:09:18 INFO memory.MemoryStore: Block broadcast_173452 stored as 
values in memory (estimated size 365.1 KB, free 15.3 GB)
19/04/23 12:09:18 INFO codegen.CodeGenerator: Code generated in 6.949728 ms
19/04/23 12:09:18 INFO codegen.CodeGenerator: Code generated in 20.909883 ms
19/04/23 12:09:18 INFO output.FileOutputCommitter: Saved output of task 
'attempt_20190423120856_98508_m_47_0' to 
hdfs://cqocdc/tmp/.staging/hive_hive_2019-04-23_12-08-56_154_3110404551071203558-1370/-ext-1/_temporary/0/task_20190423120856_98508_m_47
19/04/23 12:09:18 INFO mapred.SparkHadoopMapRedUtil: 
attempt_20190423120856_98508_m_47_0: Committed
19/04/23 12:09:18 INFO executor.Executor: Finished task 47.0 in stage 98508.0 
(TID 6835975). 3217 bytes result sent to driver
19/04/23 12:09:19 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 
TERM
19/04/23 12:09:19 INFO storage.DiskBlockManager: Shutdown hook called
19/04/23 12:09:19 INFO util.ShutdownHookManager: Shutdown hook called
19/04/23 12:09:19 INFO executor.Executor: Finished task 17.0 in stage 98517.0 
(TID 6835988). 3188 bytes result sent to driver
{code}
The file size and last modified time:

!image-2019-05-09-11-10-04-602.png!

The stage of the query total input is 14.9 G:

!incomplte-task-0.png!

 


[jira] [Resolved] (SPARK-27207) There exists a bug with SortBasedAggregator where merge()/update() operations get invoked on the aggregate buffer without calling initialize

2019-05-08 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27207.
-
Resolution: Not A Problem

> There exists a bug with SortBasedAggregator where merge()/update() operations 
> get invoked on the aggregate buffer without calling initialize
> 
>
> Key: SPARK-27207
> URL: https://issues.apache.org/jira/browse/SPARK-27207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> Normally, the aggregate operations invoked on an aggregation buffer 
> for User Defined Aggregate Functions (UDAF) follow the order 
> initialize(), update(), eval() OR initialize(), merge(), eval(). However, 
> after a certain threshold configurable by 
> spark.sql.objectHashAggregate.sortBased.fallbackThreshold is reached, 
> ObjectHashAggregate falls back to SortBasedAggregator which invokes the merge 
> or update operation without calling initialize() on the aggregate buffer.
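> For reference, a minimal UDAF written against the {{UserDefinedAggregateFunction}} 
> API shows where each lifecycle method fits (a sketch; the aggregate itself is just 
> a long sum):
> {code:scala}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
> import org.apache.spark.sql.types._
>
> class LongSum extends UserDefinedAggregateFunction {
>   override def inputSchema: StructType = new StructType().add("value", LongType)
>   override def bufferSchema: StructType = new StructType().add("sum", LongType)
>   override def dataType: DataType = LongType
>   override def deterministic: Boolean = true
>
>   // Must run before update()/merge() touch the buffer -- the step the
>   // sort-based fallback path skips according to this report.
>   override def initialize(buffer: MutableAggregationBuffer): Unit = {
>     buffer(0) = 0L
>   }
>
>   // Called once per input row.
>   override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
>     if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)
>   }
>
>   // Called to combine partial aggregation buffers.
>   override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
>     buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
>   }
>
>   // Produces the final result from the buffer.
>   override def evaluate(buffer: Row): Any = buffer.getLong(0)
> }
> {code}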






[jira] [Updated] (SPARK-27663) Task accomplished incompletely but marked as success

2019-05-08 Thread Fan Yunbo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Yunbo updated SPARK-27663:
--
Attachment: (was: reran-1.png)

> Task accomplished incompletely but marked as success
> 
>
> Key: SPARK-27663
> URL: https://issues.apache.org/jira/browse/SPARK-27663
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Fan Yunbo
>Priority: Major
> Attachments: image-2019-05-09-11-10-04-602.png, incomplte-task-0.png, 
> incomplte-task-1.png, incomplte-task-2.png, reran-0.png, reran-1.png
>
>
> It happens when running SQL queries with Spark SQL.
> The task was accomplished only partially but marked as success, since there
> were no exceptions and no failed or killed tasks.
> When I checked the query result, it was missing about 4000 records.
> The history web UI shows that the task input size is 23.5 MB, but the executor
> log shows the split size is 326992763 bytes, about 300 MB.
> This task finished in 1 second, while the other tasks took about 15 seconds.
>  






[jira] [Commented] (SPARK-27663) Task accomplished incompletely but marked as success

2019-05-08 Thread Fan Yunbo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836054#comment-16836054
 ] 

Fan Yunbo commented on SPARK-27663:
---

Two thoughts:
 # The executor shutdown hook was triggered while the task was still running; it 
ended the task but still marked it as success.
 # Some exceptions happened in the task, but the executor didn't catch them or 
simply ignored them.

 

> Task accomplished incompletely but marked as success
> 
>
> Key: SPARK-27663
> URL: https://issues.apache.org/jira/browse/SPARK-27663
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Fan Yunbo
>Priority: Major
> Attachments: image-2019-05-09-11-10-04-602.png, incomplte-task-0.png, 
> incomplte-task-1.png, incomplte-task-2.png, reran-0.png, reran-1.png
>
>
> It happens when running SQL queries with Spark SQL.
> The task was accomplished only partially but marked as success, since there
> were no exceptions and no failed or killed tasks.
> When I checked the query result, it was missing about 4000 records.
> The history web UI shows that the task input size is 23.5 MB, but the executor
> log shows the split size is 326992763 bytes, about 300 MB.
> This task finished in 1 second, while the other tasks took about 15 seconds.
>  






[jira] [Updated] (SPARK-27663) Task accomplished incompletely but marked as success

2019-05-08 Thread Fan Yunbo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Yunbo updated SPARK-27663:
--
Attachment: reran-1.png
reran-0.png

> Task accomplished incompletely but marked as success
> 
>
> Key: SPARK-27663
> URL: https://issues.apache.org/jira/browse/SPARK-27663
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Fan Yunbo
>Priority: Major
> Attachments: image-2019-05-09-11-10-04-602.png, incomplte-task-0.png, 
> incomplte-task-1.png, incomplte-task-2.png, reran-0.png, reran-1.png
>
>
> It happens when running SQL queries with Spark SQL.
> The task was accomplished only partially but marked as success, since there
> were no exceptions and no failed or killed tasks.
> When I checked the query result, it was missing about 4000 records.
> The history web UI shows that the task input size is 23.5 MB, but the executor
> log shows the split size is 326992763 bytes, about 300 MB.
> This task finished in 1 second, while the other tasks took about 15 seconds.
>  






[jira] [Commented] (SPARK-27663) Task accomplished incompletely but marked as success

2019-05-08 Thread Fan Yunbo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836047#comment-16836047
 ] 

Fan Yunbo commented on SPARK-27663:
---

When I reran the query, it finished normally.

!reran-0.png!

!reran-1.png!

> Task accomplished incompletely but marked as success
> 
>
> Key: SPARK-27663
> URL: https://issues.apache.org/jira/browse/SPARK-27663
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Fan Yunbo
>Priority: Major
> Attachments: image-2019-05-09-11-10-04-602.png, incomplte-task-0.png, 
> incomplte-task-1.png, incomplte-task-2.png, reran-0.png, reran-1.png, 
> reran-1.png
>
>
> It happens when running SQL queries with Spark SQL.
> The task was accomplished only partially but marked as success, since there
> were no exceptions and no failed or killed tasks.
> When I checked the query result, it was missing about 4000 records.
> The history web UI shows that the task input size is 23.5 MB, but the executor
> log shows the split size is 326992763 bytes, about 300 MB.
> This task finished in 1 second, while the other tasks took about 15 seconds.
>  






[jira] [Updated] (SPARK-27663) Task accomplished incompletely but marked as success

2019-05-08 Thread Fan Yunbo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Yunbo updated SPARK-27663:
--
Attachment: reran-1.png

> Task accomplished incompletely but marked as success
> 
>
> Key: SPARK-27663
> URL: https://issues.apache.org/jira/browse/SPARK-27663
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Fan Yunbo
>Priority: Major
> Attachments: image-2019-05-09-11-10-04-602.png, incomplte-task-0.png, 
> incomplte-task-1.png, incomplte-task-2.png, reran-0.png, reran-1.png, 
> reran-1.png
>
>
> It happens when running SQL queries with Spark SQL.
> The task was accomplished only partially but marked as success, since there
> were no exceptions and no failed or killed tasks.
> When I checked the query result, it was missing about 4000 records.
> The history web UI shows that the task input size is 23.5 MB, but the executor
> log shows the split size is 326992763 bytes, about 300 MB.
> This task finished in 1 second, while the other tasks took about 15 seconds.
>  






[jira] [Commented] (SPARK-27663) Task accomplished incompletely but marked as success

2019-05-08 Thread Fan Yunbo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836044#comment-16836044
 ] 

Fan Yunbo commented on SPARK-27663:
---

The incomplete task's id is 17.0 in stage 98517.0.

!incomplte-task-1.png!

The input size is 23.5 MB, and the task finished in 1 s: !incomplte-task-2.png!

The log shows the input split size:
{code:java}
Input split: 
hdfs://cqocdc/user/hive/warehouse/dw_user_useage_privilege_dt_mmdd/month_id=201904/day_id=20190422/17_0.snappy:0+326992763{code}
{code:java}
19/04/23 12:09:18 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 
6835988
19/04/23 12:09:18 INFO executor.Executor: Running task 17.0 in stage 98517.0 
(TID 6835988)
19/04/23 12:09:18 INFO broadcast.TorrentBroadcast: Started reading broadcast 
variable 173456
19/04/23 12:09:18 INFO memory.MemoryStore: Block broadcast_173456_piece0 stored 
as bytes in memory (estimated size 13.4 KB, free 15.2 GB)
19/04/23 12:09:18 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
173456 took 4 ms
19/04/23 12:09:18 INFO memory.MemoryStore: Block broadcast_173456 stored as 
values in memory (estimated size 30.3 KB, free 15.2 GB)
19/04/23 12:09:18 INFO rdd.HadoopRDD: Input split: 
hdfs://cqocdc/user/hive/warehouse/dw_user_useage_privilege_dt_mmdd/month_id=201904/day_id=20190422/17_0.snappy:0+326992763
19/04/23 12:09:18 INFO broadcast.TorrentBroadcast: Started reading broadcast 
variable 173452
19/04/23 12:09:18 INFO memory.MemoryStore: Block broadcast_173452_piece0 stored 
as bytes in memory (estimated size 30.8 KB, free 15.2 GB)
19/04/23 12:09:18 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
173452 took 3 ms
19/04/23 12:09:18 INFO memory.MemoryStore: Block broadcast_173452 stored as 
values in memory (estimated size 365.1 KB, free 15.3 GB)
19/04/23 12:09:18 INFO codegen.CodeGenerator: Code generated in 6.949728 ms
19/04/23 12:09:18 INFO codegen.CodeGenerator: Code generated in 20.909883 ms
19/04/23 12:09:18 INFO output.FileOutputCommitter: Saved output of task 
'attempt_20190423120856_98508_m_47_0' to 
hdfs://cqocdc/tmp/.staging/hive_hive_2019-04-23_12-08-56_154_3110404551071203558-1370/-ext-1/_temporary/0/task_20190423120856_98508_m_47
19/04/23 12:09:18 INFO mapred.SparkHadoopMapRedUtil: 
attempt_20190423120856_98508_m_47_0: Committed
19/04/23 12:09:18 INFO executor.Executor: Finished task 47.0 in stage 98508.0 
(TID 6835975). 3217 bytes result sent to driver
19/04/23 12:09:19 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 
TERM
19/04/23 12:09:19 INFO storage.DiskBlockManager: Shutdown hook called
19/04/23 12:09:19 INFO util.ShutdownHookManager: Shutdown hook called
19/04/23 12:09:19 INFO executor.Executor: Finished task 17.0 in stage 98517.0 
(TID 6835988). 3188 bytes result sent to driver
{code}
The file size and last modified time:

!image-2019-05-09-11-10-04-602.png!

The stage of the query total input is 14.9 G:

!incomplte-task-0.png!

 

> Task accomplished incompletely but marked as success
> 
>
> Key: SPARK-27663
> URL: https://issues.apache.org/jira/browse/SPARK-27663
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Fan Yunbo
>Priority: Major
> Attachments: image-2019-05-09-11-10-04-602.png, incomplte-task-0.png, 
> incomplte-task-1.png, incomplte-task-2.png
>
>
> It happens when running SQL queries with Spark SQL.
> The task was accomplished only partially but marked as success, since there
> were no exceptions and no failed or killed tasks.
> When I checked the query result, it was missing about 4000 records.
> The history web UI shows that the task input size is 23.5 MB, but the executor
> log shows the split size is 326992763 bytes, about 300 MB.
> This task finished in 1 second, while the other tasks took about 15 seconds.
>  






[jira] [Updated] (SPARK-27663) Task accomplished incompletely but marked as success

2019-05-08 Thread Fan Yunbo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Yunbo updated SPARK-27663:
--
Attachment: incomplte-task-0.png

> Task accomplished incompletely but marked as success
> 
>
> Key: SPARK-27663
> URL: https://issues.apache.org/jira/browse/SPARK-27663
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Fan Yunbo
>Priority: Major
> Attachments: image-2019-05-09-11-10-04-602.png, incomplte-task-0.png, 
> incomplte-task-1.png, incomplte-task-2.png
>
>
> It happens when running SQL queries with Spark SQL.
> The task was accomplished only partially but marked as success, since there
> were no exceptions and no failed or killed tasks.
> When I checked the query result, it was missing about 4000 records.
> The history web UI shows that the task input size is 23.5 MB, but the executor
> log shows the split size is 326992763 bytes, about 300 MB.
> This task finished in 1 second, while the other tasks took about 15 seconds.
>  






[jira] [Updated] (SPARK-27663) Task accomplished incompletely but marked as success

2019-05-08 Thread Fan Yunbo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Yunbo updated SPARK-27663:
--
Attachment: image-2019-05-09-11-10-04-602.png

> Task accomplished incompletely but marked as success
> 
>
> Key: SPARK-27663
> URL: https://issues.apache.org/jira/browse/SPARK-27663
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Fan Yunbo
>Priority: Major
> Attachments: image-2019-05-09-11-10-04-602.png, incomplte-task-1.png, 
> incomplte-task-2.png
>
>
> It happens when running SQL queries with Spark SQL.
> The task was accomplished only partially but marked as success, since there
> were no exceptions and no failed or killed tasks.
> When I checked the query result, it was missing about 4000 records.
> The history web UI shows that the task input size is 23.5 MB, but the executor
> log shows the split size is 326992763 bytes, about 300 MB.
> This task finished in 1 second, while the other tasks took about 15 seconds.
>  






[jira] [Updated] (SPARK-27663) Task accomplished incompletely but marked as success

2019-05-08 Thread Fan Yunbo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Yunbo updated SPARK-27663:
--
Attachment: incomplte-task-2.png
incomplte-task-1.png

> Task accomplished incompletely but marked as success
> 
>
> Key: SPARK-27663
> URL: https://issues.apache.org/jira/browse/SPARK-27663
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Fan Yunbo
>Priority: Major
> Attachments: incomplte-task-1.png, incomplte-task-2.png
>
>
> It happens when running SQL queries with Spark SQL.
> The task was accomplished only partially but marked as success, since there
> were no exceptions and no failed or killed tasks.
> When I checked the query result, it was missing about 4000 records.
> The history web UI shows that the task input size is 23.5 MB, but the executor
> log shows the split size is 326992763 bytes, about 300 MB.
> This task finished in 1 second, while the other tasks took about 15 seconds.
>  






[jira] [Created] (SPARK-27663) Task accomplished incompletely but marked as success

2019-05-08 Thread Fan Yunbo (JIRA)
Fan Yunbo created SPARK-27663:
-

 Summary: Task accomplished incompletely but marked as success
 Key: SPARK-27663
 URL: https://issues.apache.org/jira/browse/SPARK-27663
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.1.0
Reporter: Fan Yunbo


It happens when running SQL queries with Spark SQL.

The task was accomplished only partially but marked as success, since there
were no exceptions and no failed or killed tasks.

When I checked the query result, it was missing about 4000 records.

The history web UI shows that the task input size is 23.5 MB, but the executor
log shows the split size is 326992763 bytes, about 300 MB.

This task finished in 1 second, while the other tasks took about 15 seconds.

 






[jira] [Updated] (SPARK-27359) Joins on some array functions can be optimized

2019-05-08 Thread Nikolas Vanderhoof (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikolas Vanderhoof updated SPARK-27359:
---
Description: 
I encounter these cases frequently, and implemented the optimization manually 
(as shown here). If others experience this as well, perhaps it would be good to 
add appropriate tree transformations into catalyst. 

h1. Case 1
A join like this:
{code:scala}
left.join(
  right,
  arrays_overlap(left("a"), right("b")) // Creates a cartesian product in 
the logical plan
)
{code}

will produce the same results as:
{code:scala}
{
  val leftPrime = left.withColumn("exploded_a", explode(col("a")))
  val rightPrime = right.withColumn("exploded_b", explode(col("b")))

  leftPrime.join(
rightPrime,
leftPrime("exploded_a") === rightPrime("exploded_b")
  // Equijoin doesn't produce cartesian product
  ).drop("exploded_a", "exploded_b").distinct
}
{code}

h1. Case 2
A join like this:
{code:scala}
left.join(
  right,
  array_contains(left("arr"), right("value")) // Cartesian product in logical 
plan
)
{code}

will produce the same results as:
{code:scala}
{
  val leftPrime = left.withColumn("exploded_arr", explode(col("arr")))

  leftPrime.join(
right,
leftPrime("exploded_arr") === right("value") // Fast equijoin
  ).drop("exploded_arr").distinct
}
{code}

h1. Case 3
A join like this:
{code:scala}
left.join(
  right,
  array_contains(right("arr"), left("value")) // Cartesian product in logical 
plan
)
{code}

will produce the same results as:
{code:scala}
{
  val rightPrime = right.withColumn("exploded_arr", explode(col("arr")))

  left.join(
rightPrime,
left("value") === rightPrime("exploded_arr") // Fast equijoin
  ).drop("exploded_arr").distinct
}
{code}

  was:
I encounter these cases frequently, and implemented the optimization manually 
(as shown here). If others experience this as well, perhaps it would be good to 
add appropriate tree transformations into catalyst. I can create some rough 
draft implementations but expect I will need assistance when it comes to 
resolving the generating expressions in the logical plan.

h1. Case 1
A join like this:
{code:scala}
left.join(
  right,
  arrays_overlap(left("a"), right("b")) // Creates a cartesian product in 
the logical plan
)
{code}

will produce the same results as:
{code:scala}
{
  val leftPrime = left.withColumn("exploded_a", explode(col("a")))
  val rightPrime = right.withColumn("exploded_b", explode(col("b")))

  leftPrime.join(
rightPrime,
leftPrime("exploded_a") === rightPrime("exploded_b")
  // Equijoin doesn't produce cartesian product
  ).drop("exploded_a", "exploded_b").distinct
}
{code}

h1. Case 2
A join like this:
{code:scala}
left.join(
  right,
  array_contains(left("arr"), right("value")) // Cartesian product in logical 
plan
)
{code}

will produce the same results as:
{code:scala}
{
  val leftPrime = left.withColumn("exploded_arr", explode(col("arr")))

  leftPrime.join(
right,
leftPrime("exploded_arr") === right("value") // Fast equijoin
  ).drop("exploded_arr").distinct
}
{code}

h1. Case 3
A join like this:
{code:scala}
left.join(
  right,
  array_contains(right("arr"), left("value")) // Cartesian product in logical 
plan
)
{code}

will produce the same results as:
{code:scala}
{
  val rightPrime = right.withColumn("exploded_arr", explode(col("arr")))

  left.join(
rightPrime,
left("value") === rightPrime("exploded_arr") // Fast equijoin
  ).drop("exploded_arr").distinct
}
{code}


> Joins on some array functions can be optimized
> --
>
> Key: SPARK-27359
> URL: https://issues.apache.org/jira/browse/SPARK-27359
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 3.0.0
>Reporter: Nikolas Vanderhoof
>Priority: Minor
>
> I encounter these cases frequently, and implemented the optimization manually 
> (as shown here). If others experience this as well, perhaps it would be good 
> to add appropriate tree transformations into catalyst. 
> h1. Case 1
> A join like this:
> {code:scala}
> left.join(
>   right,
>   arrays_overlap(left("a"), right("b")) // Creates a cartesian product in 
> the logical plan
> )
> {code}
> will produce the same results as:
> {code:scala}
> {
>   val leftPrime = left.withColumn("exploded_a", explode(col("a")))
>   val rightPrime = right.withColumn("exploded_b", explode(col("b")))
>   leftPrime.join(
> rightPrime,
> leftPrime("exploded_a") === rightPrime("exploded_b")
>   // Equijoin doesn't produce cartesian product
>   ).drop("exploded_a", "exploded_b").distinct
> }
> {code}
> h1. Case 2
> A join like this:
> {code:scala}
> left.join(
>   right,
>   array_contains(left("arr"), right("value")) // Cartesian product in logical 
> plan
> )
> {code}
> will 

[jira] [Assigned] (SPARK-27359) Joins on some array functions can be optimized

2019-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27359:


Assignee: Apache Spark

> Joins on some array functions can be optimized
> --
>
> Key: SPARK-27359
> URL: https://issues.apache.org/jira/browse/SPARK-27359
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 3.0.0
>Reporter: Nikolas Vanderhoof
>Assignee: Apache Spark
>Priority: Minor
>
> I encounter these cases frequently, and implemented the optimization manually 
> (as shown here). If others experience this as well, perhaps it would be good 
> to add appropriate tree transformations into catalyst. 
> h1. Case 1
> A join like this:
> {code:scala}
> left.join(
>   right,
>   arrays_overlap(left("a"), right("b")) // Creates a cartesian product in 
> the logical plan
> )
> {code}
> will produce the same results as:
> {code:scala}
> {
>   val leftPrime = left.withColumn("exploded_a", explode(col("a")))
>   val rightPrime = right.withColumn("exploded_b", explode(col("b")))
>   leftPrime.join(
> rightPrime,
> leftPrime("exploded_a") === rightPrime("exploded_b")
>   // Equijoin doesn't produce cartesian product
>   ).drop("exploded_a", "exploded_b").distinct
> }
> {code}
> h1. Case 2
> A join like this:
> {code:scala}
> left.join(
>   right,
>   array_contains(left("arr"), right("value")) // Cartesian product in logical 
> plan
> )
> {code}
> will produce the same results as:
> {code:scala}
> {
>   val leftPrime = left.withColumn("exploded_arr", explode(col("arr")))
>   leftPrime.join(
> right,
> leftPrime("exploded_arr") === right("value") // Fast equijoin
>   ).drop("exploded_arr").distinct
> }
> {code}
> h1. Case 3
> A join like this:
> {code:scala}
> left.join(
>   right,
>   array_contains(right("arr"), left("value")) // Cartesian product in logical 
> plan
> )
> {code}
> will produce the same results as:
> {code:scala}
> {
>   val rightPrime = right.withColumn("exploded_arr", explode(col("arr")))
>   left.join(
> rightPrime,
> left("value") === rightPrime("exploded_arr") // Fast equijoin
>   ).drop("exploded_arr").distinct
> }
> {code}






[jira] [Assigned] (SPARK-27359) Joins on some array functions can be optimized

2019-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27359:


Assignee: (was: Apache Spark)

> Joins on some array functions can be optimized
> --
>
> Key: SPARK-27359
> URL: https://issues.apache.org/jira/browse/SPARK-27359
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 3.0.0
>Reporter: Nikolas Vanderhoof
>Priority: Minor
>
> I encounter these cases frequently, and implemented the optimization manually 
> (as shown here). If others experience this as well, perhaps it would be good 
> to add appropriate tree transformations into catalyst. 
> h1. Case 1
> A join like this:
> {code:scala}
> left.join(
>   right,
>   arrays_overlap(left("a"), right("b")) // Creates a cartesian product in 
> the logical plan
> )
> {code}
> will produce the same results as:
> {code:scala}
> {
>   val leftPrime = left.withColumn("exploded_a", explode(col("a")))
>   val rightPrime = right.withColumn("exploded_b", explode(col("b")))
>   leftPrime.join(
> rightPrime,
> leftPrime("exploded_a") === rightPrime("exploded_b")
>   // Equijoin doesn't produce cartesian product
>   ).drop("exploded_a", "exploded_b").distinct
> }
> {code}
> h1. Case 2
> A join like this:
> {code:scala}
> left.join(
>   right,
>   array_contains(left("arr"), right("value")) // Cartesian product in logical 
> plan
> )
> {code}
> will produce the same results as:
> {code:scala}
> {
>   val leftPrime = left.withColumn("exploded_arr", explode(col("arr")))
>   leftPrime.join(
> right,
> leftPrime("exploded_arr") === right("value") // Fast equijoin
>   ).drop("exploded_arr").distinct
> }
> {code}
> h1. Case 3
> A join like this:
> {code:scala}
> left.join(
>   right,
>   array_contains(right("arr"), left("value")) // Cartesian product in logical 
> plan
> )
> {code}
> will produce the same results as:
> {code:scala}
> {
>   val rightPrime = right.withColumn("exploded_arr", explode(col("arr")))
>   left.join(
> rightPrime,
> left("value") === rightPrime("exploded_arr") // Fast equijoin
>   ).drop("exploded_arr").distinct
> }
> {code}






[jira] [Assigned] (SPARK-27662) SQL tab shows two jobs for one SQL command

2019-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27662:


Assignee: (was: Apache Spark)

> SQL tab shows two jobs for one SQL command
> --
>
> Key: SPARK-27662
> URL: https://issues.apache.org/jira/browse/SPARK-27662
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: screenshot-1.png
>
>
>  !screenshot-1.png! 






[jira] [Assigned] (SPARK-27662) SQL tab shows two jobs for one SQL command

2019-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27662:


Assignee: Apache Spark

> SQL tab shows two jobs for one SQL command
> --
>
> Key: SPARK-27662
> URL: https://issues.apache.org/jira/browse/SPARK-27662
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
> Attachments: screenshot-1.png
>
>
>  !screenshot-1.png! 






[jira] [Resolved] (SPARK-27627) Make option "pathGlobFilter" as a general option for all file sources

2019-05-08 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27627.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24518
[https://github.com/apache/spark/pull/24518]

> Make option "pathGlobFilter" as a general option for all file sources
> -
>
> Key: SPARK-27627
> URL: https://issues.apache.org/jira/browse/SPARK-27627
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Background:
> The data source option "pathGlobFilter" was introduced for the binary file format 
> (https://github.com/apache/spark/pull/24354) and can be used to filter input 
> file names, e.g. reading only "*.png" files while "*.json" files are present in 
> the same directory.
> Proposal:
> Make the option "pathGlobFilter" a general option for all file sources. 
> The path filtering should happen during path globbing on the driver.
> Motivation:
> Filtering the file path names in file scan tasks on executors is kind of 
> ugly. 
> Impact:
> 1. The splitting of file partitions will be more balanced.
> 2. The metrics of file scan will be more accurate.
> 3. Users can use the option for reading other file sources.
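> A minimal usage sketch of what this would enable (assuming a SparkSession {{spark}} 
> and an example directory {{/data/mixed}} containing both png and json files):
> {code:scala}
> // Existing behavior: the binary file source already honors pathGlobFilter.
> val images = spark.read
>   .format("binaryFile")
>   .option("pathGlobFilter", "*.png")
>   .load("/data/mixed")
>
> // Proposed: the same option honored by any file source, applied during
> // path globbing on the driver rather than inside scan tasks.
> val onlyJson = spark.read
>   .option("pathGlobFilter", "*.json")
>   .json("/data/mixed")
> {code}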






[jira] [Assigned] (SPARK-27627) Make option "pathGlobFilter" as a general option for all file sources

2019-05-08 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27627:


Assignee: Gengliang Wang

> Make option "pathGlobFilter" as a general option for all file sources
> -
>
> Key: SPARK-27627
> URL: https://issues.apache.org/jira/browse/SPARK-27627
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Background:
> The data source option "pathGlobFilter" was introduced for the binary file format 
> (https://github.com/apache/spark/pull/24354) and can be used to filter input 
> file names, e.g. reading only "*.png" files while "*.json" files are present in 
> the same directory.
> Proposal:
> Make the option "pathGlobFilter" a general option for all file sources. 
> The path filtering should happen during path globbing on the driver.
> Motivation:
> Filtering the file path names in file scan tasks on executors is kind of 
> ugly. 
> Impact:
> 1. The splitting of file partitions will be more balanced.
> 2. The metrics of file scan will be more accurate.
> 3. Users can use the option for reading other file sources.






[jira] [Updated] (SPARK-27662) SQL tab shows two jobs for one SQL command

2019-05-08 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27662:

Description:  !screenshot-1.png! 

> SQL tab shows two jobs for one SQL command
> --
>
> Key: SPARK-27662
> URL: https://issues.apache.org/jira/browse/SPARK-27662
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: screenshot-1.png
>
>
>  !screenshot-1.png! 






[jira] [Updated] (SPARK-27662) SQL tab shows two jobs for one SQL command

2019-05-08 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27662:

Attachment: screenshot-1.png

> SQL tab shows two jobs for one SQL command
> --
>
> Key: SPARK-27662
> URL: https://issues.apache.org/jira/browse/SPARK-27662
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: screenshot-1.png
>
>







[jira] [Created] (SPARK-27662) SQL tab shows two jobs for one SQL command

2019-05-08 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-27662:
---

 Summary: SQL tab shows two jobs for one SQL command
 Key: SPARK-27662
 URL: https://issues.apache.org/jira/browse/SPARK-27662
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang
 Attachments: screenshot-1.png








[jira] [Assigned] (SPARK-26130) Change Event Timeline Display Functionality on the Stages Page to use either REST API or data from other tables

2019-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26130:


Assignee: (was: Apache Spark)

> Change Event Timeline Display Functionality on the Stages Page to use either 
> REST API or data from other tables
> ---
>
> Key: SPARK-26130
> URL: https://issues.apache.org/jira/browse/SPARK-26130
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> As per https://issues.apache.org/jira/browse/SPARK-21809, the Stages page 
> will use datatables for column sorting, searching, pagination, etc. 
> To support those datatables, we have changed the Stage page to use AJAX calls 
> to access the server APIs. However, the event timeline functionality on the 
> stage page has not been updated to use the REST API or to use data from the 
> datatables dynamically to reconstruct the graphs on the client end.






[jira] [Assigned] (SPARK-26130) Change Event Timeline Display Functionality on the Stages Page to use either REST API or data from other tables

2019-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26130:


Assignee: Apache Spark

> Change Event Timeline Display Functionality on the Stages Page to use either 
> REST API or data from other tables
> ---
>
> Key: SPARK-26130
> URL: https://issues.apache.org/jira/browse/SPARK-26130
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Parth Gandhi
>Assignee: Apache Spark
>Priority: Minor
>
> As per https://issues.apache.org/jira/browse/SPARK-21809, the Stages page 
> will use datatables for column sorting, searching, pagination, etc. 
> To support those datatables, we have changed the Stage page to use AJAX calls 
> to access the server APIs. However, the event timeline functionality on the 
> stage page has not been updated to use the REST API or to use data from the 
> datatables dynamically to reconstruct the graphs on the client end.






[jira] [Resolved] (SPARK-27624) Fix CalenderInterval to show an empty interval correctly

2019-05-08 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27624.
---
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.4.4
   3.0.0
   2.3.4

This is resolved via https://github.com/apache/spark/pull/24516

> Fix CalenderInterval to show an empty interval correctly
> 
>
> Key: SPARK-27624
> URL: https://issues.apache.org/jira/browse/SPARK-27624
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.3, 2.4.2, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.3.4, 3.0.0, 2.4.4
>
>
> *BEFORE*
> {code}
> scala> spark.readStream.schema("ts 
> timestamp").parquet("/tmp/t").withWatermark("ts", "1 microsecond").explain
> == Physical Plan ==
> EventTimeWatermark ts#0: timestamp, interval 1 microseconds
> +- StreamingRelation FileSource[/tmp/t], [ts#0]
> scala> spark.readStream.schema("ts 
> timestamp").parquet("/tmp/t").withWatermark("ts", "0 microsecond").explain
> == Physical Plan ==
> EventTimeWatermark ts#3: timestamp, interval
> +- StreamingRelation FileSource[/tmp/t], [ts#3]
> {code}
> *AFTER*
> {code}
> scala> spark.readStream.schema("ts 
> timestamp").parquet("/tmp/t").withWatermark("ts", "1 microsecond").explain
> == Physical Plan ==
> EventTimeWatermark ts#0: timestamp, interval 1 microseconds
> +- StreamingRelation FileSource[/tmp/t], [ts#0]
> scala> spark.readStream.schema("ts 
> timestamp").parquet("/tmp/t").withWatermark("ts", "0 microsecond").explain
> == Physical Plan ==
> EventTimeWatermark ts#3: timestamp, interval 0 microseconds
> +- StreamingRelation FileSource[/tmp/t], [ts#3]
> {code}






[jira] [Commented] (SPARK-27661) Add SupportsNamespaces interface for v2 catalogs

2019-05-08 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835921#comment-16835921
 ] 

Dongjoon Hyun commented on SPARK-27661:
---

Ur, it seems that something went wrong. I don't know why this is assigned to 
me. Sorry, [~rdblue].

> Add SupportsNamespaces interface for v2 catalogs
> 
>
> Key: SPARK-27661
> URL: https://issues.apache.org/jira/browse/SPARK-27661
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ryan Blue
>Priority: Major
>
> Some catalogs support namespace operations, like creating or dropping 
> namespaces. The v2 API should have a way to expose these operations to Spark.






[jira] [Assigned] (SPARK-27661) Add SupportsNamespaces interface for v2 catalogs

2019-05-08 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-27661:
-

Assignee: (was: Dongjoon Hyun)

> Add SupportsNamespaces interface for v2 catalogs
> 
>
> Key: SPARK-27661
> URL: https://issues.apache.org/jira/browse/SPARK-27661
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ryan Blue
>Priority: Major
>
> Some catalogs support namespace operations, like creating or dropping 
> namespaces. The v2 API should have a way to expose these operations to Spark.






[jira] [Assigned] (SPARK-27661) Add SupportsNamespaces interface for v2 catalogs

2019-05-08 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-27661:
-

Assignee: Dongjoon Hyun

> Add SupportsNamespaces interface for v2 catalogs
> 
>
> Key: SPARK-27661
> URL: https://issues.apache.org/jira/browse/SPARK-27661
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ryan Blue
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Some catalogs support namespace operations, like creating or dropping 
> namespaces. The v2 API should have a way to expose these operations to Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25139) PythonRunner#WriterThread released block after TaskRunner finally block which invoke BlockManager#releaseAllLocksForTask

2019-05-08 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25139:
--
Fix Version/s: 2.4.4

> PythonRunner#WriterThread released block after TaskRunner finally block which 
>  invoke BlockManager#releaseAllLocksForTask
> -
>
> Key: SPARK-25139
> URL: https://issues.apache.org/jira/browse/SPARK-25139
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.3.1
>Reporter: DENG FEI
>Assignee: Xingbo Jiang
>Priority: Major
> Fix For: 3.0.0, 2.4.4
>
>
> We run PySpark Streaming on YARN, and the executor dies because of this error: 
> the task released its lock when it finished, but the Python writer thread had 
> not actually released the lock yet.
> Normally the task just double-checks the lock, but here the ordering went wrong.
> The executor trace log is below:
> 18/08/17 13:52:20 Executor task launch worker for task 137 DEBUG BlockManager: Getting local block input-0-1534485138800
> 18/08/17 13:52:20 Executor task launch worker for task 137 TRACE BlockInfoManager: Task 137 trying to acquire read lock for input-0-1534485138800
> 18/08/17 13:52:20 Executor task launch worker for task 137 TRACE BlockInfoManager: Task 137 acquired read lock for input-0-1534485138800
> 18/08/17 13:52:20 Executor task launch worker for task 137 DEBUG BlockManager: Level for block input-0-1534485138800 is StorageLevel(disk, memory, 1 replicas)
> 18/08/17 13:52:20 Executor task launch worker for task 137 INFO BlockManager: Found block input-0-1534485138800 locally
> 18/08/17 13:52:20 Executor task launch worker for task 137 INFO PythonRunner: Times: total = 8, boot = 3, init = 5, finish = 0
> 18/08/17 13:52:20 stdout writer for python TRACE BlockInfoManager: Task 137 releasing lock for input-0-1534485138800
> 18/08/17 13:52:20 Executor task launch worker for task 137 INFO Executor: 1 block locks were not released by TID = 137: [input-0-1534485138800]
> 18/08/17 13:52:20 stdout writer for python ERROR Utils: Uncaught exception in thread stdout writer for python
> java.lang.AssertionError: assertion failed: Block input-0-1534485138800 is not locked for reading
> at scala.Predef$.assert(Predef.scala:170)
> at org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299)
> at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:769)
> at org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:540)
> at org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:44)
> at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:33)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:213)
> at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:407)
> at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991)
> at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)
> 18/08/17 13:52:20 stdout writer for python ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[stdout writer for python,5,main]
>  
> I think we should wait for the WriterThread to finish after Task#run.
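
(A conceptual sketch of the ordering suggested above, not actual Spark code; the names below are illustrative placeholders, not real Spark signatures.)

{code:scala}
// Sketch only: block locks held by a task should be released only after the
// Python writer thread has finished reading the blocks.
def runTaskThenRelease(task: Runnable, writerThread: Thread, releaseAllLocksForTask: () => Unit): Unit = {
  try {
    task.run()
  } finally {
    writerThread.join()       // wait until the stdout writer for Python is done
    releaseAllLocksForTask()  // only then release this task's block locks
  }
}
{code}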



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27660) Allow PySpark toLocalIterator to pre-fetch data

2019-05-08 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-27660.
--
Resolution: Duplicate

Somehow this issue got created twice

> Allow PySpark toLocalIterator to pre-fetch data
> ---
>
> Key: SPARK-27660
> URL: https://issues.apache.org/jira/browse/SPARK-27660
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> In SPARK-23961, data was no longer prefetched so that the local iterator 
> could close cleanly in the case of not consuming all of the data. If the user 
> intends to iterate over all elements, then prefetching data could bring back 
> any lost performance. We would need to run some tests to see if the 
> performance is worth it due to additional complexity and extra usage of 
> memory. The option to prefetch could be added as a user conf, and it's possible 
> this could improve the Scala toLocalIterator also.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27659) Allow PySpark toLocalIterator to prefetch data

2019-05-08 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-27659:
-
Summary: Allow PySpark toLocalIterator to prefetch data  (was: Allow 
PySpark toLocalIterator to pre-fetch data)

> Allow PySpark toLocalIterator to prefetch data
> --
>
> Key: SPARK-27659
> URL: https://issues.apache.org/jira/browse/SPARK-27659
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> In SPARK-23961, data was no longer prefetched so that the local iterator 
> could close cleanly in the case of not consuming all of the data. If the user 
> intends to iterate over all elements, then prefetching data could bring back 
> any lost performance. We would need to run some tests to see if the 
> performance is worth it due to additional complexity and extra usage of 
> memory. The option to prefetch could be added as a user conf, and it's possible 
> this could improve the Scala toLocalIterator also.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-27660) Allow PySpark toLocalIterator to pre-fetch data

2019-05-08 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler closed SPARK-27660.


> Allow PySpark toLocalIterator to pre-fetch data
> ---
>
> Key: SPARK-27660
> URL: https://issues.apache.org/jira/browse/SPARK-27660
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> In SPARK-23961, data was no longer prefetched so that the local iterator 
> could close cleanly in the case of not consuming all of the data. If the user 
> intends to iterate over all elements, then prefetching data could bring back 
> any lost performance. We would need to run some tests to see if the 
> performance is worth it due to additional complexity and extra usage of 
> memory. The option to prefetch could be added as a user conf, and it's possible 
> this could improve the Scala toLocalIterator also.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27661) Add SupportsNamespaces interface for v2 catalogs

2019-05-08 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27661:
-

 Summary: Add SupportsNamespaces interface for v2 catalogs
 Key: SPARK-27661
 URL: https://issues.apache.org/jira/browse/SPARK-27661
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.3
Reporter: Ryan Blue


Some catalogs support namespace operations, like creating or dropping 
namespaces. The v2 API should have a way to expose these operations to Spark.
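
As a purely hypothetical sketch (not the interface this issue will actually add; the names and signatures below are assumptions), a namespace-aware mix-in for v2 catalog plugins could look roughly like:

{code:scala}
// Hypothetical sketch only.
trait SupportsNamespaces {
  // namespaces are multi-part names, e.g. Array("a", "b")
  def listNamespaces(): Array[Array[String]]
  def namespaceExists(namespace: Array[String]): Boolean
  def createNamespace(namespace: Array[String], properties: java.util.Map[String, String]): Unit
  def dropNamespace(namespace: Array[String]): Boolean
}
{code}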



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27657) ml.util.Instrumentation.logFailure doesn't log error message

2019-05-08 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835780#comment-16835780
 ] 

Xiangrui Meng commented on SPARK-27657:
---

This is how the JDK formats the error string: 
https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/lang/Throwable.java#L654
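
For reference, a minimal sketch (using only the JDK, nothing from this ticket) that captures both the message and the stack trace, the way {{Throwable.printStackTrace}} does:

{code:scala}
import java.io.{PrintWriter, StringWriter}

// Produces "<exception class>: <message>" followed by the stack frames,
// instead of only the frames.
def formatException(e: Throwable): String = {
  val sw = new StringWriter()
  e.printStackTrace(new PrintWriter(sw))
  sw.toString
}
{code}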

> ml.util.Instrumentation.logFailure doesn't log error message
> 
>
> Key: SPARK-27657
> URL: https://issues.apache.org/jira/browse/SPARK-27657
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Xiangrui Meng
>Priority: Major
>
> It only gets the stack trace without the error message.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27657) ml.util.Instrumentation.logFailure doesn't log error message

2019-05-08 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835780#comment-16835780
 ] 

Xiangrui Meng edited comment on SPARK-27657 at 5/8/19 5:42 PM:
---

This is how the JDK formats an exception: 
https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/lang/Throwable.java#L654


was (Author: mengxr):
This is how JDK format the error string: 
https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/lang/Throwable.java#L654

> ml.util.Instrumentation.logFailure doesn't log error message
> 
>
> Key: SPARK-27657
> URL: https://issues.apache.org/jira/browse/SPARK-27657
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Xiangrui Meng
>Priority: Major
>
> It only gets the stack trace without the error message.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27657) ml.util.Instrumentation.logFailure doesn't log error message

2019-05-08 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835779#comment-16835779
 ] 

Xiangrui Meng commented on SPARK-27657:
---

[~mrbago] Can you send a PR to fix it?

> ml.util.Instrumentation.logFailure doesn't log error message
> 
>
> Key: SPARK-27657
> URL: https://issues.apache.org/jira/browse/SPARK-27657
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Xiangrui Meng
>Priority: Major
>
> It only gets the stack trace without the error message.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27660) Allow PySpark toLocalIterator to pre-fetch data

2019-05-08 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-27660:


 Summary: Allow PySpark toLocalIterator to pre-fetch data
 Key: SPARK-27660
 URL: https://issues.apache.org/jira/browse/SPARK-27660
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Bryan Cutler


In SPARK-23961, data was no longer prefetched so that the local iterator could 
close cleanly in the case of not consuming all of the data. If the user intends 
to iterate over all elements, then prefetching data could bring back any lost 
performance. We would need to run some tests to see if the performance is worth 
it due to additional complexity and extra usage of memory. The option to 
prefetch could be added as a user conf, and it's possible this could improve the 
Scala toLocalIterator also.
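
As a rough conceptual sketch only (not the proposed implementation, and independent of Spark's internals), prefetching here just means keeping the next element fetched ahead on a background thread while the consumer processes the current one:

{code:scala}
import java.util.concurrent.{Callable, Executors, Future}

// Conceptual sketch: wrap an iterator so the next element is fetched eagerly
// on a background thread while the caller works on the current one.
class PrefetchIterator[T](underlying: Iterator[T]) extends Iterator[T] {
  private val pool = Executors.newSingleThreadExecutor()
  private var pending: Future[Option[T]] = fetch()

  private def fetch(): Future[Option[T]] =
    pool.submit(new Callable[Option[T]] {
      override def call(): Option[T] =
        if (underlying.hasNext) Some(underlying.next()) else None
    })

  override def hasNext: Boolean = {
    val more = pending.get().isDefined
    if (!more) pool.shutdown()  // nothing left to prefetch
    more
  }

  override def next(): T = {
    val value = pending.get().getOrElse(throw new NoSuchElementException("empty iterator"))
    pending = fetch()           // immediately start fetching the following element
    value
  }
}
{code}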



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27659) Allow PySpark toLocalIterator to pre-fetch data

2019-05-08 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-27659:


 Summary: Allow PySpark toLocalIterator to pre-fetch data
 Key: SPARK-27659
 URL: https://issues.apache.org/jira/browse/SPARK-27659
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Bryan Cutler


In SPARK-23961, data was no longer prefetched so that the local iterator could 
close cleanly in the case of not consuming all of the data. If the user intends 
to iterate over all elements, then prefetching data could bring back any lost 
performance. We would need to run some tests to see if the performance is worth 
it due to additional complexity and extra usage of memory. The option to 
prefetch could be added as a user conf, and it's possible this could improve the 
Scala toLocalIterator also.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27658) Catalog API to load functions

2019-05-08 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27658:
-

 Summary: Catalog API to load functions
 Key: SPARK-27658
 URL: https://issues.apache.org/jira/browse/SPARK-27658
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.3
Reporter: Ryan Blue


SPARK-24252 added an API that catalog plugins can implement to expose table 
operations. Catalogs should also be able to provide function implementations to 
Spark.
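
As a purely hypothetical sketch (the names and signatures below are assumptions, not the API this issue will define), a function-loading counterpart to the table catalog could look roughly like:

{code:scala}
// Hypothetical sketch only.
trait CatalogFunctionDescription {
  def name: String
  def description: String
}

trait FunctionCatalog {
  // list functions defined in a (possibly multi-part) namespace
  def listFunctions(namespace: Array[String]): Array[String]
  // resolve a function by name so Spark can bind and invoke it
  def loadFunction(namespace: Array[String], name: String): CatalogFunctionDescription
}
{code}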



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27657) ml.util.Instrumentation.logFailure doesn't log error message

2019-05-08 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-27657:
-

 Summary: ml.util.Instrumentation.logFailure doesn't log error 
message
 Key: SPARK-27657
 URL: https://issues.apache.org/jira/browse/SPARK-27657
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.4.3, 3.0.0
Reporter: Xiangrui Meng


It only gets the stack trace without the error message.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27648) In Spark2.4 Structured Streaming:The executor storage memory increasing over time

2019-05-08 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835751#comment-16835751
 ] 

Jungtaek Lim commented on SPARK-27648:
--

[~yy3b2007com]

FYI, you can monitor the (approximate) memory usage of state across executors 
by attaching a streaming query listener.

[http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#reporting-metrics-programmatically-using-asynchronous-apis]

{{QueryProgressEvent}} will contain "state" information which shows the number of 
keys, the memory usage of the last version, and the memory usage of all 
versions kept in memory.

So if you write this information somewhere like a file or a Kafka topic and 
monitor it, you will see the memory growth of the state as well, and can 
determine whether growing state is the root issue or not.
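
For example, a minimal listener sketch (assuming only the standard {{StreamingQueryListener}} API; where you write the values is up to you) could look like:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder.getOrCreate()

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {}
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // each stateful operator reports its key count and state memory usage
    event.progress.stateOperators.foreach { s =>
      println(s"batch=${event.progress.batchId} numRowsTotal=${s.numRowsTotal} " +
        s"memoryUsedBytes=${s.memoryUsedBytes}")
    }
  }
})
{code}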

> In Spark2.4 Structured Streaming:The executor storage memory increasing over 
> time
> -
>
> Key: SPARK-27648
> URL: https://issues.apache.org/jira/browse/SPARK-27648
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: tommy duan
>Priority: Major
>
> *Spark program business logic:*
>  Read a topic from Kafka, aggregate the streaming data, and then output the 
> result to another Kafka topic.
> *Problem Description:*
>  *1) Using Spark Structured Streaming in a CDH environment (Spark 2.2)*, memory 
> overflow problems often occur (because too many versions of state are stored 
> in memory; this has been fixed in Spark 2.4).
> {code:java}
> /spark-submit \
> --conf "spark.yarn.executor.memoryOverhead=4096M" \
> --num-executors 15 \
> --executor-memory 3G \
> --executor-cores 2 \
> --driver-memory 6G
> {code}
> Executor memory exceptions occur when running with these submit resources under 
> Spark 2.2, and the normal running time does not exceed one day.
> The workaround was to set the executor memory larger than before:
> My spark-submit script is as follows:
> {code:java}
> /spark-submit \
> --conf "spark.yarn.executor.memoryOverhead=4096M" \
> --num-executors 15 \
> --executor-memory 46G \
> --executor-cores 3 \
> --driver-memory 6G \
> ...
> {code}
> In this case, the spark program can be guaranteed to run stably for a long 
> time, and the executor storage memory is less than 10M (it has been running 
> stably for more than 20 days).
> *2) From the upgrade information of Spark 2.4, we can see that the problem of 
> large memory consumption of state storage has been solved in Spark 2.4.* 
>  So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, 
> and found that the use of memory was reduced.
>  But a problem arises: as the running time increases, the executor storage 
> memory keeps growing (see Executors -> Storage Memory in the Spark on YARN 
> Resource Manager UI).
>  This program has been running for 14 days (under SPARK 2.2, running with 
> this submit resource, the normal running time is not more than one day, 
> Executor memory abnormalities will occur).
>  The script submitted by the program under spark2.4 is as follows:
> {code:java}
> /spark-submit \
>  --conf "spark.yarn.executor.memoryOverhead=4096M" \
>  --num-executors 15 \
>  --executor-memory 3G \
>  --executor-cores 2 \
>  --driver-memory 6G 
> {code}
> Under Spark 2.4, I tracked the executor storage memory over time while the 
> Spark program was running:
> |Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
> |23.5H|41.6MB/1.5GB|1.770212766|
> |108.4H|460.2MB/1.5GB|4.245387454|
> |131.7H|559.1MB/1.5GB|4.245254366|
> |135.4H|575MB/1.5GB|4.246676514|
> |153.6H|641.2MB/1.5GB|4.174479167|
> |219H|888.1MB/1.5GB|4.055251142|
> |263H|1126.4MB/1.5GB|4.282889734|
> |309H|1228.8MB/1.5GB|3.976699029|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23098) Migrate Kafka batch source to v2

2019-05-08 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835700#comment-16835700
 ] 

Ryan Blue commented on SPARK-23098:
---

I don't think there's a DSv2-related obstacle to implementing this.

> Migrate Kafka batch source to v2
> 
>
> Key: SPARK-23098
> URL: https://issues.apache.org/jira/browse/SPARK-23098
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23098) Migrate Kafka batch source to v2

2019-05-08 Thread Dylan Guedes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835693#comment-16835693
 ] 

Dylan Guedes commented on SPARK-23098:
--

[~gsomogyi] No, I didn't pick this one. Anyway, happy to see you interested 
in it!

> Migrate Kafka batch source to v2
> 
>
> Key: SPARK-23098
> URL: https://issues.apache.org/jira/browse/SPARK-23098
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23098) Migrate Kafka batch source to v2

2019-05-08 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835684#comment-16835684
 ] 

Gabor Somogyi commented on SPARK-23098:
---

[~rdblue] Since you're heavily involved in the DSv2 implementation, I would like 
to ask whether you see any obstacle to implementing this.
Happy to pick this up.


> Migrate Kafka batch source to v2
> 
>
> Key: SPARK-23098
> URL: https://issues.apache.org/jira/browse/SPARK-23098
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25861) Remove unused refreshInterval parameter from the headerSparkPage method.

2019-05-08 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835598#comment-16835598
 ] 

shahid commented on SPARK-25861:


[~drKalko] Normally the browser supports refreshing the page.
This JIRA was for removing unused code from Spark.

> Remove unused refreshInterval parameter from the headerSparkPage method.
> 
>
> Key: SPARK-25861
> URL: https://issues.apache.org/jira/browse/SPARK-25861
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: shahid
>Assignee: shahid
>Priority: Trivial
> Fix For: 3.0.0
>
>
> https://github.com/apache/spark/blob/d5573c578a1eea9ee04886d9df37c7178e67bb30/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L221
>  
> refreshInterval is not used anywhere in the headerSparkPage method. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27641) Unregistering a single Metrics Source with no metrics leads to removing all the metrics from other sources with the same name

2019-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27641:


Assignee: Apache Spark

> Unregistering a single Metrics Source with no metrics leads to removing all 
> the metrics from other sources with the same name
> -
>
> Key: SPARK-27641
> URL: https://issues.apache.org/jira/browse/SPARK-27641
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.3, 2.3.3, 2.4.2
>Reporter: Sergey Zhemzhitsky
>Assignee: Apache Spark
>Priority: Major
>
> Currently Spark allows registering multiple Metric Sources with the same 
> source name like the following
> {code:scala}
> val acc1 = sc.longAccumulator
> LongAccumulatorSource.register(sc, {"acc1" -> acc1})
> val acc2 = sc.longAccumulator
> LongAccumulatorSource.register(sc, {"acc2" -> acc2})
> {code}
> In that case there are two metric sources registered and both of these 
> sources have the same name - 
> [AccumulatorSource|https://github.com/apache/spark/blob/6ef45301a46c47c12fbc74bb9ceaffea685ed944/core/src/main/scala/org/apache/spark/metrics/source/AccumulatorSource.scala#L47]
> If you try to unregister the source with no accumulators and metrics 
> registered like the following
> {code:scala}
> SparkEnv.get.metricsSystem.removeSource(new LongAccumulatorSource)
> {code}
> ... then all the metrics for all the sources with the same name will be 
> unregistered because of the 
> [following|https://github.com/apache/spark/blob/6ef45301a46c47c12fbc74bb9ceaffea685ed944/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L171]
>  snippet which removes all matching records which start with the 
> corresponding prefix which includes the source name, but does not include 
> metric name to be removed.
> {code:scala}
> def removeSource(source: Source) {
>   sources -= source
>   val regName = buildRegistryName(source)
>   registry.removeMatching((name: String, _: Metric) => 
> name.startsWith(regName))
> }
> {code}
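
One possible direction (a sketch only, not necessarily the fix this issue will adopt) is to remove exactly the metric names registered by the given source instance, rather than everything whose registry name merely starts with the source's prefix:

{code:scala}
import scala.collection.JavaConverters._
import com.codahale.metrics.MetricRegistry
import org.apache.spark.metrics.source.Source

// Sketch: drop only the metrics that belong to this source instance.
def removeSourceMetrics(source: Source, registry: MetricRegistry, regName: String): Unit = {
  source.metricRegistry.getNames.asScala.foreach { metricName =>
    registry.remove(MetricRegistry.name(regName, metricName))
  }
}
{code}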



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27641) Unregistering a single Metrics Source with no metrics leads to removing all the metrics from other sources with the same name

2019-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27641:


Assignee: (was: Apache Spark)

> Unregistering a single Metrics Source with no metrics leads to removing all 
> the metrics from other sources with the same name
> -
>
> Key: SPARK-27641
> URL: https://issues.apache.org/jira/browse/SPARK-27641
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.3, 2.3.3, 2.4.2
>Reporter: Sergey Zhemzhitsky
>Priority: Major
>
> Currently Spark allows registering multiple Metric Sources with the same 
> source name like the following
> {code:scala}
> val acc1 = sc.longAccumulator
> LongAccumulatorSource.register(sc, {"acc1" -> acc1})
> val acc2 = sc.longAccumulator
> LongAccumulatorSource.register(sc, {"acc2" -> acc2})
> {code}
> In that case there are two metric sources registered and both of these 
> sources have the same name - 
> [AccumulatorSource|https://github.com/apache/spark/blob/6ef45301a46c47c12fbc74bb9ceaffea685ed944/core/src/main/scala/org/apache/spark/metrics/source/AccumulatorSource.scala#L47]
> If you try to unregister the source with no accumulators and metrics 
> registered like the following
> {code:scala}
> SparkEnv.get.metricsSystem.removeSource(new LongAccumulatorSource)
> {code}
> ... then all the metrics for all the sources with the same name will be 
> unregistered because of the 
> [following|https://github.com/apache/spark/blob/6ef45301a46c47c12fbc74bb9ceaffea685ed944/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L171]
>  snippet which removes all matching records which start with the 
> corresponding prefix which includes the source name, but does not include 
> metric name to be removed.
> {code:scala}
> def removeSource(source: Source) {
>   sources -= source
>   val regName = buildRegistryName(source)
>   registry.removeMatching((name: String, _: Metric) => 
> name.startsWith(regName))
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25861) Remove unused refreshInterval parameter from the headerSparkPage method.

2019-05-08 Thread Artem Kalchenko (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835565#comment-16835565
 ] 

Artem Kalchenko commented on SPARK-25861:
-

why remove it instead of adding refresh interval to HTML page?

> Remove unused refreshInterval parameter from the headerSparkPage method.
> 
>
> Key: SPARK-25861
> URL: https://issues.apache.org/jira/browse/SPARK-25861
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: shahid
>Assignee: shahid
>Priority: Trivial
> Fix For: 3.0.0
>
>
> https://github.com/apache/spark/blob/d5573c578a1eea9ee04886d9df37c7178e67bb30/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L221
>  
> refreshInterval is not used anywhere in the headerSparkPage method. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20166) Use XXX for ISO timezone instead of ZZ which is FastDateFormat specific in CSV/JSON time related options

2019-05-08 Thread Shyama (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1683#comment-1683
 ] 

Shyama commented on SPARK-20166:


[~srowen], any clue about this?

[https://stackoverflow.com/questions/56020103/how-to-pass-date-timestamp-as-lowerbound-upperbound-in-spark-sql-2-4-1v]

> Use XXX for ISO timezone instead of ZZ which is FastDateFormat specific in 
> CSV/JSON time related options
> 
>
> Key: SPARK-20166
> URL: https://issues.apache.org/jira/browse/SPARK-20166
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.2.0
>
>
> We can use the {{XXX}} format instead of {{ZZ}}. {{ZZ}} seems to be a 
> {{FastDateFormat}}-specific option. Please see 
> https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html#iso8601timezone
>  and 
> https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/time/FastDateFormat.html
> {{ZZ}} supports "ISO 8601 extended format time zones", but it seems to be a 
> {{FastDateFormat}}-specific option.
> It seems we had better replace {{ZZ}} with {{XXX}} because they appear to use the same 
> strategy - 
> https://github.com/apache/commons-lang/blob/8767cd4f1a6af07093c1e6c422dae8e574be7e5e/src/main/java/org/apache/commons/lang3/time/FastDateParser.java#L930.
>  
> I also checked the code and manually debugged it to be sure. It seems both 
> cases use the same pattern {code}( Z|(?:[+-]\\d{2}(?::)\\d{2})) {code}.
> Note that this is a documentation fix, not a behaviour change, because 
> {{ZZ}} seems to be an invalid date format in {{SimpleDateFormat}}, as documented in 
> {{DataFrameReader}}:
> {quote}
>    * `timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss.SSSZZ`): sets the 
> string that
>    * indicates a timestamp format. Custom date formats follow the formats at
>    * `java.text.SimpleDateFormat`. This applies to timestamp type.
> {quote}
> {code}
> scala> new 
> java.text.SimpleDateFormat("-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00")
> res4: java.util.Date = Tue Mar 21 20:00:00 KST 2017
> scala>  new 
> java.text.SimpleDateFormat("-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z")
> res10: java.util.Date = Tue Mar 21 09:00:00 KST 2017
> scala> new 
> java.text.SimpleDateFormat("-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00")
> java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000-11:00"
>   at java.text.DateFormat.parse(DateFormat.java:366)
>   ... 48 elided
> scala>  new 
> java.text.SimpleDateFormat("-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z")
> java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000Z"
>   at java.text.DateFormat.parse(DateFormat.java:366)
>   ... 48 elided
> {code}
> {code}
> scala> 
> org.apache.commons.lang3.time.FastDateFormat.getInstance("-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00")
> res7: java.util.Date = Tue Mar 21 20:00:00 KST 2017
> scala> 
> org.apache.commons.lang3.time.FastDateFormat.getInstance("-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z")
> res1: java.util.Date = Tue Mar 21 09:00:00 KST 2017
> scala> 
> org.apache.commons.lang3.time.FastDateFormat.getInstance("-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00")
> res8: java.util.Date = Tue Mar 21 20:00:00 KST 2017
> scala> 
> org.apache.commons.lang3.time.FastDateFormat.getInstance("-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z")
> res2: java.util.Date = Tue Mar 21 09:00:00 KST 2017
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27649) Unify the way you use 'spark.network.timeout'

2019-05-08 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27649:
---

Assignee: jiaan.geng

> Unify the way you use 'spark.network.timeout'
> -
>
> Key: SPARK-27649
> URL: https://issues.apache.org/jira/browse/SPARK-27649
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.4.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Minor
> Fix For: 3.0.0
>
>
> For historical reasons, structured streaming still refers directly to
> {code:java}
> spark.network.timeout{code}
> , even though 
> {code:java}
> org.apache.spark.internal.config.Network.NETWORK_TIMEOUT{code}
> is now available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf

2019-05-08 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27555:


Assignee: Sandeep Katta  (was: Hyukjin Kwon)

> cannot create table by using the hive default fileformat in both 
> hive-site.xml and spark-defaults.conf
> --
>
> Key: SPARK-27555
> URL: https://issues.apache.org/jira/browse/SPARK-27555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Hui WANG
>Assignee: Sandeep Katta
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: Try.pdf
>
>
> *You can see details in attachment called Try.pdf.*
> I already seen https://issues.apache.org/jira/browse/SPARK-17620 
> and https://issues.apache.org/jira/browse/SPARK-18397
> and I checked the Spark source code for this change: if you set 
> "spark.sql.hive.convertCTAS=true", then Spark will use 
> "spark.sql.sources.default" (which is parquet) as the storage format in the "create 
> table as select" scenario.
> But my case is just creating a table without a select. When I set 
> hive.default.fileformat=parquet in hive-site.xml, or set 
> spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, and then 
> create a table, the Hive table still uses the textfile fileformat.
>  
> It seems HiveSerDe gets the value of the hive.default.fileformat parameter 
> from SQLConf.
> The parameter values in SQLConf are copied from SparkContext's SparkConf at 
> SparkSession initialization, while the configuration parameters in 
> hive-site.xml are loaded into SparkContext's hadoopConfiguration by 
> SharedState, and all configs with the "spark.hadoop" prefix are set on the 
> Hadoop configuration, so the setting does not take effect.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17906) MulticlassClassificationEvaluator support target label

2019-05-08 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-17906.
--
Resolution: Not A Problem

> MulticlassClassificationEvaluator support target label
> --
>
> Key: SPARK-17906
> URL: https://issues.apache.org/jira/browse/SPARK-17906
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> In practice, I sometimes only focus on the metric of one specific label.
> For example, in CTR prediction, I usually only care about the F1 of the positive class.
> In sklearn, this is supported:
> {code}
> >>> from sklearn.metrics import classification_report
> >>> y_true = [0, 1, 2, 2, 2]
> >>> y_pred = [0, 0, 2, 2, 1]
> >>> target_names = ['class 0', 'class 1', 'class 2']
> >>> print(classification_report(y_true, y_pred, target_names=target_names))
>              precision    recall  f1-score   support
> class 0   0.50  1.00  0.67 1
> class 1   0.00  0.00  0.00 1
> class 2   1.00  0.67  0.80 3
> avg / total   0.70  0.60  0.61 5
> {code}
> Now, ML only supports `weightedXXX` metrics, so I think there is room to 
> improve.
> The API may be designed like this:
> {code}
> val dataset = ...
> val evaluator = new MulticlassClassificationEvaluator
> evaluator.setMetricName("f1")
> evaluator.evaluate(dataset)   // weightedF1 of all classes
> evaluator.setTarget(0.0).setMetricName("f1")
> evaluator.evaluate(dataset)   // F1 of class "0"
> {code}
> what's your opinion? [~yanboliang][~josephkb][~sethah][~srowen] 
> If this is useful and acceptable, I'm happy to work on this. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27641) Unregistering a single Metrics Source with no metrics leads to removing all the metrics from other sources with the same name

2019-05-08 Thread chunpinghe (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835504#comment-16835504
 ] 

chunpinghe commented on SPARK-27641:


i am working on this!

> Unregistering a single Metrics Source with no metrics leads to removing all 
> the metrics from other sources with the same name
> -
>
> Key: SPARK-27641
> URL: https://issues.apache.org/jira/browse/SPARK-27641
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.3, 2.3.3, 2.4.2
>Reporter: Sergey Zhemzhitsky
>Priority: Major
>
> Currently Spark allows registering multiple Metric Sources with the same 
> source name like the following
> {code:scala}
> val acc1 = sc.longAccumulator
> LongAccumulatorSource.register(sc, {"acc1" -> acc1})
> val acc2 = sc.longAccumulator
> LongAccumulatorSource.register(sc, {"acc2" -> acc2})
> {code}
> In that case there are two metric sources registered and both of these 
> sources have the same name - 
> [AccumulatorSource|https://github.com/apache/spark/blob/6ef45301a46c47c12fbc74bb9ceaffea685ed944/core/src/main/scala/org/apache/spark/metrics/source/AccumulatorSource.scala#L47]
> If you try to unregister the source with no accumulators and metrics 
> registered like the following
> {code:scala}
> SparkEnv.get.metricsSystem.removeSource(new LongAccumulatorSource)
> {code}
> ... then all the metrics for all the sources with the same name will be 
> unregistered because of the 
> [following|https://github.com/apache/spark/blob/6ef45301a46c47c12fbc74bb9ceaffea685ed944/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L171]
>  snippet which removes all matching records which start with the 
> corresponding prefix which includes the source name, but does not include 
> metric name to be removed.
> {code:scala}
> def removeSource(source: Source) {
>   sources -= source
>   val regName = buildRegistryName(source)
>   registry.removeMatching((name: String, _: Metric) => 
> name.startsWith(regName))
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22320) ORC should support VectorUDT/MatrixUDT

2019-05-08 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-22320.
--
Resolution: Not A Problem

> ORC should support VectorUDT/MatrixUDT
> --
>
> Key: SPARK-22320
> URL: https://issues.apache.org/jira/browse/SPARK-22320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.0
>Reporter: zhengruifeng
>Priority: Minor
>
> I saved a dataframe containing vectors in ORC format; when I read it back, the 
> schema had changed.
> {code}
> scala> import org.apache.spark.ml.linalg._
> import org.apache.spark.ml.linalg._
> scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, 
> Array(4), Array(1.0
> data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), 
> (2,(8,[4],[1.0])))
> scala> val df = data.toDF("i", "vec")
> df: org.apache.spark.sql.DataFrame = [i: int, vec: vector]
> scala> df.schema
> res0: org.apache.spark.sql.types.StructType = 
> StructType(StructField(i,IntegerType,false), 
> StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
> scala> df.write.orc("/tmp/123")
> scala> val df2 = spark.sqlContext.read.orc("/tmp/123")
> df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct<type: tinyint, size: int ... 2 more fields>]
> scala> df2.schema
> res3: org.apache.spark.sql.types.StructType = 
> StructType(StructField(i,IntegerType,true), 
> StructField(vec,StructType(StructField(type,ByteType,true), 
> StructField(size,IntegerType,true), 
> StructField(indices,ArrayType(IntegerType,true),true), 
> StructField(values,ArrayType(DoubleType,true),true)),true))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7008) An implementation of Factorization Machine (LibFM)

2019-05-08 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-7008.
-
Resolution: Not A Problem

> An implementation of Factorization Machine (LibFM)
> --
>
> Key: SPARK-7008
> URL: https://issues.apache.org/jira/browse/SPARK-7008
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: zhengruifeng
>Priority: Major
>  Labels: features
> Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, 
> QQ20150421-2.png
>
>
> An implementation of Factorization Machines based on Scala and Spark MLlib.
> FM is a machine learning algorithm for multi-linear regression and is widely 
> used for recommendation.
> FM has performed well in recent recommendation competitions.
> Ref:
> http://libfm.org/
> http://doi.acm.org/10.1145/2168752.2168771
> http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16872) Include Gaussian Naive Bayes Classifier

2019-05-08 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-16872.
--
Resolution: Not A Problem

> Include Gaussian Naive Bayes Classifier
> ---
>
> Key: SPARK-16872
> URL: https://issues.apache.org/jira/browse/SPARK-16872
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
>
> I implemented Gaussian NB according to scikit-learn's {{GaussianNB}}.
> In the GaussianNB model, the {{theta}} matrix is used to store the means, and there is 
> an extra {{sigma}} matrix storing the variance of each feature.
> GaussianNB in spark
> {code}
> scala> import org.apache.spark.ml.classification.GaussianNaiveBayes
> import org.apache.spark.ml.classification.GaussianNaiveBayes
> scala> val path = 
> "/Users/zrf/.dev/spark-2.1.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt"
> path: String = 
> /Users/zrf/.dev/spark-2.1.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt
> scala> val data = spark.read.format("libsvm").load(path).persist()
> data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: 
> double, features: vector]
> scala> val gnb = new GaussianNaiveBayes()
> gnb: org.apache.spark.ml.classification.GaussianNaiveBayes = gnb_54c50467306c
> scala> val model = gnb.fit(data)
> 17/01/03 14:25:48 INFO Instrumentation: 
> GaussianNaiveBayes-gnb_54c50467306c-720112035-1: training: numPartitions=1 
> storageLevel=StorageLevel(1 replicas)
> 17/01/03 14:25:48 INFO Instrumentation: 
> GaussianNaiveBayes-gnb_54c50467306c-720112035-1: {}
> 17/01/03 14:25:49 INFO Instrumentation: 
> GaussianNaiveBayes-gnb_54c50467306c-720112035-1: {"numFeatures":4}
> 17/01/03 14:25:49 INFO Instrumentation: 
> GaussianNaiveBayes-gnb_54c50467306c-720112035-1: {"numClasses":3}
> 17/01/03 14:25:49 INFO Instrumentation: 
> GaussianNaiveBayes-gnb_54c50467306c-720112035-1: training finished
> model: org.apache.spark.ml.classification.GaussianNaiveBayesModel = 
> GaussianNaiveBayesModel (uid=gnb_54c50467306c) with 3 classes
> scala> model.pi
> res0: org.apache.spark.ml.linalg.Vector = 
> [-1.0986122886681098,-1.0986122886681098,-1.0986122886681098]
> scala> model.pi.toArray.map(math.exp)
> res1: Array[Double] = Array(0.3333333333333333, 0.3333333333333333, 
> 0.3333333333333333)
> scala> model.theta
> res2: org.apache.spark.ml.linalg.Matrix =
> 0.270067018001   -0.188540006  0.543050720001   0.60546
> -0.60779998  0.18172   -0.842711740006  
> -0.88139998
> -0.091425964 -0.35858001   0.105084738  
> 0.021666701507102017
> scala> model.sigma
> res3: org.apache.spark.ml.linalg.Matrix =
> 0.1223012510889361   0.07078051983960698  0.0343595243976   
> 0.051336071297393815
> 0.03758145300924998  0.09880280046403413  0.003390296940069426  
> 0.007822241779598893
> 0.08058763609659315  0.06701386661293329  0.024866409227781675  
> 0.02661391644759426
> scala> model.transform(data).select("probability").take(10)
> [rdd_68_0]
> res4: Array[org.apache.spark.sql.Row] = 
> Array([[1.0627410543476422E-21,0.9938,6.2765233965353945E-15]], 
> [[7.254521422345374E-26,1.0,1.3849442153180895E-18]], 
> [[1.9629244119173135E-24,0.9998,1.9424765181237926E-16]], 
> [[6.061218297948492E-22,0.9902,9.853216073401884E-15]], 
> [[0.9972225671942837,8.844241161578932E-165,0.002777432805716399]], 
> [[5.361683970373604E-26,1.0,2.3004604508982183E-18]], 
> [[0.01062850630038623,3.3102617689978775E-100,0.9893714936996136]], 
> [[1.9297314618271785E-4,2.124922209137708E-71,0.9998070268538172]], 
> [[3.118816393732361E-27,1.0,6.5310299615983584E-21]], 
> [[0.926009854522,8.734773657627494E-206,7.399014547943611E-6]])
> scala> model.transform(data).select("prediction").take(10)
> [rdd_68_0]
> res5: Array[org.apache.spark.sql.Row] = Array([1.0], [1.0], [1.0], [1.0], 
> [0.0], [1.0], [2.0], [2.0], [1.0], [0.0])
> {code}
> GaussianNB in scikit-learn
> {code}
> import numpy as np
> from sklearn.naive_bayes import GaussianNB
> from sklearn.datasets import load_svmlight_file
> path = 
> '/Users/zrf/.dev/spark-2.1.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt'
> X, y = load_svmlight_file(path)
> X = X.toarray()
> clf = GaussianNB()
> clf.fit(X, y)
> >>> clf.class_prior_
> array([ 0.33333333,  0.33333333,  0.33333333])
> >>> clf.theta_
> array([[ 0.2701, -0.1885,  0.54305072,  0.6055],
>[-0.6078,  0.1817, -0.84271174, -0.8814],
>[-0.0914, -0.3586,  0.10508474,  0.0216667 ]])
>
> >>> clf.sigma_
> array([[ 0.12230125,  0.07078052,  0.03430001,  0.05133607],
>[ 0.03758145,  0.0988028 ,  0.0033903 ,  0.00782224],
>  

[jira] [Resolved] (SPARK-13677) Support Tree-Based Feature Transformation for ML

2019-05-08 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-13677.
--
Resolution: Not A Problem

> Support Tree-Based Feature Transformation for ML
> 
>
> Key: SPARK-13677
> URL: https://issues.apache.org/jira/browse/SPARK-13677
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> It would be nice to be able to use RF and GBT for feature transformation:
> First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on 
> the training set. Then each leaf of each tree in the ensemble is assigned a 
> fixed arbitrary feature index in a new feature space. These leaf indices are 
> then encoded in a one-hot fashion.
> This method was first introduced by 
> Facebook (http://www.herbrich.me/papers/adclicksfacebook.pdf), and is 
> implemented in two well-known libraries:
> sklearn 
> (http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py)
> xgboost 
> (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py)
> I have implemented it in MLlib:
> val features : RDD[Vector] = ...
> val model1 : RandomForestModel = ...
> val transformed1 : RDD[Vector] = model1.leaf(features)
> val model2 : GradientBoostedTreesModel = ...
> val transformed2 : RDD[Vector] = model2.leaf(features)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13385) Enable AssociationRules to generate consequents with user-defined lengths

2019-05-08 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-13385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-13385.
--
Resolution: Not A Problem

> Enable AssociationRules to generate consequents with user-defined lengths
> -
>
> Key: SPARK-13385
> URL: https://issues.apache.org/jira/browse/SPARK-13385
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Attachments: rule-generation.pdf
>
>
> AssociationRules should generate all association rules with user-defined 
> iterations, not just rules which have a single item as the consequent.
> Such as:
> 39 804 ==> 413 743 819 #SUP: 1023 #CONF: 0.70117
> 39 743 ==> 413 804 819 #SUP: 1023 #CONF: 0.93939
> 39 413 ==> 743 804 819 #SUP: 1023 #CONF: 0.6007
> 819 ==> 39 413 743 804 #SUP: 1023 #CONF: 0.15418
> 804 ==> 39 413 743 819 #SUP: 1023 #CONF: 0.12997
> 743 ==> 39 413 804 819 #SUP: 1023 #CONF: 0.7276
> 39 ==> 413 743 804 819 #SUP: 1023 #CONF: 0.12874
> ...
> I have implemented it based on Apriori's Rule-Generation Algorithm:
> https://github.com/zhengruifeng/spark-rules
> It's compatible with fpm's APIs.
> import org.apache.spark.mllib.fpm._
> val data = sc.textFile("hdfs://ns1/whale/T40I10D100K.dat")
> val transactions = data.map(s => s.trim.split(' ')).persist()
> val fpg = new FPGrowth().setMinSupport(0.01)
> val model = fpg.run(transactions)
> val ar = new AprioriRules().setMinConfidence(0.1).setMaxConsequent(15)
> val results = ar.run(model.freqItemsets)
> and it outputs rule-generation information like this:
> 15/11/04 11:28:46 INFO AprioriRules: Candidates for 1-consequent rules : 
> 312917
> 15/11/04 11:28:58 INFO AprioriRules: Generated 1-consequent rules : 306703
> 15/11/04 11:29:10 INFO AprioriRules: Candidates for 2-consequent rules : 
> 707747
> 15/11/04 11:29:35 INFO AprioriRules: Generated 2-consequent rules : 704000
> 15/11/04 11:29:55 INFO AprioriRules: Candidates for 3-consequent rules : 
> 1020253
> 15/11/04 11:30:38 INFO AprioriRules: Generated 3-consequent rules : 1014002
> 15/11/04 11:31:14 INFO AprioriRules: Candidates for 4-consequent rules : 
> 972225
> 15/11/04 11:32:00 INFO AprioriRules: Generated 4-consequent rules : 956483
> 15/11/04 11:32:44 INFO AprioriRules: Candidates for 5-consequent rules : 
> 653749
> 15/11/04 11:33:32 INFO AprioriRules: Generated 5-consequent rules : 626993
> 15/11/04 11:34:07 INFO AprioriRules: Candidates for 6-consequent rules : 
> 331038
> 15/11/04 11:34:50 INFO AprioriRules: Generated 6-consequent rules : 314455
> 15/11/04 11:35:10 INFO AprioriRules: Candidates for 7-consequent rules : 
> 138490
> 15/11/04 11:35:43 INFO AprioriRules: Generated 7-consequent rules : 136260
> 15/11/04 11:35:57 INFO AprioriRules: Candidates for 8-consequent rules : 48567
> 15/11/04 11:36:14 INFO AprioriRules: Generated 8-consequent rules : 47331
> 15/11/04 11:36:24 INFO AprioriRules: Candidates for 9-consequent rules : 12430
> 15/11/04 11:36:33 INFO AprioriRules: Generated 9-consequent rules : 11925
> 15/11/04 11:36:37 INFO AprioriRules: Candidates for 10-consequent rules : 2211
> 15/11/04 11:36:47 INFO AprioriRules: Generated 10-consequent rules : 2064
> 15/11/04 11:36:55 INFO AprioriRules: Candidates for 11-consequent rules : 246
> 15/11/04 11:36:58 INFO AprioriRules: Generated 11-consequent rules : 219
> 15/11/04 11:37:00 INFO AprioriRules: Candidates for 12-consequent rules : 13
> 15/11/04 11:37:03 INFO AprioriRules: Generated 12-consequent rules : 11
> 15/11/04 11:37:03 INFO AprioriRules: Candidates for 13-consequent rules : 0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14174) Implement the Mini-Batch KMeans

2019-05-08 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-14174.
--
Resolution: Not A Problem

> Implement the Mini-Batch KMeans
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Major
> Attachments: MBKM.xlsx
>
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses 
> mini-batches to reduce the computation time, while still attempting to 
> optimise the same objective function. Mini-batches are subsets of the input 
> data, randomly sampled in each training iteration. These mini-batches 
> drastically reduce the amount of computation required to converge to a local 
> solution. In contrast to other algorithms that reduce the convergence time of 
> k-means, mini-batch k-means produces results that are generally only slightly 
> worse than the standard algorithm.
> Comparison of the K-Means and MiniBatchKMeans on sklearn : 
> http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py
> Since MiniBatchKMeans with fraction=1.0 is not equivalent to KMeans, I made it 
> a new estimator.
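
For illustration only, a rough sketch of the per-batch center update used by mini-batch k-means (after Sculley's web-scale k-means formulation); this is not the proposed Spark estimator, and the helper names are assumptions:

{code:scala}
import scala.util.Random

// squared Euclidean distance between two points
def squaredDistance(a: Array[Double], b: Array[Double]): Double =
  a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

// one mini-batch update: sample a batch, assign points to their nearest
// center, and move each touched center with a per-center learning rate.
def miniBatchStep(
    data: Array[Array[Double]],
    centers: Array[Array[Double]],
    counts: Array[Long],
    batchSize: Int): Unit = {
  val batch = Array.fill(batchSize)(data(Random.nextInt(data.length)))
  val assigned = batch.map { x =>
    (centers.indices.minBy(i => squaredDistance(x, centers(i))), x)
  }
  assigned.foreach { case (i, x) =>
    counts(i) += 1
    val eta = 1.0 / counts(i)          // per-center learning rate
    var j = 0
    while (j < x.length) {
      centers(i)(j) = (1 - eta) * centers(i)(j) + eta * x(j)
      j += 1
    }
  }
}
{code}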



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2019-05-08 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-19208.
--
Resolution: Not A Problem

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Major
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However, {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. This slows down the task; moreover, it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748,401 instances and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.
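> A minimal sketch of the idea behind the PR: a max-abs-only summarizer keeps a 
> single array instead of the eight above (illustrative names, not the PR's API):
> {code}
> class MaxAbsSummarizer(numFeatures: Int) extends Serializable {
>   private val currMaxAbs = Array.fill(numFeatures)(0.0)
> 
>   // Update with one dense row of feature values.
>   def add(values: Array[Double]): this.type = {
>     var i = 0
>     while (i < numFeatures) {
>       val v = math.abs(values(i))
>       if (v > currMaxAbs(i)) currMaxAbs(i) = v
>       i += 1
>     }
>     this
>   }
> 
>   // Merge the partial result from another partition (e.g. inside treeAggregate).
>   def merge(other: MaxAbsSummarizer): this.type = {
>     var i = 0
>     while (i < numFeatures) {
>       if (other.currMaxAbs(i) > currMaxAbs(i)) currMaxAbs(i) = other.currMaxAbs(i)
>       i += 1
>     }
>     this
>   }
> 
>   def maxAbs: Array[Double] = currMaxAbs.clone()
> }
> {code}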



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18518) HasSolver should support allowed values

2019-05-08 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-18518.
--
Resolution: Not A Problem

> HasSolver should support allowed values
> ---
>
> Key: SPARK-18518
> URL: https://issues.apache.org/jira/browse/SPARK-18518
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
>
> {{HasSolver}} currently doesn't support value validation, so it's not easy to use:
> GLR and LiR inherit HasSolver, but need to do param validation explicitly 
> like:
> {code}
> require(Set("auto", "l-bfgs", "normal").contains(value),
>   s"Solver $value was not supported. Supported options: auto, l-bfgs, 
> normal")
> set(solver, value)
> {code}
> MLPC doesn't even inherit {{HasSolver}}, and creates its solver param with 
> supportedSolvers:
> {code}
>   final val solver: Param[String] = new Param[String](this, "solver",
> "The solver algorithm for optimization. Supported options: " +
>   s"${MultilayerPerceptronClassifier.supportedSolvers.mkString(", ")}. 
> (Default l-bfgs)",
> 
> ParamValidators.inArray[String](MultilayerPerceptronClassifier.supportedSolvers))
> {code}
> It may be reasonable to modify {{HasSolver}} after 2.1
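> One possible shape for a validated shared solver param, assuming subclasses 
> declare their allowed values (a sketch only, not the final API):
> {code}
> import org.apache.spark.ml.param.{Param, Params, ParamValidators}
> 
> trait HasValidatedSolver extends Params {
>   // Subclasses implement this as a def, e.g. Array("auto", "l-bfgs", "normal").
>   def supportedSolvers: Array[String]
> 
>   final val solver: Param[String] = new Param[String](this, "solver",
>     s"The solver algorithm for optimization. Supported options: ${supportedSolvers.mkString(", ")}.",
>     ParamValidators.inArray[String](supportedSolvers))
> 
>   final def getSolver: String = $(solver)
> }
> {code}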



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20711) MultivariateOnlineSummarizer/Summarizer incorrect min/max for NaN value

2019-05-08 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-20711.
--
Resolution: Won't Fix

> MultivariateOnlineSummarizer/Summarizer incorrect min/max for NaN value
> ---
>
> Key: SPARK-20711
> URL: https://issues.apache.org/jira/browse/SPARK-20711
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Priority: Minor
>
> {code}
> scala> val summarizer = new MultivariateOnlineSummarizer()
> summarizer: org.apache.spark.mllib.stat.MultivariateOnlineSummarizer = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.add(Vectors.dense(Double.NaN, -10.0))
> res20: summarizer.type = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.add(Vectors.dense(Double.NaN, 2.0))
> res21: summarizer.type = 
> org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@2ac58d
> scala> summarizer.min
> res22: org.apache.spark.mllib.linalg.Vector = [1.7976931348623157E308,-10.0]
> scala> summarizer.max
> res23: org.apache.spark.mllib.linalg.Vector = [-1.7976931348623157E308,2.0]
> {code}
> For a feature only containing {{Double.NaN}}, the returned max is 
> {{Double.MinValue}} and the min is {{Double.MaxValue}}.
> {code}
> import org.apache.spark.ml.stat._
> val df = spark.createDataFrame(Seq(
>   (1, 2.3, Vectors.dense(Double.NaN, 0.0, 0.0)),
>   (2, 6.7, Vectors.dense(Double.NaN, 0.2, 0.0))
> )).toDF("id", "num", "features")
> df.select(Summarizer.metrics("mean").summary(col("features"))).head
> res2: org.apache.spark.sql.Row = [[WrappedArray(NaN, 0.1, 0.0)]]
> df.select(Summarizer.metrics("min").summary(col("features"))).head
> res3: org.apache.spark.sql.Row = [[WrappedArray(1.7976931348623157E308, 0.0, 
> 0.0)]]
> {code}
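> A sketch of a NaN-aware update that would propagate NaN instead of silently 
> comparing it against +/-Double.MaxValue (illustrative only; this is not what 
> the summarizer currently does):
> {code}
> def updateMinMax(currMin: Array[Double], currMax: Array[Double], values: Array[Double]): Unit = {
>   var i = 0
>   while (i < values.length) {
>     val v = values(i)
>     if (v.isNaN) {
>       // Mark the feature as undefined; once NaN, it stays NaN.
>       currMin(i) = Double.NaN
>       currMax(i) = Double.NaN
>     } else {
>       if (v < currMin(i)) currMin(i) = v
>       if (v > currMax(i)) currMax(i) = v
>     }
>     i += 1
>   }
> }
> {code}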



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21879) Should Scalers handle NaN values?

2019-05-08 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-21879.
--
Resolution: Won't Fix

> Should Scalers handle NaN values?
> -
>
> Key: SPARK-21879
> URL: https://issues.apache.org/jira/browse/SPARK-21879
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Priority: Major
>
> The way {{ML.Scalers}} handle {{NaN}} is somewhat unexpected. The current impls 
> of {{MinMaxScaler}}/{{MaxAbsScaler}}/{{StandardScaler}} all support {{fit}} 
> and {{transform}} on a dataset containing {{NaN}}.
> Note that values in the second column in the following dataframe are all 
> {{NaN}}, and the coefficients of {{min/max}} in {{MinMaxScalerModel}} and 
> {{maxAbs}} in {{MaxAbsScaler}} are wrong.
> {code}
> import org.apache.spark.ml.feature._
> import org.apache.spark.ml.linalg.{Vector, Vectors}
> scala> val data = Array(
>  |   Vectors.dense(1, Double.NaN, Double.NaN, 2.0),
>  |   Vectors.dense(2, Double.NaN, 0.0, 3.0),
>  |   Vectors.dense(3, Double.NaN, 0.0, 1.0),
>  |   Vectors.dense(6, Double.NaN, 2.0, Double.NaN)).zipWithIndex
> data: Array[(org.apache.spark.ml.linalg.Vector, Int)] = 
> Array(([1.0,NaN,NaN,2.0],0), ([2.0,NaN,0.0,3.0],1), ([3.0,NaN,0.0,1.0],2), 
> ([6.0,NaN,2.0,NaN],3))
> scala> val df = data.toSeq.toDF("features", "id")
> df: org.apache.spark.sql.DataFrame = [features: vector, id: int]
> scala> val scaler = new 
> MinMaxScaler().setInputCol("features").setOutputCol("scaled")
> scaler: org.apache.spark.ml.feature.MinMaxScaler = minMaxScal_7634802f5c81
>   
> scala> val model = scaler.fit(df)
> model: org.apache.spark.ml.feature.MinMaxScalerModel = minMaxScal_7634802f5c81
> scala> model.originalMax
> res1: org.apache.spark.ml.linalg.Vector = 
> [6.0,-1.7976931348623157E308,2.0,3.0]
> scala> model.originalMin
> res2: org.apache.spark.ml.linalg.Vector = [1.0,1.7976931348623157E308,0.0,1.0]
> scala> model.transform(df).select("scaled").collect
> res3: Array[org.apache.spark.sql.Row] = Array([[0.0,NaN,NaN,0.5]], 
> [[0.2,NaN,0.0,1.0]], [[0.4,NaN,0.0,0.0]], [[1.0,NaN,1.0,NaN]])
> scala> val scaler2 = new 
> MaxAbsScaler().setInputCol("features").setOutputCol("scaled")
> scaler2: org.apache.spark.ml.feature.MaxAbsScaler = maxAbsScal_5d34fa818229
> scala> val model2 = scaler2.fit(df)
> model2: org.apache.spark.ml.feature.MaxAbsScalerModel = 
> maxAbsScal_5d34fa818229
> scala> model2.maxAbs
> res4: org.apache.spark.ml.linalg.Vector = [6.0,1.7976931348623157E308,2.0,3.0]
> scala> model2.transform(df).select("scaled").collect
> res5: Array[org.apache.spark.sql.Row] = 
> Array([[0.1,NaN,NaN,0.]], 
> [[0.,NaN,0.0,1.0]], [[0.5,NaN,0.0,0.]], 
> [[1.0,NaN,1.0,NaN]])
> scala> val scaler3 = new 
> StandardScaler().setInputCol("features").setOutputCol("scaled")
> scaler3: org.apache.spark.ml.feature.StandardScaler = stdScal_d8509095e860
> scala> val model3 = scaler3.fit(df)
> model3: org.apache.spark.ml.feature.StandardScalerModel = stdScal_d8509095e860
> scala> model3.std
> res11: org.apache.spark.ml.linalg.Vector = [2.160246899469287,NaN,NaN,NaN]
> scala> model3.mean
> res12: org.apache.spark.ml.linalg.Vector = [3.0,NaN,NaN,NaN]
> scala> model3.transform(df).select("scaled").collect
> res14: Array[org.apache.spark.sql.Row] = 
> Array([[0.4629100498862757,NaN,NaN,NaN]], [[0.9258200997725514,NaN,NaN,NaN]], 
> [[1.3887301496588271,NaN,NaN,NaN]], [[2.7774602993176543,NaN,NaN,NaN]])
> {code}
> I then tested the scalers in scikit-learn, and they all throw exceptions in 
> both {{fit}} and {{transform}}.
> {code}
> import numpy as np
> from sklearn.preprocessing import *
> data = np.array([[-1, 2], [-0.5, 6], [0, np.nan], [1, 1.8]])
> data2 = np.array([[-1, 2], [-0.5, 6], [0, 2.0], [1, 1.8]])
> for scaler in [StandardScaler(), MinMaxScaler(), MaxAbsScaler(), 
> RobustScaler()]:
> try:
> scaler.fit(data)
> except:
> print('{0}.fit fails'.format(scaler))
> model = scaler.fit(data2)
> try:
> model.transform(data)
> except:
> print('{0}.transform fails'.format(scaler))
> 
> StandardScaler(copy=True, with_mean=True, with_std=True).fit fails
> StandardScaler(copy=True, with_mean=True, with_std=True).transform fails
> MinMaxScaler(copy=True, feature_range=(0, 1)).fit fails
> MinMaxScaler(copy=True, feature_range=(0, 1)).transform fails
> MaxAbsScaler(copy=True).fit fails
> MaxAbsScaler(copy=True).transform fails
> RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
>with_scaling=True).fit fails
> RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
>with_scaling=True).transform fails
> {code}
> I think the 

[jira] [Resolved] (SPARK-18757) Models in Pyspark support column setters

2019-05-08 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-18757.
--
Resolution: Not A Problem

> Models in Pyspark support column setters
> 
>
> Key: SPARK-18757
> URL: https://issues.apache.org/jira/browse/SPARK-18757
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Priority: Major
>
> Recently, I found three places in which column setters are missing: 
> KMeansModel, BisectingKMeansModel and OneVsRestModel.
> These three models directly inherit `Model`, which doesn't have column setters, 
> so I had to add the missing setters manually in [SPARK-18625] and 
> [SPARK-18520].
> For now, models in PySpark still don't support column setters at all.
> I suggest that we keep the hierarchy of PySpark models in line with that on 
> the Scala side:
> For classification and regression algs, I'm making a trial in [SPARK-18739]. 
> In it, I try to copy the hierarchy from the Scala side.
> For clustering algs, I think we may first create abstract classes 
> {{ClusteringModel}} and {{ProbabilisticClusteringModel}} on the Scala side, 
> and make clustering algs inherit them. Then, on the Python side, we copy the 
> hierarchy so that we don't need to add setters manually for each alg.
> For feature algs, we can also use an abstract class {{FeatureModel}} on the 
> Scala side and do the same thing.
> What are your opinions? [~yanboliang][~josephkb][~sethah][~srowen]
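> A rough sketch of the Scala-side abstract class described above, assuming it 
> lives inside Spark ML where the shared param traits are visible (names and 
> shape are illustrative only):
> {code}
> import org.apache.spark.ml.Model
> import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
> 
> abstract class ClusteringModel[M <: ClusteringModel[M]]
>   extends Model[M] with HasFeaturesCol with HasPredictionCol {
> 
>   // Column setters shared by all clustering models, so the Python side can
>   // mirror a single hierarchy instead of adding setters per model.
>   def setFeaturesCol(value: String): M = set(featuresCol, value).asInstanceOf[M]
> 
>   def setPredictionCol(value: String): M = set(predictionCol, value).asInstanceOf[M]
> }
> {code}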



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24664) Column support name getter

2019-05-08 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-24664.
--
Resolution: Not A Problem

> Column support name getter
> --
>
> Key: SPARK-24664
> URL: https://issues.apache.org/jira/browse/SPARK-24664
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: zhengruifeng
>Priority: Minor
>
> In SPARK-24557 (https://github.com/apache/spark/pull/21563), we found that it 
> would be convenient if {{Column}} supported a name getter.
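> For context, such a getter could be derived from the column's underlying 
> expression roughly like this (a sketch relying on catalyst internals, not an 
> agreed API):
> {code}
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.catalyst.expressions.NamedExpression
> 
> def columnName(col: Column): String = col.expr match {
>   case ne: NamedExpression => ne.name   // plain column references and aliases
>   case e => e.sql                       // fall back to the expression's SQL text
> }
> {code}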



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23805) support vector-size validation and Inference

2019-05-08 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-23805.
--
Resolution: Not A Problem

> support vector-size validation and Inference
> 
>
> Key: SPARK-23805
> URL: https://issues.apache.org/jira/browse/SPARK-23805
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: zhengruifeng
>Priority: Major
>
> I think it may be meaningful to unify the usage of {{AttributeGroup}} and 
> support vector-size validation and inference in algs.
> My thoughts are:
>  * In {{transformSchema}}, validate the input vector-size if possible. If 
> the input vector-size can be obtained from the schema, check it.
>  ** Suppose a {{PCA}} estimator with k=4: {{transformSchema}} will 
> require the vector-size to be no less than 4.
>  ** Suppose a {{PCAModel}} trained with vectors of length 10: 
> {{transformSchema}} will require the vector-size to be 10.
>  * In {{transformSchema}}, infer the output vector-size if possible.
>  ** Suppose a {{PCA}} estimator with k=4: {{transformSchema}} will 
> return a schema with output vector-size=4.
>  ** Suppose a {{PCAModel}} trained with k=4: {{transformSchema}} will 
> return a schema with output vector-size=4.
>  * In {{transform}}, infer the output vector-size if possible.
>  * In {{fit}}, obtain the input vector-size from the schema if possible. This 
> can help eliminate redundant {{first}} jobs.
>  
> The current PR only modifies {{PCA}} and {{MaxAbsScaler}} to illustrate my 
> idea (a rough sketch of the {{transformSchema}} part is shown below). Since 
> the validation and inference are quite alg-specific, we may need to separate 
> the task into several small subtasks.
> What do you think about this? [~srowen] [~yanboliang] [~WeichenXu123] [~mlnick]
>  
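> A rough sketch of the {{transformSchema}} part for a PCA-like estimator, using 
> {{AttributeGroup}} to read and write the vector size in the schema metadata 
> (illustrative; the actual checks are alg-specific):
> {code}
> import org.apache.spark.ml.attribute.AttributeGroup
> import org.apache.spark.sql.types.StructType
> 
> def transformSchemaSketch(schema: StructType, inputCol: String, outputCol: String, k: Int): StructType = {
>   // Validation: if the input vector size is recorded in the metadata, check it.
>   val inGroup = AttributeGroup.fromStructField(schema(inputCol))
>   if (inGroup.size >= 0) {
>     require(inGroup.size >= k, s"Input vector size ${inGroup.size} must be no less than k=$k")
>   }
>   // Inference: a PCA-like transformer outputs vectors of size k.
>   val outGroup = new AttributeGroup(outputCol, k)
>   StructType(schema.fields :+ outGroup.toStructField())
> }
> {code}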



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27656) Safely register class for GraphX

2019-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27656:


Assignee: (was: Apache Spark)

> Safely register class for GraphX
> 
>
> Key: SPARK-27656
> URL: https://issues.apache.org/jira/browse/SPARK-27656
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 2.4.3
>Reporter: zhengruifeng
>Priority: Major
>
> GraphX common classes (such as Edge and EdgeTriplet) are not registered in Kryo 
> by default.
> Users can register those classes via 
> {{GraphXUtils.registerKryoClasses}}; however, it seems 
> that no graphx-lib impls call it, and users tend to ignore this 
> registration.
> So I prefer to safely register them in {{KryoSerializer.scala}}, like what 
> SQL and ML do.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27656) Safely register class for GraphX

2019-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27656:


Assignee: Apache Spark

> Safely register class for GraphX
> 
>
> Key: SPARK-27656
> URL: https://issues.apache.org/jira/browse/SPARK-27656
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 2.4.3
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
>
> GraphX common classes (such as Edge and EdgeTriplet) are not registered in Kryo 
> by default.
> Users can register those classes via 
> {{GraphXUtils.registerKryoClasses}}; however, it seems 
> that no graphx-lib impls call it, and users tend to ignore this 
> registration.
> So I prefer to safely register them in {{KryoSerializer.scala}}, like what 
> SQL and ML do.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27656) Safely register class for GraphX

2019-05-08 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-27656:


 Summary: Safely register class for GraphX
 Key: SPARK-27656
 URL: https://issues.apache.org/jira/browse/SPARK-27656
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 2.4.3
Reporter: zhengruifeng


GraphX common classes (such as Edge and EdgeTriplet) are not registered in Kryo 
by default.

Users can register those classes via 
{{GraphXUtils.registerKryoClasses}}; however, it seems 
that no graphx-lib impls call it, and users tend to ignore this registration.

So I prefer to safely register them in {{KryoSerializer.scala}}, like what 
SQL and ML do.
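
For reference, this is the registration users currently have to remember to do 
themselves before creating the SparkContext (the app name below is a placeholder); 
the proposal would make it unnecessary by registering these classes inside 
KryoSerializer.scala when GraphX is on the classpath:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.graphx.GraphXUtils

val conf = new SparkConf()
  .setAppName("graphx-app")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

// Registers Edge, EdgeTriplet and the other GraphX message classes with Kryo.
GraphXUtils.registerKryoClasses(conf)
{code}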

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27600) Unable to start Spark Hive Thrift Server when multiple hive servers share the same metastore

2019-05-08 Thread pin_zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835437#comment-16835437
 ] 

pin_zhang edited comment on SPARK-27600 at 5/8/19 9:02 AM:
---

[~hyukjin.kwon] I think this is related to a Hive bug: 
https://issues.apache.org/jira/browse/HIVE-6113

It says "The exception appears when there are several processes working with 
Hive concurrently." Hive's fix was to upgrade the third-party DataNucleus.

Is it a Spark bug if Spark uses Hive 1.2.1?

 


was (Author: pin_zhang):
I think this is relate to a hive bug 
https://issues.apache.org/jira/browse/HIVE-6113

It shows "The exception appears when there are several processes working with 
Hive concurrently." In hive's fix upgrade third-party datanucleus.

Is it a spark's bug if spark use the hive 1.2.1?

 

> Unable to start Spark Hive Thrift Server when multiple hive servers 
> share the same metastore
> --
>
> Key: SPARK-27600
> URL: https://issues.apache.org/jira/browse/SPARK-27600
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: pin_zhang
>Priority: Major
>
> When starting ten or more Spark Hive thrift servers at the same time, more than 
> one version is saved to table VERSION when the following exception is hit: WARN 
> [DataNucleus.Query] (main:) Query for candidates of 
> org.apache.hadoop.hive.metastore.model.MVersionTable and subclasses resulted 
> in no possible candidates
> Exception thrown obtaining schema column information from datastore
> org.datanucleus.exceptions.NucleusDataStoreException: Exception thrown 
> obtaining schema column information from datastore
> Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Table 
> 'via_ms.deleteme1556239494724' doesn't exist
>  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>  at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>  at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
>  at com.mysql.jdbc.Util.getInstance(Util.java:408)
>  at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:944)
>  at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3978)
>  at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3914)
>  at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2530)
>  at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2683)
>  at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2491)
>  at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2449)
>  at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1381)
>  at com.mysql.jdbc.DatabaseMetaData$2.forEach(DatabaseMetaData.java:2441)
>  at com.mysql.jdbc.DatabaseMetaData$2.forEach(DatabaseMetaData.java:2339)
>  at com.mysql.jdbc.IterateBlock.doForAll(IterateBlock.java:50)
>  at com.mysql.jdbc.DatabaseMetaData.getColumns(DatabaseMetaData.java:2337)
>  at 
> org.apache.commons.dbcp.DelegatingDatabaseMetaData.getColumns(DelegatingDatabaseMetaData.java:218)
>  at 
> org.datanucleus.store.rdbms.adapter.BaseDatastoreAdapter.getColumns(BaseDatastoreAdapter.java:1532)
>  at 
> org.datanucleus.store.rdbms.schema.RDBMSSchemaHandler.refreshTableData(RDBMSSchemaHandler.java:921)
> Then the hive server cannot be started any more because of 
> MetaException(message:Metastore contains multiple versions (2) 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27600) Unable to start Spark Hive Thrift Server when multiple hive servers share the same metastore

2019-05-08 Thread pin_zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835437#comment-16835437
 ] 

pin_zhang commented on SPARK-27600:
---

I think this is related to a Hive bug: 
https://issues.apache.org/jira/browse/HIVE-6113

It says "The exception appears when there are several processes working with 
Hive concurrently." Hive's fix was to upgrade the third-party DataNucleus.

Is it a Spark bug if Spark uses Hive 1.2.1?

 

> Unable to start Spark Hive Thrift Server when multiple hive servers 
> share the same metastore
> --
>
> Key: SPARK-27600
> URL: https://issues.apache.org/jira/browse/SPARK-27600
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: pin_zhang
>Priority: Major
>
> When starting ten or more Spark Hive thrift servers at the same time, more than 
> one version is saved to table VERSION when the following exception is hit: WARN 
> [DataNucleus.Query] (main:) Query for candidates of 
> org.apache.hadoop.hive.metastore.model.MVersionTable and subclasses resulted 
> in no possible candidates
> Exception thrown obtaining schema column information from datastore
> org.datanucleus.exceptions.NucleusDataStoreException: Exception thrown 
> obtaining schema column information from datastore
> Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Table 
> 'via_ms.deleteme1556239494724' doesn't exist
>  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>  at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>  at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
>  at com.mysql.jdbc.Util.getInstance(Util.java:408)
>  at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:944)
>  at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3978)
>  at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3914)
>  at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2530)
>  at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2683)
>  at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2491)
>  at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2449)
>  at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1381)
>  at com.mysql.jdbc.DatabaseMetaData$2.forEach(DatabaseMetaData.java:2441)
>  at com.mysql.jdbc.DatabaseMetaData$2.forEach(DatabaseMetaData.java:2339)
>  at com.mysql.jdbc.IterateBlock.doForAll(IterateBlock.java:50)
>  at com.mysql.jdbc.DatabaseMetaData.getColumns(DatabaseMetaData.java:2337)
>  at 
> org.apache.commons.dbcp.DelegatingDatabaseMetaData.getColumns(DelegatingDatabaseMetaData.java:218)
>  at 
> org.datanucleus.store.rdbms.adapter.BaseDatastoreAdapter.getColumns(BaseDatastoreAdapter.java:1532)
>  at 
> org.datanucleus.store.rdbms.schema.RDBMSSchemaHandler.refreshTableData(RDBMSSchemaHandler.java:921)
> Then the hive server cannot be started any more because of 
> MetaException(message:Metastore contains multiple versions (2) 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27649) Unify the way you use 'spark.network.timeout'

2019-05-08 Thread Gabor Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi resolved SPARK-27649.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24545
https://github.com/apache/spark/pull/24545

> Unify the way you use 'spark.network.timeout'
> -
>
> Key: SPARK-27649
> URL: https://issues.apache.org/jira/browse/SPARK-27649
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.4.0
>Reporter: jiaan.geng
>Priority: Minor
> Fix For: 3.0.0
>
>
> For historical reasons, Structured Streaming still reads the raw string key
> {code:java}
> spark.network.timeout{code}
> in some places, even though 
> {code:java}
> org.apache.spark.internal.config.Network.NETWORK_TIMEOUT{code}
> is now available.
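> For example, inside Spark's own modules the typed config entry can replace the 
> raw string key (a sketch; the entry is internal to Spark):
> {code}
> import org.apache.spark.SparkConf
> import org.apache.spark.internal.config.Network
> 
> val conf = new SparkConf()
> // Old way: raw string key with manual time parsing.
> val oldTimeout = conf.getTimeAsSeconds("spark.network.timeout", "120s")
> // New way: typed entry with the default baked in.
> val newTimeout: Long = conf.get(Network.NETWORK_TIMEOUT)
> {code}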



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27622) Avoid the network when block manager fetches disk persisted RDD blocks from the same host

2019-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27622:


Assignee: (was: Apache Spark)

> Avoid the network when block manager fetches disk persisted RDD blocks from 
> the same host
> -
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27622) Avoid the network when block manager fetches disk persisted RDD blocks from the same host

2019-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27622:


Assignee: Apache Spark

> Avoid the network when block manager fetches disk persisted RDD blocks from 
> the same host
> -
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Apache Spark
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27648) In Spark 2.4 Structured Streaming: the executor storage memory increases over time

2019-05-08 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835420#comment-16835420
 ] 

Gabor Somogyi commented on SPARK-27648:
---

[~yy3b2007com] you've mentioned this: {quote}aggregate the stream data 
sources{quote}
Are you using stateful processing? In that case, state eviction is a key part.
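
For stateful aggregations, state eviction typically requires an event-time 
watermark, roughly along these lines (topic, server and column names below are 
placeholders):
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("watermark-sketch").getOrCreate()
import spark.implicits._

val counts = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "input-topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS value", "timestamp AS eventTime")
  .withWatermark("eventTime", "10 minutes")   // lets Spark drop state older than the watermark
  .groupBy(window($"eventTime", "5 minutes"), $"value")
  .count()
{code}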

> In Spark 2.4 Structured Streaming: the executor storage memory increases over 
> time
> -
>
> Key: SPARK-27648
> URL: https://issues.apache.org/jira/browse/SPARK-27648
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: tommy duan
>Priority: Major
>
> *Spark Program Code Business:*
>  Read a topic from Kafka, aggregate the stream data, and then output 
> it to another Kafka topic.
> *Problem Description:*
>  *1) Using Spark Structured Streaming in a CDH environment (Spark 2.2)*, memory 
> overflow problems often occur (because too many versions of state are stored 
> in memory; this bug has been fixed in Spark 2.4).
> {code:java}
> /spark-submit \
> --conf “spark.yarn.executor.memoryOverhead=4096M”
> --num-executors 15 \
> --executor-memory 3G \
> --executor-cores 2 \
> --driver-memory 6G \
> {code}
> Executor memory exceptions occur when running with this submit resource under 
> Spark 2.2, and the normal running time does not exceed one day.
> The solution is to set the executor memory larger than before. 
> My spark-submit script is as follows:
> {code:java}
> /spark-submit \
> --conf "spark.yarn.executor.memoryOverhead=4096M"
> --num-executors 15 \
> --executor-memory 46G \
> --executor-cores 3 \
> --driver-memory 6G \
> ...{code}
> In this case, the spark program can be guaranteed to run stably for a long 
> time, and the executor storage memory is less than 10M (it has been running 
> stably for more than 20 days).
> *2) From the upgrade notes of Spark 2.4, we can see that the problem of 
> large memory consumption of state storage has been solved in Spark 2.4.* 
>  So we upgraded to Spark 2.4 under CDH, ran the Spark program, and found that 
> memory usage was reduced.
>  But a problem arises: as the running time increases, the executor storage 
> memory keeps growing (see Executors -> Storage Memory in the Spark on YARN 
> Resource Manager UI).
>  This program has been running for 14 days (under Spark 2.2, running with 
> this submit resource, the normal running time was no more than one day before 
> executor memory abnormalities occurred).
>  The script submitted by the program under Spark 2.4 is as follows:
> {code:java}
> /spark-submit \
>  --conf “spark.yarn.executor.memoryOverhead=4096M”
>  --num-executors 15 \
>  --executor-memory 3G \
>  --executor-cores 2 \
>  --driver-memory 6G 
> {code}
> Under Spark 2.4, I tracked the executor storage memory over time while the 
> Spark program was running:
> |Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
> |23.5H|41.6MB/1.5GB|1.770212766|
> |108.4H|460.2MB/1.5GB|4.245387454|
> |131.7H|559.1MB/1.5GB|4.245254366|
> |135.4H|575MB/1.5GB|4.246676514|
> |153.6H|641.2MB/1.5GB|4.174479167|
> |219H|888.1MB/1.5GB|4.055251142|
> |263H|1126.4MB/1.5GB|4.282889734|
> |309H|1228.8MB/1.5GB|3.976699029|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release

2019-05-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835339#comment-16835339
 ] 

Apache Spark commented on SPARK-18406:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/24552

> Race between end-of-task and completion iterator read lock release
> --
>
> Key: SPARK-18406
> URL: https://issues.apache.org/jira/browse/SPARK-18406
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Josh Rosen
>Assignee: Xingbo Jiang
>Priority: Major
> Fix For: 2.0.3, 2.1.2, 2.2.0
>
>
> The following log comes from a production streaming job where executors 
> periodically die due to uncaught exceptions during block release:
> {code}
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7921
> 16/11/07 17:11:06 INFO Executor: Running task 0.0 in stage 2390.0 (TID 7921)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7922
> 16/11/07 17:11:06 INFO Executor: Running task 1.0 in stage 2390.0 (TID 7922)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7923
> 16/11/07 17:11:06 INFO Executor: Running task 2.0 in stage 2390.0 (TID 7923)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Started reading broadcast variable 
> 2721
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7924
> 16/11/07 17:11:06 INFO Executor: Running task 3.0 in stage 2390.0 (TID 7924)
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721_piece0 stored as 
> bytes in memory (estimated size 5.0 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Reading broadcast variable 2721 took 
> 3 ms
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721 stored as values in 
> memory (estimated size 9.4 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_3 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_2 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_4 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 2, boot = -566, init = 
> 567, finish = 1
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 7, boot = -540, init = 
> 541, finish = 6
> 16/11/07 17:11:06 INFO Executor: Finished task 2.0 in stage 2390.0 (TID 
> 7923). 1429 bytes result sent to driver
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 8, boot = -532, init = 
> 533, finish = 7
> 16/11/07 17:11:06 INFO Executor: Finished task 3.0 in stage 2390.0 (TID 
> 7924). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Executor: Exception in task 0.0 in stage 2390.0 (TID 
> 7921)
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
>   at 
> org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7925
> 16/11/07 17:11:06 INFO Executor: Running task 0.1 in stage 2390.0 (TID 7925)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 41, boot = -536, init = 
> 576, finish = 1
> 16/11/07 17:11:06 INFO Executor: Finished task 1.0 in stage 2390.0 (TID 
> 7922). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Utils: Uncaught 

[jira] [Resolved] (SPARK-27642) make v1 offset extends v2 offset

2019-05-08 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-27642.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> make v1 offset extends v2 offset
> 
>
> Key: SPARK-27642
> URL: https://issues.apache.org/jira/browse/SPARK-27642
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org