[jira] [Updated] (SPARK-24343) Avoid shuffle for the bucketed table when shuffle.partition > bucket number

2018-05-22 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-24343:
--
Description: 
When shuffle.partition > bucket number, Spark needs to shuffle the bucketed table 
according to shuffle.partition. Can we avoid this?

See the example below:
{code:java}
CREATE TABLE dev
USING PARQUET
AS SELECT ws_item_sk, i_item_sk
FROM web_sales_bucketed
JOIN item ON ws_item_sk = i_item_sk;{code}
Suppose web_sales_bucketed is bucketed into 4 buckets and shuffle.partition is 
set to 10.

Currently, both tables are shuffled into 10 partitions.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(5) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(2) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ws_item_sk#2, 10)
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(4) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 10)
+- *(3) Project [i_item_sk#6]
+- *(3) Filter isnotnull(i_item_sk#6)
+- *(3) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...
{code}
A better plan would avoid the shuffle on the bucketed table.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(4) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(1) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(3) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 4)
+- *(2) Project [i_item_sk#6]
+- *(2) Filter isnotnull(i_item_sk#6)
+- *(2) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...{code}
This problem could be worse when adaptive execution is enabled, because it 
usually prefers a large shuffle.partition.
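
A minimal way to reproduce this setup, as a sketch only: the DDL of 
web_sales_bucketed is not shown in this report, so the bucket spec below is 
assumed, and web_sales/item are taken to be the TPC-DS source tables.
{code:scala}
// Sketch of the setup described above; the bucket spec for web_sales_bucketed
// (4 buckets on the join key) is assumed, since the original DDL is not shown here.
spark.conf.set("spark.sql.shuffle.partitions", "10")   // "shuffle.partition" above

spark.sql("""
  CREATE TABLE web_sales_bucketed
  USING PARQUET
  CLUSTERED BY (ws_item_sk) INTO 4 BUCKETS
  AS SELECT ws_item_sk FROM web_sales
""")

// The bucketed side still shows Exchange hashpartitioning(ws_item_sk, 10) in the plan.
spark.sql("""
  SELECT ws_item_sk, i_item_sk
  FROM web_sales_bucketed
  JOIN item ON ws_item_sk = i_item_sk
""").explain()
{code}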

 

  was:
When shuffle.partition > bucket number, Spark needs to shuffle the bucketed table 
according to shuffle.partition. Can we avoid this?

See the example below:
{code:java}
CREATE TABLE dev
USING PARQUET
AS SELECT ws_item_sk, i_item_sk
FROM web_sales_bucketed
JOIN item ON ws_item_sk = i_item_sk;{code}
Suppose web_sales_bucketed is bucketed into 4 buckets and shuffle.partition is 
set to 10.

Currently, both tables are shuffled into 10 partitions.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(5) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(2) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ws_item_sk#2, 10)
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(4) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 10)
+- *(3) Project [i_item_sk#6]
+- *(3) Filter isnotnull(i_item_sk#6)
+- *(3) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...
{code}
A better plan would avoid the shuffle on the bucketed table.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(4) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(1) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(3) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 4)
+- *(2) Project [i_item_sk#6]
+- *(2) Filter isnotnull(i_item_sk#6)
+- *(2) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...{code}
This problem could be worse when adaptive execution is enabled, because it 
usually prefers a large shuffle.partition.

 


> Avoid shuffle for the bucketed table when shuffle.partition > bucket number
> ---
>
> Key: SPARK-24343
> URL: https://issues.apache.org/jira/browse/SPARK-24343
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Priority: Major
>
> When shuffle.partition > bucket number, Spark needs to shuffle the bucketed 
> table according to shuffle.partition. Can we avoid this?
> See the example below:

[jira] [Updated] (SPARK-24343) Avoid shuffle for the bucketed table when shuffle.partition > bucket number

2018-05-22 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-24343:
--
Description: 
When shuffle.partition > bucket number, Spark needs to shuffle the bucketed table 
according to shuffle.partition. Can we avoid this?

See the example below:
{code:java}
CREATE TABLE dev
USING PARQUET
AS SELECT ws_item_sk, i_item_sk
FROM web_sales_bucketed
JOIN item ON ws_item_sk = i_item_sk;{code}
web_sales_bucketed is bucketed into 4 buckets and shuffle.partition is set to 10.

Currently, both tables are shuffled into 10 partitions.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(5) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(2) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ws_item_sk#2, 10)
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(4) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 10)
+- *(3) Project [i_item_sk#6]
+- *(3) Filter isnotnull(i_item_sk#6)
+- *(3) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...
{code}
A better plan would avoid the shuffle on the bucketed table.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(4) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(1) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(3) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 4)
+- *(2) Project [i_item_sk#6]
+- *(2) Filter isnotnull(i_item_sk#6)
+- *(2) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...{code}
This problem could be worse when adaptive execution is enabled, because it 
usually prefers a large shuffle.partition.

 

  was:
When shuffle.partition > bucket number, Spark needs to shuffle the bucketed table 
according to shuffle.partition. Can we avoid this?

See the example below:
{code:java}
CREATE TABLE dev
USING PARQUET
AS SELECT ws_item_sk, i_item_sk
FROM web_sales_bucketed
JOIN item ON ws_item_sk = i_item_sk;{code}
web_sales_bucketed is bucketed into 4 buckets and shuffle.partition is set to 10.

Currently, both tables are shuffled into 10 partitions.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(5) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(2) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ws_item_sk#2, 10)
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(4) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 10)
+- *(3) Project [i_item_sk#6]
+- *(3) Filter isnotnull(i_item_sk#6)
+- *(3) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...
{code}
A better plan would avoid the shuffle on the bucketed table.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(4) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(1) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(3) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 4)
+- *(2) Project [i_item_sk#6]
+- *(2) Filter isnotnull(i_item_sk#6)
+- *(2) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...{code}
 

 

This problem could be worse when adaptive execution is enabled, because it 
usually prefers a large shuffle.partition.

 


> Avoid shuffle for the bucketed table when shuffle.partition > bucket number
> ---
>
> Key: SPARK-24343
> URL: https://issues.apache.org/jira/browse/SPARK-24343
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Priority: Major
>
> When shuffle.partition > bucket number, Spark needs to shuffle the bucketed 
> table according to shuffle.partition. Can we avoid this?
> See the example below:
> 

[jira] [Assigned] (SPARK-24063) Control maximum epoch backlog

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24063:


Assignee: Apache Spark

> Control maximum epoch backlog
> -
>
> Key: SPARK-24063
> URL: https://issues.apache.org/jira/browse/SPARK-24063
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Efim Poberezkin
>Assignee: Apache Spark
>Priority: Major
>
> As pointed out by [~joseph.torres] in 
> [https://github.com/apache/spark/pull/20936], both the epoch queue and the 
> commits/offsets maps are unbounded in the number of waiting epochs. Following 
> his proposal, we should introduce a configuration option for the maximum 
> epoch backlog and report an error if the number of waiting epochs exceeds it.
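
A minimal sketch of what such a bound could look like; the class and 
configuration names here are illustrative, not Spark's actual API:
{code:scala}
import scala.collection.mutable

// Illustrative only: reject new epochs once the backlog exceeds a configured maximum,
// e.g. a (hypothetical) spark.sql.streaming.continuous.maxEpochBacklog setting.
class EpochBacklog(maxEpochBacklog: Int) {
  private val waitingEpochs = mutable.Queue.empty[Long]

  def enqueue(epoch: Long): Unit = {
    if (waitingEpochs.size >= maxEpochBacklog) {
      throw new IllegalStateException(
        s"Number of waiting epochs (${waitingEpochs.size}) exceeds the maximum backlog of $maxEpochBacklog")
    }
    waitingEpochs.enqueue(epoch)
  }

  // Called when the oldest waiting epoch is committed.
  def commitOldest(): Option[Long] =
    if (waitingEpochs.nonEmpty) Some(waitingEpochs.dequeue()) else None
}
{code}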



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24063) Control maximum epoch backlog

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24063:


Assignee: (was: Apache Spark)

> Control maximum epoch backlog
> -
>
> Key: SPARK-24063
> URL: https://issues.apache.org/jira/browse/SPARK-24063
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Efim Poberezkin
>Priority: Major
>
> As pointed out by [~joseph.torres] in 
> [https://github.com/apache/spark/pull/20936], both the epoch queue and the 
> commits/offsets maps are unbounded in the number of waiting epochs. Following 
> his proposal, we should introduce a configuration option for the maximum 
> epoch backlog and report an error if the number of waiting epochs exceeds it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24329) Remove comments filtering before parsing of CSV files

2018-05-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483848#comment-16483848
 ] 

Apache Spark commented on SPARK-24329:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21394

> Remove comments filtering before parsing of CSV files
> -
>
> Key: SPARK-24329
> URL: https://issues.apache.org/jira/browse/SPARK-24329
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Comment and whitespace filtering is already performed by the uniVocity parser 
> according to the parser settings:
> https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L178-L180
> It is not necessary to do the same filtering before parsing. We need to inspect 
> all places where the filterCommentAndEmpty method is called and remove the 
> duplicated filtering where it repeats what the uniVocity parser already does.
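
For reference, a small standalone sketch of how the uniVocity parser can be 
configured to drop comment lines and empty lines itself (settings shown for 
illustration only; this is not the exact code in CSVOptions.scala):
{code:scala}
import java.io.StringReader
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

val settings = new CsvParserSettings()
settings.getFormat.setComment('#')   // lines starting with '#' are treated as comments
settings.setSkipEmptyLines(true)     // blank lines are skipped by the parser

val parser = new CsvParser(settings)
// The comment line and the blank line never reach the caller; only "a,b,c" is returned.
val rows = parser.parseAll(new StringReader("# a comment\n\na,b,c\n"))
{code}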



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24342) Large Task prior scheduling to Reduce overall execution time

2018-05-22 Thread gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gao updated SPARK-24342:

Attachment: tasktimespan.PNG

> Large Task prior scheduling to Reduce overall execution time
> 
>
> Key: SPARK-24342
> URL: https://issues.apache.org/jira/browse/SPARK-24342
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.3.0
>Reporter: gao
>Priority: Minor
> Attachments: tasktimespan.PNG
>
>
> When executing a set of concurrent tasks, the overall execution time can be 
> shortened if the relatively large (long-running) tasks are scheduled before 
> the small ones.
> Therefore, Spark should add a new capability to schedule the large tasks of a 
> task set first and the small tasks afterwards.
> The expected duration of a task can be estimated from its historical execution 
> time: a task that ran for a long time before is likely to run for a long time 
> again. Record the last execution time together with the task's key in a log 
> file, load that log file on the next run, and use RangePartitioner-style 
> partitioning to schedule the large tasks first, which optimizes the concurrent 
> execution.
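
A tiny sketch of the ordering idea (illustrative only, not Spark's scheduler 
API): sort pending tasks by their recorded historical duration, longest first.
{code:scala}
// Order tasks by their last recorded duration, longest first, so long-running tasks
// start early and the overall makespan shrinks. Unknown tasks default to 0 and run last.
case class PendingTask(key: String)

def scheduleLongestFirst(tasks: Seq[PendingTask],
                         historyMs: Map[String, Long]): Seq[PendingTask] =
  tasks.sortBy(t => -historyMs.getOrElse(t.key, 0L))

// historyMs would be loaded from the log file mentioned above.
val historyMs = Map("taskA" -> 120000L, "taskB" -> 5000L)
val ordered = scheduleLongestFirst(
  Seq(PendingTask("taskB"), PendingTask("taskA"), PendingTask("taskC")), historyMs)
// ordered: taskA, taskB, taskC
{code}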



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24342) Large Task prior scheduling to Reduce overall execution time

2018-05-22 Thread gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gao updated SPARK-24342:

Attachment: (was: taskstimespan.png)

> Large Task prior scheduling to Reduce overall execution time
> 
>
> Key: SPARK-24342
> URL: https://issues.apache.org/jira/browse/SPARK-24342
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.3.0
>Reporter: gao
>Priority: Minor
> Attachments: tasktimespan.PNG
>
>
> When executing a set of concurrent tasks, the overall execution time can be 
> shortened if the relatively large (long-running) tasks are scheduled before 
> the small ones.
> Therefore, Spark should add a new capability to schedule the large tasks of a 
> task set first and the small tasks afterwards.
> The expected duration of a task can be estimated from its historical execution 
> time: a task that ran for a long time before is likely to run for a long time 
> again. Record the last execution time together with the task's key in a log 
> file, load that log file on the next run, and use RangePartitioner-style 
> partitioning to schedule the large tasks first, which optimizes the concurrent 
> execution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24342) Large Task prior scheduling to Reduce overall execution time

2018-05-22 Thread gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gao updated SPARK-24342:

Attachment: tasktimespan.PNG

> Large Task prior scheduling to Reduce overall execution time
> 
>
> Key: SPARK-24342
> URL: https://issues.apache.org/jira/browse/SPARK-24342
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.3.0
>Reporter: gao
>Priority: Minor
> Attachments: tasktimespan.PNG
>
>
> When executing a set of concurrent tasks, the overall execution time can be 
> shortened if the relatively large (long-running) tasks are scheduled before 
> the small ones.
> Therefore, Spark should add a new capability to schedule the large tasks of a 
> task set first and the small tasks afterwards.
> The expected duration of a task can be estimated from its historical execution 
> time: a task that ran for a long time before is likely to run for a long time 
> again. Record the last execution time together with the task's key in a log 
> file, load that log file on the next run, and use RangePartitioner-style 
> partitioning to schedule the large tasks first, which optimizes the concurrent 
> execution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24343) Avoid shuffle for the bucketed table when shuffle.partition > bucket number

2018-05-22 Thread yucai (JIRA)
yucai created SPARK-24343:
-

 Summary: Avoid shuffle for the bucketed table when 
shuffle.partition > bucket number
 Key: SPARK-24343
 URL: https://issues.apache.org/jira/browse/SPARK-24343
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: yucai


When shuffle.partition > bucket number, Spark needs to shuffle the bucketed table 
according to shuffle.partition. Can we avoid this?

See the example below:
{code:java}
CREATE TABLE dev
USING PARQUET
AS SELECT ws_item_sk, i_item_sk
FROM web_sales_bucketed
JOIN item ON ws_item_sk = i_item_sk;{code}
web_sales_bucketed is bucketed into 4 buckets and shuffle.partition is set to 10.

Currently, both tables are shuffled into 10 partitions.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(5) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(2) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ws_item_sk#2, 10)
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(4) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 10)
+- *(3) Project [i_item_sk#6]
+- *(3) Filter isnotnull(i_item_sk#6)
+- *(3) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...
{code}
A better plan would avoid the shuffle on the bucketed table.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(4) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(1) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(3) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 4)
+- *(2) Project [i_item_sk#6]
+- *(2) Filter isnotnull(i_item_sk#6)
+- *(2) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...{code}
 

 

This problem could be worse when adaptive execution is enabled, because it 
usually prefers a large shuffle.partition.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23777) Missing DAG arrows between stages

2018-05-22 Thread Gabor Sudar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483701#comment-16483701
 ] 

Gabor Sudar commented on SPARK-23777:
-

Started to work on a fix for this.

> Missing DAG arrows between stages
> -
>
> Key: SPARK-23777
> URL: https://issues.apache.org/jira/browse/SPARK-23777
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0, 2.3.0
>Reporter: Stefano Pettini
>Priority: Trivial
> Attachments: Screenshot-2018-3-23 RDDTestApp - Details for Job 0.png
>
>
> In the Spark UI DAGs, sometimes there are missing arrows between stages. It 
> seems to happen when the same RDD is shuffled twice.
> For example in this case:
> {{val a = context.parallelize(List(10, 20, 30, 40, 50)).keyBy(_ / 10)}}
> {{val b = a join a}}
> {{b.collect()}}
> There's a missing arrow from stage 1 to 2.
> _This is an old one, since 1.3.0 at least, still reproducible in 2.3.0._



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24344) Spark SQL Thrift Server issue

2018-05-22 Thread L (JIRA)
L created SPARK-24344:
-

 Summary: Spark SQL Thrift Server issue
 Key: SPARK-24344
 URL: https://issues.apache.org/jira/browse/SPARK-24344
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: L


I want to use the Spark Thrift Server to operate on data in Hive. I have started 
the Spark Thrift Server successfully on port 10015, and the Hive Thrift Server 
uses its default port. But when I use beeline to connect to the Spark Thrift 
Server, I get the error below:

!image-2018-05-22-18-15-55-200.png!

The process of thrift server:

!image-2018-05-22-18-17-01-271.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24344) Spark SQL Thrift Server issue

2018-05-22 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-24344:

Priority: Major  (was: Blocker)

> Spark SQL Thrift Server issue
> -
>
> Key: SPARK-24344
> URL: https://issues.apache.org/jira/browse/SPARK-24344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: L
>Priority: Major
>
> I want to use the Spark Thrift Server to operate on data in Hive. I have 
> started the Spark Thrift Server successfully on port 10015, and the Hive 
> Thrift Server uses its default port. But when I use beeline to connect to the 
> Spark Thrift Server, I get the error below:
> !image-2018-05-22-18-15-55-200.png!
> The process of thrift server:
> !image-2018-05-22-18-17-01-271.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24343) Avoid shuffle for the bucketed table when shuffle.partition > bucket number

2018-05-22 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-24343:
--
Description: 
When shuffle.partition > bucket number, Spark needs to shuffle the bucketed table 
according to shuffle.partition. Can we avoid this?

See the example below:
{code:java}
CREATE TABLE dev
USING PARQUET
AS SELECT ws_item_sk, i_item_sk
FROM web_sales_bucketed
JOIN item ON ws_item_sk = i_item_sk;{code}
Suppose web_sales_bucketed is bucketed into 4 buckets and shuffle.partition is 
set to 10.

Currently, both tables are shuffled into 10 partitions.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(5) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(2) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ws_item_sk#2, 10)
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(4) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 10)
+- *(3) Project [i_item_sk#6]
+- *(3) Filter isnotnull(i_item_sk#6)
+- *(3) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...
{code}
A better plan would avoid the shuffle on the bucketed table.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(4) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(1) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(3) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 4)
+- *(2) Project [i_item_sk#6]
+- *(2) Filter isnotnull(i_item_sk#6)
+- *(2) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...{code}
This problem could be worse when adaptive execution is enabled, because it 
usually prefers a large shuffle.partition.

 

  was:
When shuffle.partition > bucket number, Spark needs to shuffle the bucketed table 
according to shuffle.partition. Can we avoid this?

See the example below:
{code:java}
CREATE TABLE dev
USING PARQUET
AS SELECT ws_item_sk, i_item_sk
FROM web_sales_bucketed
JOIN item ON ws_item_sk = i_item_sk;{code}
web_sales_bucketed is bucketed into 4 buckets and shuffle.partition is set to 10.

Currently, both tables are shuffled into 10 partitions.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(5) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(2) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ws_item_sk#2, 10)
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(4) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 10)
+- *(3) Project [i_item_sk#6]
+- *(3) Filter isnotnull(i_item_sk#6)
+- *(3) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...
{code}
A better plan would avoid the shuffle on the bucketed table.
{code:java}
Execute CreateDataSourceTableAsSelectCommand 
CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
i_item_sk#6]
+- *(4) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
:- *(1) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
: +- *(1) Project [ws_item_sk#2]
: +- *(1) Filter isnotnull(ws_item_sk#2)
: +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
true, Format: Parquet, Location:...
+- *(3) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(i_item_sk#6, 4)
+- *(2) Project [i_item_sk#6]
+- *(2) Filter isnotnull(i_item_sk#6)
+- *(2) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
Parquet, Location:...{code}
This problem could be worse when adaptive execution is enabled, because it 
usually prefers a large shuffle.partition.

 


> Avoid shuffle for the bucketed table when shuffle.partition > bucket number
> ---
>
> Key: SPARK-24343
> URL: https://issues.apache.org/jira/browse/SPARK-24343
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Priority: Major
>
> When shuffle.partition > bucket number, Spark needs to shuffle the bucketed 
> table according to shuffle.partition. Can we avoid this?
> See the example below:
> 

[jira] [Assigned] (SPARK-24343) Avoid shuffle for the bucketed table when shuffle.partition > bucket number

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24343:


Assignee: (was: Apache Spark)

> Avoid shuffle for the bucketed table when shuffle.partition > bucket number
> ---
>
> Key: SPARK-24343
> URL: https://issues.apache.org/jira/browse/SPARK-24343
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Priority: Major
>
> When shuffle.partition > bucket number, Spark needs to shuffle the bucketed 
> table according to shuffle.partition. Can we avoid this?
> See the example below:
> {code:java}
> CREATE TABLE dev
> USING PARQUET
> AS SELECT ws_item_sk, i_item_sk
> FROM web_sales_bucketed
> JOIN item ON ws_item_sk = i_item_sk;{code}
> Suppose web_sales_bucketed is bucketed into 4 buckets and shuffle.partition is 
> set to 10.
> Currently, both tables are shuffled into 10 partitions.
> {code:java}
> Execute CreateDataSourceTableAsSelectCommand 
> CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
> i_item_sk#6]
> +- *(5) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
> :- *(2) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
> : +- Exchange hashpartitioning(ws_item_sk#2, 10)
> : +- *(1) Project [ws_item_sk#2]
> : +- *(1) Filter isnotnull(ws_item_sk#2)
> : +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
> true, Format: Parquet, Location:...
> +- *(4) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
> +- Exchange hashpartitioning(i_item_sk#6, 10)
> +- *(3) Project [i_item_sk#6]
> +- *(3) Filter isnotnull(i_item_sk#6)
> +- *(3) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
> Parquet, Location:...
> {code}
> A better plan would avoid the shuffle on the bucketed table.
> {code:java}
> Execute CreateDataSourceTableAsSelectCommand 
> CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
> i_item_sk#6]
> +- *(4) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
> :- *(1) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
> : +- *(1) Project [ws_item_sk#2]
> : +- *(1) Filter isnotnull(ws_item_sk#2)
> : +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
> true, Format: Parquet, Location:...
> +- *(3) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
> +- Exchange hashpartitioning(i_item_sk#6, 4)
> +- *(2) Project [i_item_sk#6]
> +- *(2) Filter isnotnull(i_item_sk#6)
> +- *(2) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
> Parquet, Location:...{code}
> This problem could be worse when adaptive execution is enabled, because it 
> usually prefers a large shuffle.partition.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24343) Avoid shuffle for the bucketed table when shuffle.partition > bucket number

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24343:


Assignee: Apache Spark

> Avoid shuffle for the bucketed table when shuffle.partition > bucket number
> ---
>
> Key: SPARK-24343
> URL: https://issues.apache.org/jira/browse/SPARK-24343
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Assignee: Apache Spark
>Priority: Major
>
> When shuffle.partition > bucket number, Spark needs to shuffle the bucketed 
> table according to shuffle.partition. Can we avoid this?
> See the example below:
> {code:java}
> CREATE TABLE dev
> USING PARQUET
> AS SELECT ws_item_sk, i_item_sk
> FROM web_sales_bucketed
> JOIN item ON ws_item_sk = i_item_sk;{code}
> Suppose web_sales_bucketed is bucketed into 4 buckets and shuffle.partition is 
> set to 10.
> Currently, both tables are shuffled into 10 partitions.
> {code:java}
> Execute CreateDataSourceTableAsSelectCommand 
> CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
> i_item_sk#6]
> +- *(5) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
> :- *(2) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
> : +- Exchange hashpartitioning(ws_item_sk#2, 10)
> : +- *(1) Project [ws_item_sk#2]
> : +- *(1) Filter isnotnull(ws_item_sk#2)
> : +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
> true, Format: Parquet, Location:...
> +- *(4) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
> +- Exchange hashpartitioning(i_item_sk#6, 10)
> +- *(3) Project [i_item_sk#6]
> +- *(3) Filter isnotnull(i_item_sk#6)
> +- *(3) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
> Parquet, Location:...
> {code}
> A better plan would avoid the shuffle on the bucketed table.
> {code:java}
> Execute CreateDataSourceTableAsSelectCommand 
> CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
> i_item_sk#6]
> +- *(4) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
> :- *(1) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
> : +- *(1) Project [ws_item_sk#2]
> : +- *(1) Filter isnotnull(ws_item_sk#2)
> : +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
> true, Format: Parquet, Location:...
> +- *(3) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
> +- Exchange hashpartitioning(i_item_sk#6, 4)
> +- *(2) Project [i_item_sk#6]
> +- *(2) Filter isnotnull(i_item_sk#6)
> +- *(2) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
> Parquet, Location:...{code}
> This problem could be worse when adaptive execution is enabled, because it 
> usually prefers a large shuffle.partition.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24343) Avoid shuffle for the bucketed table when shuffle.partition > bucket number

2018-05-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483723#comment-16483723
 ] 

Apache Spark commented on SPARK-24343:
--

User 'yucai' has created a pull request for this issue:
https://github.com/apache/spark/pull/21391

> Avoid shuffle for the bucketed table when shuffle.partition > bucket number
> ---
>
> Key: SPARK-24343
> URL: https://issues.apache.org/jira/browse/SPARK-24343
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Priority: Major
>
> When shuffle.partition > bucket number, Spark needs to shuffle the bucketed 
> table according to shuffle.partition. Can we avoid this?
> See the example below:
> {code:java}
> CREATE TABLE dev
> USING PARQUET
> AS SELECT ws_item_sk, i_item_sk
> FROM web_sales_bucketed
> JOIN item ON ws_item_sk = i_item_sk;{code}
> Suppose web_sales_bucketed is bucketed into 4 buckets and shuffle.partition is 
> set to 10.
> Currently, both tables are shuffled into 10 partitions.
> {code:java}
> Execute CreateDataSourceTableAsSelectCommand 
> CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
> i_item_sk#6]
> +- *(5) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
> :- *(2) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
> : +- Exchange hashpartitioning(ws_item_sk#2, 10)
> : +- *(1) Project [ws_item_sk#2]
> : +- *(1) Filter isnotnull(ws_item_sk#2)
> : +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
> true, Format: Parquet, Location:...
> +- *(4) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
> +- Exchange hashpartitioning(i_item_sk#6, 10)
> +- *(3) Project [i_item_sk#6]
> +- *(3) Filter isnotnull(i_item_sk#6)
> +- *(3) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
> Parquet, Location:...
> {code}
> A better plan would avoid the shuffle on the bucketed table.
> {code:java}
> Execute CreateDataSourceTableAsSelectCommand 
> CreateDataSourceTableAsSelectCommand `dev`, ErrorIfExists, [ws_item_sk#2, 
> i_item_sk#6]
> +- *(4) SortMergeJoin [ws_item_sk#2], [i_item_sk#6], Inner
> :- *(1) Sort [ws_item_sk#2 ASC NULLS FIRST], false, 0
> : +- *(1) Project [ws_item_sk#2]
> : +- *(1) Filter isnotnull(ws_item_sk#2)
> : +- *(1) FileScan parquet tpcds10.web_sales_bucketed[ws_item_sk#2] Batched: 
> true, Format: Parquet, Location:...
> +- *(3) Sort [i_item_sk#6 ASC NULLS FIRST], false, 0
> +- Exchange hashpartitioning(i_item_sk#6, 4)
> +- *(2) Project [i_item_sk#6]
> +- *(2) Filter isnotnull(i_item_sk#6)
> +- *(2) FileScan parquet tpcds10.item[i_item_sk#6] Batched: true, Format: 
> Parquet, Location:...{code}
> This problem could be worse when adaptive execution is enabled, because it 
> usually prefers a large shuffle.partition.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24321) Extract common code from Divide/Remainder to a base trait

2018-05-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24321.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21367
[https://github.com/apache/spark/pull/21367]

> Extract common code from Divide/Remainder to a base trait
> -
>
> Key: SPARK-24321
> URL: https://issues.apache.org/jira/browse/SPARK-24321
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Minor
> Fix For: 2.4.0
>
>
> There is a lot of code duplication between the {{Divide}} and {{Remainder}} 
> expression types: mostly the codegen template (which is exactly the same, with 
> just cosmetic differences), the eval function structure, etc.
> It is tedious to have to update multiple places whenever we improve the 
> codegen templates in the future. This ticket proposes to refactor the 
> duplicated code into a common base trait for these two classes.
> Non-goal: there is another class, {{Pmod}}, which is also similar to {{Divide}} 
> and {{Remainder}}, so in theory we could make a deeper refactoring to 
> accommodate this class as well. But the "operation" part of its codegen 
> template is harder to factor into the base trait, so this ticket only handles 
> {{Divide}} and {{Remainder}} for now.
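
The refactoring pattern, sketched outside of Spark's actual classes (the trait 
and method names below are hypothetical, not the ones used in the patch):
{code:scala}
// Shared structure pulled into a base trait: both operators return None (null) when
// the divisor is zero, and only the actual operation differs per subclass.
trait DivModLike {
  def symbol: String
  def evalOperation(left: Double, right: Double): Double

  final def nullSafeEval(left: Double, right: Double): Option[Double] =
    if (right == 0.0) None else Some(evalOperation(left, right))
}

case object DivideOp extends DivModLike {
  val symbol = "/"
  def evalOperation(left: Double, right: Double): Double = left / right
}

case object RemainderOp extends DivModLike {
  val symbol = "%"
  def evalOperation(left: Double, right: Double): Double = left % right
}
{code}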



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24344) Spark SQL Thrift Server issue

2018-05-22 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483829#comment-16483829
 ] 

Marco Gaido commented on SPARK-24344:
-

I moved this to Major, as Critical and Blocker are reserved for committers. 
Moreover, I'd recommend uploading the log file for the error you are reporting. 
Honestly, I don't think this JIRA is actionable with so few details. Thanks.

> Spark SQL Thrift Server issue
> -
>
> Key: SPARK-24344
> URL: https://issues.apache.org/jira/browse/SPARK-24344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: L
>Priority: Major
>
> I want to use the Spark Thrift Server to operate on data in Hive. I have 
> started the Spark Thrift Server successfully on port 10015, and the Hive 
> Thrift Server uses its default port. But when I use beeline to connect to the 
> Spark Thrift Server, I get the error below:
> !image-2018-05-22-18-15-55-200.png!
> The process of thrift server:
> !image-2018-05-22-18-17-01-271.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24341) Codegen compile error from predicate subquery

2018-05-22 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483846#comment-16483846
 ] 

Marco Gaido commented on SPARK-24341:
-

This is an issue in the Optimizer, rather than a codegen issue. I'll look at 
this in more detail in the next few days.

> Codegen compile error from predicate subquery
> -
>
> Key: SPARK-24341
> URL: https://issues.apache.org/jira/browse/SPARK-24341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Juliusz Sompolski
>Priority: Minor
>
> Ran on master:
> {code}
> drop table if exists juleka;
> drop table if exists julekb;
> create table juleka (a integer, b integer);
> create table julekb (na integer, nb integer);
> insert into juleka values (1,1);
> insert into julekb values (1,1);
> select * from juleka where (a, b) not in (select (na, nb) from julekb);
> {code}
> Results in:
> {code}
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 27, Column 29: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 27, Column 29: Cannot compare types "int" and 
> "org.apache.spark.sql.catalyst.InternalRow"
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
>   at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>   at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
>   at 
> com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2344)
>   at 
> com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2316)
>   at 
> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:3932)
>   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1415)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:92)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:46)
>   at 
> org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:380)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:99)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:97)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203)
>   at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>   at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
>   at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:203)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:202)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:49)
>   at 
> org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:126)
>   at 
> org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:125)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:111)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349)
>   at 
> 

[jira] [Updated] (SPARK-24342) Large Task prior scheduling to Reduce overall execution time

2018-05-22 Thread gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gao updated SPARK-24342:

Attachment: taskstimespan.png

> Large Task prior scheduling to Reduce overall execution time
> 
>
> Key: SPARK-24342
> URL: https://issues.apache.org/jira/browse/SPARK-24342
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.3.0
>Reporter: gao
>Priority: Minor
> Attachments: taskstimespan.png
>
>
> When executing a set of concurrent tasks, the overall execution time can be 
> shortened if the relatively large (long-running) tasks are scheduled before 
> the small ones.
> Therefore, Spark should add a new capability to schedule the large tasks of a 
> task set first and the small tasks afterwards.
> The expected duration of a task can be estimated from its historical execution 
> time: a task that ran for a long time before is likely to run for a long time 
> again. Record the last execution time together with the task's key in a log 
> file, load that log file on the next run, and use RangePartitioner-style 
> partitioning to schedule the large tasks first, which optimizes the concurrent 
> execution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24321) Extract common code from Divide/Remainder to a base trait

2018-05-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24321:
---

Assignee: Kris Mok

> Extract common code from Divide/Remainder to a base trait
> -
>
> Key: SPARK-24321
> URL: https://issues.apache.org/jira/browse/SPARK-24321
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Minor
> Fix For: 2.4.0
>
>
> There is a lot of code duplication between the {{Divide}} and {{Remainder}} 
> expression types: mostly the codegen template (which is exactly the same, with 
> just cosmetic differences), the eval function structure, etc.
> It is tedious to have to update multiple places whenever we improve the 
> codegen templates in the future. This ticket proposes to refactor the 
> duplicated code into a common base trait for these two classes.
> Non-goal: there is another class, {{Pmod}}, which is also similar to {{Divide}} 
> and {{Remainder}}, so in theory we could make a deeper refactoring to 
> accommodate this class as well. But the "operation" part of its codegen 
> template is harder to factor into the base trait, so this ticket only handles 
> {{Divide}} and {{Remainder}} for now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24349) obtainDelegationTokens() exits JVM if Driver use JDBC instead of using metastore

2018-05-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484096#comment-16484096
 ] 

Apache Spark commented on SPARK-24349:
--

User 'LantaoJin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21396

> obtainDelegationTokens() exits JVM if Driver use JDBC instead of using 
> metastore 
> -
>
> Key: SPARK-24349
> URL: https://issues.apache.org/jira/browse/SPARK-24349
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Lantao Jin
>Priority: Major
>
> In [SPARK-23639|https://issues.apache.org/jira/browse/SPARK-23639], using 
> --proxy-user to impersonate invokes obtainDelegationTokens(), but when the 
> driver uses JDBC instead of the metastore, it fails with
> {code}
> WARN HiveConf: HiveConf of name hive.server2.enable.impersonation does not 
> exist
> Exception in thread "main" java.lang.IllegalArgumentException: requirement 
> failed: Hive metastore uri undefined
> at scala.Predef$.require(Predef.scala:224)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveCredentialProvider.obtainCredentials(HiveCredentialProvider.scala:73)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:56)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:288)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:137)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:169)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:167)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:167)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 18/05/22 05:24:16 INFO ShutdownHookManager: Shutdown hook called
> 18/05/22 05:24:16 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-b63ad788-1a47-4326-9972-c4fde1dc19c3
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24324) UserDefinedFunction mixes column labels

2018-05-22 Thread Cristian Consonni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484138#comment-16484138
 ] 

Cristian Consonni commented on SPARK-24324:
---

[~hyukjin.kwon] said:
> Can you narrow down the problem? It roughly sounds like because names and 
> rows are mapped by position in CSV or Pandas UDF.

It is not just that: even if I change the return statement in the function 
{code:python}concat_hours{code} to
{code:python}
return pd.DataFrame({'lang': x.lang, 'page': x.page, 'day': x.day,
 'enc': encoded_views_string}, index=[x.index[0]])
{code}
which is ordered in the same way as the schema, I still get mixed columns:

{noformat}
+--++---+-+ 
|  lang|page|day|  enc|
+--++---+-+
|2007-12-10|A150B148C197| en| Albert_Camus|
|2007-12-11|G1I1P3V1| en|Albert_Caquot|
|2007-12-10|  C1C1E1| en|Albert_Caquot|
|2007-12-11|U145V131W154X142| en| Albert_Camus|
+--++---+-+
{noformat}

The only way I can get the right values in the columns is with the following:
{code:python}
return pd.DataFrame({'enc': x.page, 'day': x.lang, 'lang': x.day,
 'page': encoded_views_string}, index=[x.index[0]])
{code}
which to me has absolutely no meaning and/or reason.

It took me several hours to debug this because I was getting all sorts of other 
errors in between (e.g. type mismatches and the like).

> UserDefinedFunction mixes column labels
> ---
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>  
> Java:
> {noformat}
> $ java -version 
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.4 LTS
> Release:  16.04
> Codename: xenial
> {noformat}
>Reporter: Cristian Consonni
>Priority: Major
>
> I am working on Wikipedia page views (see [task T188041 on Wikimedia's 
> Phabricator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's 
> say that these are the data:
> {noformat}
>
> {noformat}
> For each combination of (lang, page, day(timestamp)) I need to transform  the 
>  views for each hour:
> {noformat}
> 00:00 -> A
> 01:00 -> B
> ...
> {noformat}
> and concatenate the number of views for that hour.  So, if a page got 5 views 
> at 00:00 and 7 views at 01:00 it would become:
> {noformat}
> A5B7
> {noformat}
>  
> I have written a UDF called {code:python}concat_hours{code}.
> However, the function mixes up the columns and I am not sure what is going on. 
> Below is a minimal complete example that reproduces the issue on my system 
> (the details of my environment are above).
> {code:python}
> #!/usr/bin/env python3
> # coding: utf-8
> input_data = b"""en Albert_Camus 20071210-00 150
> en Albert_Camus 20071210-01 148
> en Albert_Camus 20071210-02 197
> en Albert_Camus 20071211-20 145
> en Albert_Camus 20071211-21 131
> en Albert_Camus 20071211-22 154
> en Albert_Camus 20071211-230001 142
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-040001 1
> en Albert_Caquot 20071211-06 1
> en Albert_Caquot 

[jira] [Assigned] (SPARK-24351) offsetLog/commitLog purge thresholdBatchId should be computed with current committed epoch but not currentBatchId in CP mode

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24351:


Assignee: (was: Apache Spark)

> offsetLog/commitLog purge thresholdBatchId should be computed with current 
> committed epoch but not currentBatchId in CP mode
> 
>
> Key: SPARK-24351
> URL: https://issues.apache.org/jira/browse/SPARK-24351
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: huangtengfei
>Priority: Major
>
> In structured streaming, there is a conf 
> spark.sql.streaming.minBatchesToRetain which is set to specify 'The minimum 
> number of batches that must be retained and made recoverable' as described in 
> [SQLConf](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L802).
>  In continuous processing, the metadata purge is triggered when an epoch is 
> committed in 
> [ContinuousExecution](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala#L306).
>  
> Since currentBatchId increases independently in cp mode, the current 
> committed epoch may be far behind currentBatchId if some task hangs for some 
> time. It is not safe to discard the metadata with thresholdBatchId computed 
> based on currentBatchId because we may clean all the metadata in the 
> checkpoint directory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24351) offsetLog/commitLog purge thresholdBatchId should be computed with current committed epoch but not currentBatchId in CP mode

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24351:


Assignee: Apache Spark

> offsetLog/commitLog purge thresholdBatchId should be computed with current 
> committed epoch but not currentBatchId in CP mode
> 
>
> Key: SPARK-24351
> URL: https://issues.apache.org/jira/browse/SPARK-24351
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: huangtengfei
>Assignee: Apache Spark
>Priority: Major
>
> In structured streaming, there is a conf 
> spark.sql.streaming.minBatchesToRetain which is set to specify 'The minimum 
> number of batches that must be retained and made recoverable' as described in 
> [SQLConf](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L802).
>  In continuous processing, the metadata purge is triggered when an epoch is 
> committed in 
> [ContinuousExecution](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala#L306).
>  
> Since currentBatchId increases independently in cp mode, the current 
> committed epoch may be far behind currentBatchId if some task hangs for some 
> time. It is not safe to discard the metadata with thresholdBatchId computed 
> based on currentBatchId because we may clean all the metadata in the 
> checkpoint directory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24350) ClassCastException in "array_position" function

2018-05-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484172#comment-16484172
 ] 

Apache Spark commented on SPARK-24350:
--

User 'wajda' has created a pull request for this issue:
https://github.com/apache/spark/pull/21401

> ClassCastException in "array_position" function
> ---
>
> Key: SPARK-24350
> URL: https://issues.apache.org/jira/browse/SPARK-24350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Alex Wajda
>Priority: Major
> Fix For: 2.4.0
>
>
> When calling the {{array_position}} function with a wrong type for the 1st operand, 
> a {{ClassCastException}} is thrown instead of an {{AnalysisException}}.
> Example:
> {code:sql}
> select array_position('foo', 'bar')
> {code}
> {noformat}
> java.lang.ClassCastException: org.apache.spark.sql.types.StringType$ cannot 
> be cast to org.apache.spark.sql.types.ArrayType
>   at 
> org.apache.spark.sql.catalyst.expressions.ArrayPosition.inputTypes(collectionOperations.scala:1398)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpectsInputTypes$class.checkInputDataTypes(ExpectsInputTypes.scala:44)
>   at 
> org.apache.spark.sql.catalyst.expressions.ArrayPosition.checkInputDataTypes(collectionOperations.scala:1401)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:256)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:252)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24347) df.alias() in python API should not clear metadata by default

2018-05-22 Thread Tomasz Bartczak (JIRA)
Tomasz Bartczak created SPARK-24347:
---

 Summary: df.alias() in python API should not clear metadata by 
default
 Key: SPARK-24347
 URL: https://issues.apache.org/jira/browse/SPARK-24347
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Tomasz Bartczak


Currently, when doing an alias on a column in PySpark, I lose the metadata:
{code:java}
print("just select = ", df.select(col("v")).schema.fields[0].metadata.keys())
print("select alias= ", 
df.select(col("v").alias("vv")).schema.fields[0].metadata.keys()){code}
gives:
{code:java}
just select =  dict_keys(['ml_attr'])
select alias=  dict_keys([]){code}
After looking at the alias() documentation I see that metadata is an optional 
param, but it should not clear the metadata when it is not set. The default 
behaviour should be to keep it as-is.

Otherwise it generates problems in a later part of the processing pipeline 
when something downstream depends on the metadata.
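
A possible workaround, sketched under the assumption that the metadata to keep is 
already present on the source column: pass it explicitly to alias(), which takes an 
optional metadata keyword argument:
{code:python}
from pyspark.sql.functions import col

# Sketch of a workaround: carry the existing metadata over explicitly.
v_meta = df.schema["v"].metadata
df2 = df.select(col("v").alias("vv", metadata=v_meta))
print("select alias = ", df2.schema.fields[0].metadata.keys())
# expected: dict_keys(['ml_attr'])
{code}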

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24342) Large Task prior scheduling to Reduce overall execution time

2018-05-22 Thread gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gao updated SPARK-24342:

Description: 
When executing a set of concurrent tasks, running the relatively large 
(long-running) tasks before the small tasks can shorten the overall execution 
time.
Therefore, Spark should add a feature that schedules the large tasks of a task 
set first and the small tasks afterwards.
A task's expected duration can be estimated from its historical execution 
time: a task that ran long in the past is likely to run long again in the 
future. Record the last execution time together with the task's key in a log 
file, and load that log file on the next run. Using the RangePartitioner and 
custom partitioning methods, large tasks can be prioritized, which optimizes 
the execution of concurrent tasks.
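
As an illustration only (this is not Spark's scheduler API), the heuristic behind the 
proposal is longest-processing-time-first ordering based on the recorded history:
{code:python}
# Illustrative sketch: dispatch the tasks with the longest recorded execution
# time first; tasks with no history fall back to `default` and run last.
def schedule_order(task_keys, history, default=0.0):
    return sorted(task_keys, key=lambda k: history.get(k, default), reverse=True)

history = {"taskA": 120.0, "taskB": 5.0, "taskC": 40.0}  # seconds, from the log file
print(schedule_order(["taskB", "taskA", "taskC"], history))
# ['taskA', 'taskC', 'taskB']
{code}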

> Large Task prior scheduling to Reduce overall execution time
> 
>
> Key: SPARK-24342
> URL: https://issues.apache.org/jira/browse/SPARK-24342
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.3.0
>Reporter: gao
>Priority: Minor
>
> When executing a set of concurrent tasks, running the relatively large 
> (long-running) tasks before the small tasks can shorten the overall execution 
> time.
> Therefore, Spark should add a feature that schedules the large tasks of a task 
> set first and the small tasks afterwards.
> A task's expected duration can be estimated from its historical execution 
> time: a task that ran long in the past is likely to run long again in the 
> future. Record the last execution time together with the task's key in a log 
> file, and load that log file on the next run. Using the RangePartitioner and 
> custom partitioning methods, large tasks can be prioritized, which optimizes 
> the execution of concurrent tasks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24342) Large Task prior scheduling to Reduce overall execution time

2018-05-22 Thread gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gao updated SPARK-24342:

Attachment: (was: tasktimespan.PNG)

> Large Task prior scheduling to Reduce overall execution time
> 
>
> Key: SPARK-24342
> URL: https://issues.apache.org/jira/browse/SPARK-24342
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.3.0
>Reporter: gao
>Priority: Minor
> Attachments: tasktimespan.PNG
>
>
> When executing a set of concurrent tasks, running the relatively large 
> (long-running) tasks before the small tasks can shorten the overall execution 
> time.
> Therefore, Spark should add a feature that schedules the large tasks of a task 
> set first and the small tasks afterwards.
> A task's expected duration can be estimated from its historical execution 
> time: a task that ran long in the past is likely to run long again in the 
> future. Record the last execution time together with the task's key in a log 
> file, and load that log file on the next run. Using the RangePartitioner and 
> custom partitioning methods, large tasks can be prioritized, which optimizes 
> the execution of concurrent tasks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24063) Control maximum epoch backlog

2018-05-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483796#comment-16483796
 ] 

Apache Spark commented on SPARK-24063:
--

User 'efimpoberezkin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21392

> Control maximum epoch backlog
> -
>
> Key: SPARK-24063
> URL: https://issues.apache.org/jira/browse/SPARK-24063
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Efim Poberezkin
>Priority: Major
>
> As pointed out by [~joseph.torres] in 
> [https://github.com/apache/spark/pull/20936], both the epoch queue and the 
> commits/offsets maps are unbounded in the number of waiting epochs. According 
> to his proposal, we should introduce some configuration option for maximum 
> epoch backlog and report an error if the number of waiting epochs exceeds it.
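
A rough sketch of the idea (the limit value and function below are hypothetical, not 
the actual configuration or classes):
{code:python}
# Hypothetical illustration of the proposed bound on the epoch backlog.
MAX_EPOCH_BACKLOG = 10000  # would come from a new configuration option

def enqueue_epoch(waiting_epochs, epoch):
    if len(waiting_epochs) >= MAX_EPOCH_BACKLOG:
        raise RuntimeError(
            "Too many epochs are waiting to be committed: %d >= %d"
            % (len(waiting_epochs), MAX_EPOCH_BACKLOG))
    waiting_epochs.append(epoch)
{code}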



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan

2018-05-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483830#comment-16483830
 ] 

Apache Spark commented on SPARK-20114:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/21393

> spark.ml parity for sequential pattern mining - PrefixSpan
> --
>
> Key: SPARK-20114
> URL: https://issues.apache.org/jira/browse/SPARK-20114
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> Creating this jira to track the feature parity for PrefixSpan and sequential 
> pattern mining in Spark ml with DataFrame API. 
> First list a few design issues to be discussed, then subtasks like Scala, 
> Python and R API will be created.
> # Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
> straightforward (a sketch of the underlying MLlib API follows below). Yet 
> PrefixSpan only extracts frequent sequential patterns, 
> which is not good to be used directly for predicting on new records. Please 
> read  
> http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
>  for some background knowledge. Thanks Philippe Fournier-Viger for providing 
> insights. If we want to keep using the Estimator/Transformer pattern, options 
> are:
>  #*  Implement a dummy transform for PrefixSpanModel, which will not add a 
> new column to the input DataSet. The PrefixSpanModel is only used to provide 
> access for frequent sequential patterns.
>  #*  Adding the feature to extract sequential rules from sequential 
> patterns. Then use the sequential rules in the transform as FPGrowthModel.  
> The rules extracted are of the form X–> Y where X and Y are sequential 
> patterns. But in practice, these rules are not very good as they are too 
> precise and thus not noise tolerant.
> #  Different from association rules and frequent itemsets, sequential rules 
> can be extracted from the original dataset more efficiently using algorithms 
> like RuleGrowth, ERMiner. The rules are X–> Y where X is unordered and Y is 
> unordered, but X must appear before Y, which is more general and can work 
> better in practice for prediction. 
> I'd like to hear more from the users to see which kind of Sequential rules 
> are more practical. 
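
For context, a minimal sketch of the existing RDD-based MLlib API that a DataFrame 
fit() would wrap (toy data; {{sc}} is assumed to be an existing SparkContext):
{code:python}
from pyspark.mllib.fpm import PrefixSpan

# Each sequence is a list of itemsets; the values below are toy data.
sequences = sc.parallelize([
    [["a", "b"], ["c"]],
    [["a"], ["c", "b"], ["a", "b"]],
    [["a", "b"], ["e"]],
    [["f"]],
], 2)

model = PrefixSpan.train(sequences, minSupport=0.5, maxPatternLength=5)
for fs in model.freqSequences().collect():
    print(fs.sequence, fs.freq)
{code}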



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24349) obtainDelegationTokens() exits JVM if Driver use JDBC instead of using metastore

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24349:


Assignee: (was: Apache Spark)

> obtainDelegationTokens() exits JVM if Driver use JDBC instead of using 
> metastore 
> -
>
> Key: SPARK-24349
> URL: https://issues.apache.org/jira/browse/SPARK-24349
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Lantao Jin
>Priority: Major
>
> In [SPARK-23639|https://issues.apache.org/jira/browse/SPARK-23639], using 
> --proxy-user to impersonate will invoke obtainDelegationTokens(), but when the 
> current Driver uses JDBC instead of the metastore, it fails with
> {code}
> WARN HiveConf: HiveConf of name hive.server2.enable.impersonation does not 
> exist
> Exception in thread "main" java.lang.IllegalArgumentException: requirement 
> failed: Hive metastore uri undefined
> at scala.Predef$.require(Predef.scala:224)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveCredentialProvider.obtainCredentials(HiveCredentialProvider.scala:73)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:56)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.<init>(SparkSQLCLIDriver.scala:288)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:137)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:169)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:167)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:167)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 18/05/22 05:24:16 INFO ShutdownHookManager: Shutdown hook called
> 18/05/22 05:24:16 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-b63ad788-1a47-4326-9972-c4fde1dc19c3
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24349) obtainDelegationTokens() exits JVM if Driver use JDBC instead of using metastore

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24349:


Assignee: Apache Spark

> obtainDelegationTokens() exits JVM if Driver use JDBC instead of using 
> metastore 
> -
>
> Key: SPARK-24349
> URL: https://issues.apache.org/jira/browse/SPARK-24349
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Lantao Jin
>Assignee: Apache Spark
>Priority: Major
>
> In [SPARK-23639|https://issues.apache.org/jira/browse/SPARK-23639], using 
> --proxy-user to impersonate will invoke obtainDelegationTokens(), but when the 
> current Driver uses JDBC instead of the metastore, it fails with
> {code}
> WARN HiveConf: HiveConf of name hive.server2.enable.impersonation does not 
> exist
> Exception in thread "main" java.lang.IllegalArgumentException: requirement 
> failed: Hive metastore uri undefined
> at scala.Predef$.require(Predef.scala:224)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveCredentialProvider.obtainCredentials(HiveCredentialProvider.scala:73)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:56)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.<init>(SparkSQLCLIDriver.scala:288)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:137)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:169)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:167)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:167)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 18/05/22 05:24:16 INFO ShutdownHookManager: Shutdown hook called
> 18/05/22 05:24:16 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-b63ad788-1a47-4326-9972-c4fde1dc19c3
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22269) Java style checks should be run in Jenkins

2018-05-22 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-22269:
-
Priority: Major  (was: Minor)

> Java style checks should be run in Jenkins
> --
>
> Key: SPARK-22269
> URL: https://issues.apache.org/jira/browse/SPARK-22269
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>Priority: Major
>
> A few times now I've gone to build the master branch and it's failed due to 
> Java style errors, which I've sent in PRs to fix:
> - https://issues.apache.org/jira/browse/SPARK-22268
> - https://issues.apache.org/jira/browse/SPARK-21875
> Digging through the history a bit, it looks like this check used to run on 
> Jenkins and was previously enabled at 
> https://github.com/apache/spark/pull/10763 but then reverted at 
> https://github.com/apache/spark/commit/4bcea1b8595424678aa6c92d66ba08c92e0fefe5
> We should work out what it takes to enable the Java check in Jenkins so these 
> kinds of errors are caught in CI rather than afterwards post-merge.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24334) Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory allocator

2018-05-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484130#comment-16484130
 ] 

Apache Spark commented on SPARK-24334:
--

User 'icexelloss' has created a pull request for this issue:
https://github.com/apache/spark/pull/21397

> Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory 
> allocator
> -
>
> Key: SPARK-24334
> URL: https://issues.apache.org/jira/browse/SPARK-24334
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Priority: Major
>
> Currently, ArrowPythonRunner has two threads that free the Arrow vector 
> schema root and allocator: the main writer thread and the task completion 
> listener thread. 
> Having both threads do the clean-up leads to weird cases (e.g., negative ref 
> count, NPE, and memory leak exceptions) when an exception is thrown from the 
> user function.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24334) Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory allocator

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24334:


Assignee: Apache Spark

> Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory 
> allocator
> -
>
> Key: SPARK-24334
> URL: https://issues.apache.org/jira/browse/SPARK-24334
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Assignee: Apache Spark
>Priority: Major
>
> Currently, ArrowPythonRunner has two threads that free the Arrow vector 
> schema root and allocator: the main writer thread and the task completion 
> listener thread. 
> Having both threads do the clean-up leads to weird cases (e.g., negative ref 
> count, NPE, and memory leak exceptions) when an exception is thrown from the 
> user function.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24324) UserDefinedFunction mixes column labels

2018-05-22 Thread Cristian Consonni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484138#comment-16484138
 ] 

Cristian Consonni edited comment on SPARK-24324 at 5/22/18 3:30 PM:


[~hyukjin.kwon] said:
> Can you narrow down the problem? It roughly sounds like because names and 
> rows are mapped by position in CSV or Pandas UDF.

It is not just that: even if I change the return statement in the function 
{code:python}concat_hours{code} to
{code:python}
return pd.DataFrame({'lang': x.lang, 'page': x.page, 'day': x.day,
 'enc': encoded_views_string}, index=[x.index[0]])
{code}
which is ordered in the same way as the schema, I still get mixed columns:

{noformat}
+----------+----------------+---+-------------+
|      lang|            page|day|          enc|
+----------+----------------+---+-------------+
|2007-12-10|    A150B148C197| en| Albert_Camus|
|2007-12-11|        G1I1P3V1| en|Albert_Caquot|
|2007-12-10|          C1C1E1| en|Albert_Caquot|
|2007-12-11|U145V131W154X142| en| Albert_Camus|
+----------+----------------+---+-------------+
{noformat}

The only way I can get the right values in the columns is with the following:
{code:python}
return pd.DataFrame({'enc': x.page, 'day': x.lang, 'lang': x.day,
 'page': encoded_views_string}, index=[x.index[0]])
{code}
which to me makes absolutely no sense.

It took me several hours to debug this because I was getting all sorts of other 
errors in between (e.g. type mismatches and the like).


was (Author: cristiancantoro):
[~hyukjin.kwon] said:
> Can you narrow down the problem? It roughly sounds like because names and 
> rows are mapped by position in CSV or Pandas UDF.

It is not just that: even if I change the return statement in the function 
{code:python}concat_hours{code} to
{code:python}
return pd.DataFrame({'lang': x.lang, 'page': x.page, 'day': x.day,
 'enc': encoded_views_string}, index=[x.index[0]])
{code}
which is ordered in the same way as the schema, I still get mixed columns:

{noformat}
+----------+----------------+---+-------------+
|      lang|            page|day|          enc|
+----------+----------------+---+-------------+
|2007-12-10|    A150B148C197| en| Albert_Camus|
|2007-12-11|        G1I1P3V1| en|Albert_Caquot|
|2007-12-10|          C1C1E1| en|Albert_Caquot|
|2007-12-11|U145V131W154X142| en| Albert_Camus|
+----------+----------------+---+-------------+
{noformat}

The only way I can get the right values in the columns is with the following:
{code:python}
return pd.DataFrame({'enc': x.page, 'day': x.lang, 'lang': x.day,
 'page': encoded_views_string}, index=[x.index[0]])
{code}
which to me makes absolutely no sense.

It took me several hours to debug this because I was getting all sorts of other 
errors in between (e.g. type mismatches and the like).

> UserDefinedFunction mixes column labels
> ---
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>  
> Java:
> {noformat}
> $ java -version 
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.4 LTS
> Release:  16.04
> 

[jira] [Commented] (SPARK-22269) Java style checks should be run in Jenkins

2018-05-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484151#comment-16484151
 ] 

Apache Spark commented on SPARK-22269:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/21399

> Java style checks should be run in Jenkins
> --
>
> Key: SPARK-22269
> URL: https://issues.apache.org/jira/browse/SPARK-22269
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>Priority: Major
>
> A few times now I've gone to build the master branch and it's failed due to 
> Java style errors, which I've sent in PRs to fix:
> - https://issues.apache.org/jira/browse/SPARK-22268
> - https://issues.apache.org/jira/browse/SPARK-21875
> Digging through the history a bit, it looks like this check used to run on 
> Jenkins and was previously enabled at 
> https://github.com/apache/spark/pull/10763 but then reverted at 
> https://github.com/apache/spark/commit/4bcea1b8595424678aa6c92d66ba08c92e0fefe5
> We should work out what it takes to enable the Java check in Jenkins so these 
> kinds of errors are caught in CI rather than afterwards post-merge.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22269) Java style checks should be run in Jenkins

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22269:


Assignee: (was: Apache Spark)

> Java style checks should be run in Jenkins
> --
>
> Key: SPARK-22269
> URL: https://issues.apache.org/jira/browse/SPARK-22269
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>Priority: Major
>
> A few times now I've gone to build the master branch and it's failed due to 
> Java style errors, which I've sent in PRs to fix:
> - https://issues.apache.org/jira/browse/SPARK-22268
> - https://issues.apache.org/jira/browse/SPARK-21875
> Digging through the history a bit, it looks like this check used to run on 
> Jenkins and was previously enabled at 
> https://github.com/apache/spark/pull/10763 but then reverted at 
> https://github.com/apache/spark/commit/4bcea1b8595424678aa6c92d66ba08c92e0fefe5
> We should work out what it takes to enable the Java check in Jenkins so these 
> kinds of errors are caught in CI rather than afterwards post-merge.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22269) Java style checks should be run in Jenkins

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22269:


Assignee: Apache Spark

> Java style checks should be run in Jenkins
> --
>
> Key: SPARK-22269
> URL: https://issues.apache.org/jira/browse/SPARK-22269
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>Assignee: Apache Spark
>Priority: Major
>
> A few times now I've gone to build the master branch and it's failed due to 
> Java style errors, which I've sent in PRs to fix:
> - https://issues.apache.org/jira/browse/SPARK-22268
> - https://issues.apache.org/jira/browse/SPARK-21875
> Digging through the history a bit, it looks like this check used to run on 
> Jenkins and was previously enabled at 
> https://github.com/apache/spark/pull/10763 but then reverted at 
> https://github.com/apache/spark/commit/4bcea1b8595424678aa6c92d66ba08c92e0fefe5
> We should work out what it takes to enable the Java check in Jenkins so these 
> kinds of errors are caught in CI rather than afterwards post-merge.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24350) ClassCastException in "array_position" function

2018-05-22 Thread Alex Wajda (JIRA)
Alex Wajda created SPARK-24350:
--

 Summary: ClassCastException in "array_position" function
 Key: SPARK-24350
 URL: https://issues.apache.org/jira/browse/SPARK-24350
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Alex Wajda
 Fix For: 2.4.0


When calling the {{array_position}} function with a wrong type for the 1st operand, a 
{{ClassCastException}} is thrown instead of an {{AnalysisException}}.

Example:

{code:sql}
select array_position('foo', 'bar')
{code}

{noformat}
java.lang.ClassCastException: org.apache.spark.sql.types.StringType$ cannot be 
cast to org.apache.spark.sql.types.ArrayType
at 
org.apache.spark.sql.catalyst.expressions.ArrayPosition.inputTypes(collectionOperations.scala:1398)
at 
org.apache.spark.sql.catalyst.expressions.ExpectsInputTypes$class.checkInputDataTypes(ExpectsInputTypes.scala:44)
at 
org.apache.spark.sql.catalyst.expressions.ArrayPosition.checkInputDataTypes(collectionOperations.scala:1401)
at 
org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:168)
at 
org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:168)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:256)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:252)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
{noformat}
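
For comparison, a short PySpark sketch (assuming an existing SparkSession {{spark}}): 
the same function over an actual array resolves fine, while the string literal 
currently hits the cast error rather than an analysis error:
{code:python}
# Works: the first argument is an array, returns 2 (positions are 1-based).
spark.sql("SELECT array_position(array('foo', 'bar'), 'bar')").show()

# Currently fails with the ClassCastException above; an AnalysisException
# about the argument type would be the expected behaviour.
spark.sql("SELECT array_position('foo', 'bar')").show()
{code}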



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24351) offsetLog/commitLog purge thresholdBatchId should be computed with current committed epoch but not currentBatchId in CP mode

2018-05-22 Thread huangtengfei (JIRA)
huangtengfei created SPARK-24351:


 Summary: offsetLog/commitLog purge thresholdBatchId should be 
computed with current committed epoch but not currentBatchId in CP mode
 Key: SPARK-24351
 URL: https://issues.apache.org/jira/browse/SPARK-24351
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: huangtengfei


In structured streaming, there is a conf spark.sql.streaming.minBatchesToRetain 
which is set to specify 'The minimum number of batches that must be retained 
and made recoverable' as described in 
[SQLConf](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L802).
 In continuous processing, the metadata purge is triggered when an epoch is 
committed in 
[ContinuousExecution](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala#L306).
 
Since currentBatchId increases independently in cp mode, the current committed 
epoch may be far behind currentBatchId if some task hangs for some time. It is 
not safe to discard the metadata with thresholdBatchId computed based on 
currentBatchId because we may clean all the metadata in the checkpoint 
directory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24351) offsetLog/commitLog purge thresholdBatchId should be computed with current committed epoch but not currentBatchId in CP mode

2018-05-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484173#comment-16484173
 ] 

Apache Spark commented on SPARK-24351:
--

User 'ivoson' has created a pull request for this issue:
https://github.com/apache/spark/pull/21400

> offsetLog/commitLog purge thresholdBatchId should be computed with current 
> committed epoch but not currentBatchId in CP mode
> 
>
> Key: SPARK-24351
> URL: https://issues.apache.org/jira/browse/SPARK-24351
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: huangtengfei
>Priority: Major
>
> In structured streaming, there is a conf 
> spark.sql.streaming.minBatchesToRetain which is set to specify 'The minimum 
> number of batches that must be retained and made recoverable' as described in 
> [SQLConf](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L802).
>  In continuous processing, the metadata purge is triggered when an epoch is 
> committed in 
> [ContinuousExecution](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala#L306).
>  
> Since currentBatchId increases independently in cp mode, the current 
> committed epoch may be far behind currentBatchId if some task hangs for some 
> time. It is not safe to discard the metadata with thresholdBatchId computed 
> based on currentBatchId because we may clean all the metadata in the 
> checkpoint directory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24350) ClassCastException in "array_position" function

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24350:


Assignee: Apache Spark

> ClassCastException in "array_position" function
> ---
>
> Key: SPARK-24350
> URL: https://issues.apache.org/jira/browse/SPARK-24350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Alex Wajda
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.4.0
>
>
> When calling the {{array_position}} function with a wrong type for the 1st operand, 
> a {{ClassCastException}} is thrown instead of an {{AnalysisException}}.
> Example:
> {code:sql}
> select array_position('foo', 'bar')
> {code}
> {noformat}
> java.lang.ClassCastException: org.apache.spark.sql.types.StringType$ cannot 
> be cast to org.apache.spark.sql.types.ArrayType
>   at 
> org.apache.spark.sql.catalyst.expressions.ArrayPosition.inputTypes(collectionOperations.scala:1398)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpectsInputTypes$class.checkInputDataTypes(ExpectsInputTypes.scala:44)
>   at 
> org.apache.spark.sql.catalyst.expressions.ArrayPosition.checkInputDataTypes(collectionOperations.scala:1401)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:256)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:252)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24350) ClassCastException in "array_position" function

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24350:


Assignee: (was: Apache Spark)

> ClassCastException in "array_position" function
> ---
>
> Key: SPARK-24350
> URL: https://issues.apache.org/jira/browse/SPARK-24350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Alex Wajda
>Priority: Major
> Fix For: 2.4.0
>
>
> When calling the {{array_position}} function with a wrong type for the 1st operand, 
> a {{ClassCastException}} is thrown instead of an {{AnalysisException}}.
> Example:
> {code:sql}
> select array_position('foo', 'bar')
> {code}
> {noformat}
> java.lang.ClassCastException: org.apache.spark.sql.types.StringType$ cannot 
> be cast to org.apache.spark.sql.types.ArrayType
>   at 
> org.apache.spark.sql.catalyst.expressions.ArrayPosition.inputTypes(collectionOperations.scala:1398)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpectsInputTypes$class.checkInputDataTypes(ExpectsInputTypes.scala:44)
>   at 
> org.apache.spark.sql.catalyst.expressions.ArrayPosition.checkInputDataTypes(collectionOperations.scala:1401)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:256)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:252)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24345) Improve ParseError stop location when offending symbol is a token

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24345:


Assignee: Apache Spark

> Improve ParseError stop location when offending symbol is a token
> -
>
> Key: SPARK-24345
> URL: https://issues.apache.org/jira/browse/SPARK-24345
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Ruben Fiszel
>Assignee: Apache Spark
>Priority: Minor
>
> In the case where the offending symbol of a syntaxError is a CommonToken, 
> this PR increases the accuracy of the start and stop origin by leveraging the 
> start and stop index information from CommonToken in the syntax error 
> listener.
>  
> [Github PR](https://github.com/apache/spark/pull/21334)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24345) Improve ParseError stop location when offending symbol is a token

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24345:


Assignee: (was: Apache Spark)

> Improve ParseError stop location when offending symbol is a token
> -
>
> Key: SPARK-24345
> URL: https://issues.apache.org/jira/browse/SPARK-24345
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Ruben Fiszel
>Priority: Minor
>
> In the case where the offending symbol of a syntaxError is a CommonToken, 
> this PR increases the accuracy of the start and stop origin by leveraging the 
> start and stop index information from CommonToken in the syntax error 
> listener.
>  
> [Github PR](https://github.com/apache/spark/pull/21334)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24345) Improve ParseError stop location when offending symbol is a token

2018-05-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483895#comment-16483895
 ] 

Apache Spark commented on SPARK-24345:
--

User 'rubenfiszel' has created a pull request for this issue:
https://github.com/apache/spark/pull/21334

> Improve ParseError stop location when offending symbol is a token
> -
>
> Key: SPARK-24345
> URL: https://issues.apache.org/jira/browse/SPARK-24345
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Ruben Fiszel
>Priority: Minor
>
> In the case where the offending symbol of a syntaxError is a CommonToken, 
> this PR increases the accuracy of the start and stop origin by leveraging the 
> start and stop index information from CommonToken in the syntax error 
> listener.
>  
> [Github PR](https://github.com/apache/spark/pull/21334)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20087) Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd listeners

2018-05-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20087:
---

Assignee: Xianjin YE

> Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd 
> listeners
> -
>
> Key: SPARK-20087
> URL: https://issues.apache.org/jira/browse/SPARK-20087
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Charles Lewis
>Assignee: Xianjin YE
>Priority: Major
>
> When tasks end due to an ExceptionFailure, subscribers to onTaskEnd receive 
> accumulators / task metrics for that task, if they were still available. 
> These metrics are not currently sent when tasks are killed intentionally, 
> such as when a speculative retry finishes, and the original is killed (or 
> vice versa). Since we're killing these tasks ourselves, these metrics should 
> almost always exist, and we should treat them the same way as we treat 
> ExceptionFailures.
> Sending these metrics with the TaskKilled end reason makes aggregation across 
> all tasks in an app more accurate. This data can inform decisions about how 
> to tune the speculation parameters in order to minimize duplicated work, and 
> in general, the total cost of an app should include both successful and 
> failed tasks, if that information exists.
> PR: https://github.com/apache/spark/pull/17422



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21673) Spark local directory is not set correctly

2018-05-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21673.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 18894
[https://github.com/apache/spark/pull/18894]

> Spark local directory is not set correctly
> --
>
> Key: SPARK-21673
> URL: https://issues.apache.org/jira/browse/SPARK-21673
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Mesos
>Affects Versions: 2.2.0
>Reporter: Jake Charland
>Assignee: Jake Charland
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently the way Spark discovers the Mesos sandbox is wrong. As seen here 
> https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/util/Utils.scala#L806
>  it is checking the env variable called MESOS_DIRECTORY; however, this env 
> variable was deprecated (see 
> https://www.mail-archive.com/dev@mesos.apache.org/msg36621.html) in favor of 
> using the MESOS_SANDBOX env variable. This should be updated in the Spark code to 
> reflect this change in Mesos.
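
Illustrative sketch only (not the actual Utils.scala logic): prefer the non-deprecated 
variable and fall back to the old one:
{code:python}
import os

# Prefer MESOS_SANDBOX; fall back to the deprecated MESOS_DIRECTORY if needed.
sandbox_dir = os.environ.get("MESOS_SANDBOX") or os.environ.get("MESOS_DIRECTORY")
print(sandbox_dir)
{code}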



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21673) Spark local directory is not set correctly

2018-05-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-21673:
-

Assignee: Jake Charland

> Spark local directory is not set correctly
> --
>
> Key: SPARK-21673
> URL: https://issues.apache.org/jira/browse/SPARK-21673
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Mesos
>Affects Versions: 2.2.0
>Reporter: Jake Charland
>Assignee: Jake Charland
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently the way Spark discovers the Mesos sandbox is wrong. As seen here 
> https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/util/Utils.scala#L806
>  it is checking the env variable called MESOS_DIRECTORY; however, this env 
> variable was deprecated (see 
> https://www.mail-archive.com/dev@mesos.apache.org/msg36621.html) in favor of 
> using the MESOS_SANDBOX env variable. This should be updated in the Spark code to 
> reflect this change in Mesos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24313) Collection functions interpreted execution doesn't work with complex types

2018-05-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24313.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21361
[https://github.com/apache/spark/pull/21361]

> Collection functions interpreted execution doesn't work with complex types
> --
>
> Key: SPARK-24313
> URL: https://issues.apache.org/jira/browse/SPARK-24313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Critical
>  Labels: correctness
> Fix For: 2.4.0
>
>
> Several functions working on collections return incorrect results for complex 
> data types in interpreted mode. In particular, we consider the complex data types 
> BINARY and ARRAY. The list of the affected functions is: {{array_contains}}, 
> {{array_position}}, {{element_at}} and {{GetMapValue}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24313) Collection functions interpreted execution doesn't work with complex types

2018-05-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24313:
---

Assignee: Marco Gaido

> Collection functions interpreted execution doesn't work with complex types
> --
>
> Key: SPARK-24313
> URL: https://issues.apache.org/jira/browse/SPARK-24313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Critical
>  Labels: correctness
> Fix For: 2.4.0
>
>
> Several functions working on collections return incorrect results for complex 
> data types in interpreted mode. In particular, we consider the complex data types 
> BINARY and ARRAY. The list of the affected functions is: {{array_contains}}, 
> {{array_position}}, {{element_at}} and {{GetMapValue}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24334) Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory allocator

2018-05-22 Thread Mateusz Pieniak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483991#comment-16483991
 ] 

Mateusz Pieniak edited comment on SPARK-24334 at 5/22/18 1:59 PM:
--

I came across this issue while running my custom apply function on a larger 
dataset. I got the exception:
{code:java}
SparkException: Job aborted due to stage failure: Task 0 in stage 43.0 failed 4 
times, most recent failure: Lost task 0.3 in stage 43.0 (TID 3108, 
10.217.183.141, executor 3): 
org.apache.spark.util.TaskCompletionListenerException: Memory was leaked by 
query. Memory leaked: (482816) Allocator(stdout writer for 
/databricks/python/bin/python) 0/482816/482816/9223372036854775807 
(res/actual/peak/limit) at 
org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:153) at 
org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:131) 
at org.apache.spark.scheduler.Task.run(Task.scala:127) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:350) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748){code}
 


was (Author: pi3ni0):
I came across with this issue while running my custom apply function on larger 
dataset. I got the exception:

 
 
{code:java}
SparkException: Job aborted due to stage failure: Task 0 in stage 43.0 failed 4 
times, most recent failure: Lost task 0.3 in stage 43.0 (TID 3108, 
10.217.183.141, executor 3): 
org.apache.spark.util.TaskCompletionListenerException: Memory was leaked by 
query. Memory leaked: (482816) Allocator(stdout writer for 
/databricks/python/bin/python) 0/482816/482816/9223372036854775807 
(res/actual/peak/limit) at 
org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:153) at 
org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:131) 
at org.apache.spark.scheduler.Task.run(Task.scala:127) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:350) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748){code}
 

> Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory 
> allocator
> -
>
> Key: SPARK-24334
> URL: https://issues.apache.org/jira/browse/SPARK-24334
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Priority: Major
>
> Currently, ArrowPythonRunner has two threads that free the Arrow vector 
> schema root and allocator: the main writer thread and the task completion 
> listener thread. 
> Having both threads do the cleanup leads to weird cases (e.g., negative ref 
> count, NPE, and memory leak exceptions) when an exception is thrown from the 
> user function.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24334) Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory allocator

2018-05-22 Thread Mateusz Pieniak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483991#comment-16483991
 ] 

Mateusz Pieniak edited comment on SPARK-24334 at 5/22/18 1:59 PM:
--

I came across this issue while running my custom apply function on a larger 
dataset. It works on a smaller dataset. I got the exception:
{code:java}
SparkException: Job aborted due to stage failure: Task 0 in stage 43.0 failed 4 
times, most recent failure: Lost task 0.3 in stage 43.0 (TID 3108, 
10.217.183.141, executor 3): 
org.apache.spark.util.TaskCompletionListenerException: Memory was leaked by 
query. Memory leaked: (482816) Allocator(stdout writer for 
/databricks/python/bin/python) 0/482816/482816/9223372036854775807 
(res/actual/peak/limit) at 
org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:153) at 
org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:131) 
at org.apache.spark.scheduler.Task.run(Task.scala:127) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:350) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748){code}
 


was (Author: pi3ni0):
I came across with this issue while running my custom apply function on larger 
dataset. I got the exception:
{code:java}
SparkException: Job aborted due to stage failure: Task 0 in stage 43.0 failed 4 
times, most recent failure: Lost task 0.3 in stage 43.0 (TID 3108, 
10.217.183.141, executor 3): 
org.apache.spark.util.TaskCompletionListenerException: Memory was leaked by 
query. Memory leaked: (482816) Allocator(stdout writer for 
/databricks/python/bin/python) 0/482816/482816/9223372036854775807 
(res/actual/peak/limit) at 
org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:153) at 
org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:131) 
at org.apache.spark.scheduler.Task.run(Task.scala:127) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:350) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748){code}
 

> Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory 
> allocator
> -
>
> Key: SPARK-24334
> URL: https://issues.apache.org/jira/browse/SPARK-24334
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Priority: Major
>
> Currently, ArrowPythonRunner has two threads that free the Arrow vector 
> schema root and allocator: the main writer thread and the task completion 
> listener thread. 
> Having both threads do the cleanup leads to weird cases (e.g., negative ref 
> count, NPE, and memory leak exceptions) when an exception is thrown from the 
> user function.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24334) Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory allocator

2018-05-22 Thread Mateusz Pieniak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483991#comment-16483991
 ] 

Mateusz Pieniak commented on SPARK-24334:
-

I came across this issue while running my custom apply function on a larger 
dataset. I got the exception:

 
 
{code:java}
SparkException: Job aborted due to stage failure: Task 0 in stage 43.0 failed 4 
times, most recent failure: Lost task 0.3 in stage 43.0 (TID 3108, 
10.217.183.141, executor 3): 
org.apache.spark.util.TaskCompletionListenerException: Memory was leaked by 
query. Memory leaked: (482816) Allocator(stdout writer for 
/databricks/python/bin/python) 0/482816/482816/9223372036854775807 
(res/actual/peak/limit) at 
org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:153) at 
org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:131) 
at org.apache.spark.scheduler.Task.run(Task.scala:127) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:350) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748){code}
 

> Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory 
> allocator
> -
>
> Key: SPARK-24334
> URL: https://issues.apache.org/jira/browse/SPARK-24334
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Priority: Major
>
> Currently, ArrowPythonRunner has two threads that free the Arrow vector 
> schema root and allocator: the main writer thread and the task completion 
> listener thread. 
> Having both threads do the cleanup leads to weird cases (e.g., negative ref 
> count, NPE, and memory leak exceptions) when an exception is thrown from the 
> user function.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24244) Parse only required columns of CSV file

2018-05-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24244.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21296
[https://github.com/apache/spark/pull/21296]

> Parse only required columns of CSV file
> ---
>
> Key: SPARK-24244
> URL: https://issues.apache.org/jira/browse/SPARK-24244
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> The uniVocity parser allows specifying only the required column names or 
> indexes for parsing, for example:
> {code}
> // Here we select only the columns by their indexes.
> // The parser just skips the values in other columns
> parserSettings.selectIndexes(4, 0, 1);
> CsvParser parser = new CsvParser(parserSettings);
> {code}
> We need to modify *UnivocityParser* to extract only the needed columns from 
> requiredSchema.
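
A rough sketch of how the required schema could be mapped onto that uniVocity 
API (the {{fullSchema}}/{{requiredSchema}} parameters are illustrative, not the 
actual Spark change):
{code:java}
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}
import org.apache.spark.sql.types.StructType

// Translate the required column names into CSV positions, then let uniVocity
// skip the values of every other column while scanning each line.
def parserFor(fullSchema: StructType, requiredSchema: StructType): CsvParser = {
  val settings = new CsvParserSettings()
  val indexes = requiredSchema.fieldNames.map(fullSchema.fieldIndex)
  settings.selectIndexes(indexes.map(i => Integer.valueOf(i)): _*)
  new CsvParser(settings)
}
{code}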



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24348) scala.MatchError in the "element_at" expression

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24348:


Assignee: Apache Spark

> scala.MatchError in the "element_at" expression
> ---
>
> Key: SPARK-24348
> URL: https://issues.apache.org/jira/browse/SPARK-24348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Alex Wajda
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.4.0
>
>
> When {{element_at}} is called with a wrong first operand type, a 
> {{scala.MatchError}} is thrown instead of an {{AnalysisException}}.
> *Example:*
> {code:sql}
> select element_at('foo', 1)
> {code}
> results in:
> {noformat}
> scala.MatchError: StringType (of class org.apache.spark.sql.types.StringType$)
>   at 
> org.apache.spark.sql.catalyst.expressions.ElementAt.inputTypes(collectionOperations.scala:1469)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpectsInputTypes$class.checkInputDataTypes(ExpectsInputTypes.scala:44)
>   at 
> org.apache.spark.sql.catalyst.expressions.ElementAt.checkInputDataTypes(collectionOperations.scala:1478)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:256)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:252)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24334) Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory allocator

2018-05-22 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484086#comment-16484086
 ] 

Li Jin commented on SPARK-24334:


[~pi3ni0] Did it happen for you when your UDF throws an exception?

> Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory 
> allocator
> -
>
> Key: SPARK-24334
> URL: https://issues.apache.org/jira/browse/SPARK-24334
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Priority: Major
>
> Currently, ArrowPythonRunner has two threads that free the Arrow vector 
> schema root and allocator: the main writer thread and the task completion 
> listener thread. 
> Having both threads do the cleanup leads to weird cases (e.g., negative ref 
> count, NPE, and memory leak exceptions) when an exception is thrown from the 
> user function.
>  
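
A common way to make that cleanup idempotent, shown only as a sketch (not the 
actual Spark patch; the {{root}}/{{allocator}} names in the usage comment are 
illustrative):
{code:java}
import java.util.concurrent.atomic.AtomicBoolean

// Whichever thread calls close() first performs the release; any later call
// becomes a no-op, so the resources are never freed twice.
final class CloseOnce(doClose: () => Unit) {
  private val closed = new AtomicBoolean(false)
  def close(): Unit = if (closed.compareAndSet(false, true)) doClose()
}

// e.g. val cleanup = new CloseOnce(() => { root.close(); allocator.close() })
// and both the writer thread and the task completion listener call cleanup.close().
{code}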



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24345) Improve ParseError stop location when offending symbol is a token

2018-05-22 Thread Ruben Fiszel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruben Fiszel updated SPARK-24345:
-
Description: 
In the case where the offending symbol of a syntaxError is a CommonToken, this 
PR increases the accuracy of the start and stop origin by leveraging the start 
and stop index information from CommonToken in the syntax error listener.

[Github PR|https://github.com/apache/spark/pull/21334]

  was:
In the case where the offending symbol of a syntaxError is a CommonToken, this 
PR increases the accuracy of the start and stop origin by leveraging the start 
and stop index information from CommonToken in the syntax error listener.

 

[Github PR](https://github.com/apache/spark/pull/21334)


> Improve ParseError stop location when offending symbol is a token
> -
>
> Key: SPARK-24345
> URL: https://issues.apache.org/jira/browse/SPARK-24345
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Ruben Fiszel
>Priority: Minor
>
> In the case where the offending symbol of a syntaxError is a CommonToken, 
> this PR increases the accuracy of the start and stop origin by leveraging the 
> start and stop index information from CommonToken in the syntax error 
> listener.
> [Github PR|https://github.com/apache/spark/pull/21334]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20087) Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd listeners

2018-05-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20087.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21165
[https://github.com/apache/spark/pull/21165]

> Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd 
> listeners
> -
>
> Key: SPARK-20087
> URL: https://issues.apache.org/jira/browse/SPARK-20087
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Charles Lewis
>Assignee: Xianjin YE
>Priority: Major
> Fix For: 2.4.0
>
>
> When tasks end due to an ExceptionFailure, subscribers to onTaskEnd receive 
> accumulators / task metrics for that task, if they were still available. 
> These metrics are not currently sent when tasks are killed intentionally, 
> such as when a speculative retry finishes, and the original is killed (or 
> vice versa). Since we're killing these tasks ourselves, these metrics should 
> almost always exist, and we should treat them the same way as we treat 
> ExceptionFailures.
> Sending these metrics with the TaskKilled end reason makes aggregation across 
> all tasks in an app more accurate. This data can inform decisions about how 
> to tune the speculation parameters in order to minimize duplicated work, and 
> in general, the total cost of an app should include both successful and 
> failed tasks, if that information exists.
> PR: https://github.com/apache/spark/pull/17422



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24244) Parse only required columns of CSV file

2018-05-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24244:
---

Assignee: Maxim Gekk

> Parse only required columns of CSV file
> ---
>
> Key: SPARK-24244
> URL: https://issues.apache.org/jira/browse/SPARK-24244
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> The uniVocity parser allows specifying only the required column names or 
> indexes for parsing, for example:
> {code}
> // Here we select only the columns by their indexes.
> // The parser just skips the values in other columns
> parserSettings.selectIndexes(4, 0, 1);
> CsvParser parser = new CsvParser(parserSettings);
> {code}
> We need to modify *UnivocityParser* to extract only the needed columns from 
> requiredSchema.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24349) obtainDelegationTokens() exits JVM if Driver use JDBC instead of using metastore

2018-05-22 Thread Lantao Jin (JIRA)
Lantao Jin created SPARK-24349:
--

 Summary: obtainDelegationTokens() exits JVM if Driver use JDBC 
instead of using metastore 
 Key: SPARK-24349
 URL: https://issues.apache.org/jira/browse/SPARK-24349
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Lantao Jin


In [SPARK-23639|https://issues.apache.org/jira/browse/SPARK-23639], using 
--proxy-user to impersonate will invoke obtainDelegationTokens(), but when the 
current driver uses JDBC instead of the metastore, it fails with
{code}
WARN HiveConf: HiveConf of name hive.server2.enable.impersonation does not exist
Exception in thread "main" java.lang.IllegalArgumentException: requirement 
failed: Hive metastore uri undefined
at scala.Predef$.require(Predef.scala:224)
at 
org.apache.spark.sql.hive.thriftserver.HiveCredentialProvider.obtainCredentials(HiveCredentialProvider.scala:73)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:56)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:288)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:137)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at 
org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:169)
at 
org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/05/22 05:24:16 INFO ShutdownHookManager: Shutdown hook called
18/05/22 05:24:16 INFO ShutdownHookManager: Deleting directory 
/tmp/spark-b63ad788-1a47-4326-9972-c4fde1dc19c3
{code}
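
One conceivable guard, sketched purely to illustrate the failure mode (not the 
actual fix; the helper and its parameters are made up for the example):
{code:java}
import org.apache.hadoop.conf.Configuration

// Only attempt Hive delegation-token fetching when a metastore URI is actually
// configured, instead of letting the require(...) above terminate the JVM.
def maybeObtainHiveTokens(hadoopConf: Configuration)(obtain: () => Unit): Unit = {
  if (hadoopConf.getTrimmed("hive.metastore.uris", "").nonEmpty) {
    obtain()
  }
}
{code}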



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24345) Improve ParseError stop location when offending symbol is a token

2018-05-22 Thread Ruben Fiszel (JIRA)
Ruben Fiszel created SPARK-24345:


 Summary: Improve ParseError stop location when offending symbol is 
a token
 Key: SPARK-24345
 URL: https://issues.apache.org/jira/browse/SPARK-24345
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Ruben Fiszel


In the case where the offending symbol of a syntaxError is a CommonToken, this 
PR increases the accuracy of the start and stop origin by leveraging the start 
and stop index information from CommonToken in the syntax error listener.

 

[Github PR](https://github.com/apache/spark/pull/21334)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24273) Failure while using .checkpoint method

2018-05-22 Thread Jami Malikzade (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483908#comment-16483908
 ] 

Jami Malikzade commented on SPARK-24273:


[~kiszk]

I went deeper and found more:

This way it works and creates an empty RDD, as the filter returns 0 rows:

val df = spark.read.option("header","true").option("sep",",")
  .schema(testschema)
  .csv("s3a://phub-1526909295-81/salary.csv")
  .filter('salary > 300)
  .withColumn("month", when('name === "Smith", "6").otherwise("3"))
df.checkpoint()

df.show()

 

This way it fails on df.show(), and a non-empty file is created (even though 
the filter returns 0 rows):

val df = spark.read.option("header","true").option("sep",",")
  .schema(testschema)
  .csv("s3a://phub-1526909295-81/salary.csv")
  .filter('salary > 300)
  .withColumn("month", when('name === "Smith", "6").otherwise("3"))
  .checkpoint()

df.show()
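
Worth noting when comparing the two variants: {{Dataset.checkpoint()}} does not 
checkpoint {{df}} in place, it returns a new, checkpointed Dataset. A sketch of 
keeping that returned Dataset (reusing the {{df}} from the first snippet):
{code:java}
// checkpoint() returns the checkpointed Dataset; keep it and call actions on it.
// Otherwise df.show() keeps using the original, un-checkpointed plan.
val checkpointed = df.checkpoint()
checkpointed.show()
{code}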

> Failure while using .checkpoint method
> --
>
> Key: SPARK-24273
> URL: https://issues.apache.org/jira/browse/SPARK-24273
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.0
>Reporter: Jami Malikzade
>Priority: Major
>
> We are getting following error:
> com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 416, AWS 
> Service: Amazon S3, AWS Request ID: 
> tx14126-005ae9bfd9-9ed9ac2-default, AWS Error Code: 
> InvalidRange, AWS Error Message: null, S3 Extended Request ID: 
> 9ed9ac2-default-default"
> when we use checkpoint method as below.
> val streamBucketDF = streamPacketDeltaDF
>  .filter('timeDelta > maxGap && 'timeDelta <= 3)
>  .withColumn("bucket", when('timeDelta <= mediumGap, "medium")
>  .otherwise("large")
>  )
>  .checkpoint()
> Do you have an idea how to prevent the invalid range header from being sent, 
> or how it can be worked around or fixed?
> Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24346) Executors are unable to fetch remote cache blocks

2018-05-22 Thread Truong Duc Kien (JIRA)
Truong Duc Kien created SPARK-24346:
---

 Summary: Executors are unable to fetch remote cache blocks
 Key: SPARK-24346
 URL: https://issues.apache.org/jira/browse/SPARK-24346
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 2.3.0
 Environment: OS: Centos 7.3
Cluster: Hortonwork HDP 2.6.5 with Spark 2.3.0
Reporter: Truong Duc Kien


After we upgraded from Spark 2.2.1 to Spark 2.3.0, our Spark jobs took a massive 
performance hit because executors became unable to fetch remote cache blocks 
from each other. The scenario is:

1. An executor creates a connection and sends a ChunkFetchRequest message to 
another executor. 
2. This request arrives at the target executor, which sends back a 
ChunkFetchSuccess response.
3. The ChunkFetchSuccess message never arrives.
4. The connection between these two executors is killed by the originating 
executor after 120s of idleness. At the same time, the other executor reports 
that it failed to send the ChunkFetchSuccess because the pipe is closed.

This process repeats itself 3 times, delaying our jobs by 6 minutes; then the 
originating executor gives up on fetching, computes the block by itself, and 
the job continues.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24348) scala.MatchError in the "element_at" expression

2018-05-22 Thread Alex Wajda (JIRA)
Alex Wajda created SPARK-24348:
--

 Summary: scala.MatchError in the "element_at" expression
 Key: SPARK-24348
 URL: https://issues.apache.org/jira/browse/SPARK-24348
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Alex Wajda
 Fix For: 2.4.0


When {{element_at}} is called with a wrong first operand type, a 
{{scala.MatchError}} is thrown instead of an {{AnalysisException}}.

*Example:*
{code:sql}
select element_at('foo', 1)
{code}

results in:
{noformat}
scala.MatchError: StringType (of class org.apache.spark.sql.types.StringType$)
at 
org.apache.spark.sql.catalyst.expressions.ElementAt.inputTypes(collectionOperations.scala:1469)
at 
org.apache.spark.sql.catalyst.expressions.ExpectsInputTypes$class.checkInputDataTypes(ExpectsInputTypes.scala:44)
at 
org.apache.spark.sql.catalyst.expressions.ElementAt.checkInputDataTypes(collectionOperations.scala:1478)
at 
org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:168)
at 
org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:168)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:256)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:252)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24348) scala.MatchError in the "element_at" expression

2018-05-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484046#comment-16484046
 ] 

Apache Spark commented on SPARK-24348:
--

User 'wajda' has created a pull request for this issue:
https://github.com/apache/spark/pull/21395

> scala.MatchError in the "element_at" expression
> ---
>
> Key: SPARK-24348
> URL: https://issues.apache.org/jira/browse/SPARK-24348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Alex Wajda
>Priority: Major
> Fix For: 2.4.0
>
>
> When {{element_at}} is called with a wrong first operand type, a 
> {{scala.MatchError}} is thrown instead of an {{AnalysisException}}.
> *Example:*
> {code:sql}
> select element_at('foo', 1)
> {code}
> results in:
> {noformat}
> scala.MatchError: StringType (of class org.apache.spark.sql.types.StringType$)
>   at 
> org.apache.spark.sql.catalyst.expressions.ElementAt.inputTypes(collectionOperations.scala:1469)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpectsInputTypes$class.checkInputDataTypes(ExpectsInputTypes.scala:44)
>   at 
> org.apache.spark.sql.catalyst.expressions.ElementAt.checkInputDataTypes(collectionOperations.scala:1478)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:256)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:252)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24348) scala.MatchError in the "element_at" expression

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24348:


Assignee: (was: Apache Spark)

> scala.MatchError in the "element_at" expression
> ---
>
> Key: SPARK-24348
> URL: https://issues.apache.org/jira/browse/SPARK-24348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Alex Wajda
>Priority: Major
> Fix For: 2.4.0
>
>
> When {{element_at}} is called with a wrong first operand type, a 
> {{scala.MatchError}} is thrown instead of an {{AnalysisException}}.
> *Example:*
> {code:sql}
> select element_at('foo', 1)
> {code}
> results in:
> {noformat}
> scala.MatchError: StringType (of class org.apache.spark.sql.types.StringType$)
>   at 
> org.apache.spark.sql.catalyst.expressions.ElementAt.inputTypes(collectionOperations.scala:1469)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpectsInputTypes$class.checkInputDataTypes(ExpectsInputTypes.scala:44)
>   at 
> org.apache.spark.sql.catalyst.expressions.ElementAt.checkInputDataTypes(collectionOperations.scala:1478)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:256)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:252)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24269) Infer nullability rather than declaring all columns as nullable

2018-05-22 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484071#comment-16484071
 ] 

Simeon Simeonov commented on SPARK-24269:
-

There are many reasons why correct nullability inference is important for any 
data source, not just CSV & JSON. 
 # It can be used to verify the foundation of data contracts, especially in 
data exchange with third parties via something as simple as schema (StructType) 
equality. The common practice is to persist a JSON representation of the 
expected schema.
 # It can substantially improve performance and reduce memory use when dealing 
with Dataset[A <: Product] by using B <: AnyVal directly in case classes as 
opposed to via Option[B].
 # It can simplify the use of code-generation tools.
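
As an example of (1), a contract check can be as small as comparing schemas 
(a sketch; {{expectedJson}} and {{incoming}} are illustrative):
{code:java}
import org.apache.spark.sql.types.{DataType, StructType}

// The agreed-upon schema is persisted once as JSON (e.g. df.schema.json)...
val expected = DataType.fromJson(expectedJson).asInstanceOf[StructType]

// ...and every incoming dataset is verified against it, nullability included.
require(incoming.schema == expected, "dataset violates the agreed schema contract")
{code}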

As an example of (2), consider the following:
{code:java}
import org.apache.spark.util.SizeEstimator
import scala.util.Random.nextInt

case class WithNulls(a: Option[Int], b: Option[Int])
case class WithoutNulls(a: Int, b: Int)

val sizeWith = SizeEstimator.estimate(WithNulls(Some(nextInt), Some(nextInt)))
// 88

val sizeWithout = SizeEstimator.estimate(WithoutNulls(nextInt, nextInt))
// 24

val percentMemoryReduction = 100.0 * (sizeWith - sizeWithout) / sizeWith
// 72.7{code}
I would argue that 70+% savings in memory use are a pretty big deal. The 
savings can be even bigger in the cases of many columns with small primitive 
types (Byte, Short, ...).

As an example of (3), consider tools that code-generate case classes from 
schema. We use tools like that at Swoop for efficient & performant 
transformations that cannot easily happen via the provided operations that work 
on internal rows. Without proper nullability inference, manual configuration 
has to be provided to these tools. We do this routinely, even for ad hoc data 
transformations in notebooks.

[~Teng Peng] I agree that this behavior should not be the default given Spark's 
current behavior. It should be activated via an option.

> Infer nullability rather than declaring all columns as nullable
> ---
>
> Key: SPARK-24269
> URL: https://issues.apache.org/jira/browse/SPARK-24269
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, the CSV and JSON datasources set the *nullable* flag to true 
> independently of the data itself during schema inference.
> JSON: 
> https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala#L126
> CSV: 
> https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L51
> For example, source dataset has schema:
> {code}
> root
>  |-- item_id: integer (nullable = false)
>  |-- country: string (nullable = false)
>  |-- state: string (nullable = false)
> {code}
> If we save it and read it again, the schema of the inferred dataset is
> {code}
> root
>  |-- item_id: integer (nullable = true)
>  |-- country: string (nullable = true)
>  |-- state: string (nullable = true)
> {code}
> The ticket aims to set the nullable flag more precisely during schema 
> inference, based on the data that is read.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6235) Address various 2G limits

2018-05-22 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484069#comment-16484069
 ] 

Imran Rashid commented on SPARK-6235:
-

Would be nice to find a better home for this, but for now I wanted to share the 
test code I'm running to see if there are cases I'm missing:

https://github.com/squito/spark_2gb_test/blob/master/src/main/scala/com/cloudera/sparktest/LargeBlocks.scala

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Priority: Major
> Attachments: SPARK-6235_Design_V0.02.pdf
>
>
> An umbrella ticket to track the various 2G limits we have in Spark, due to the 
> use of byte arrays and ByteBuffers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24341) Codegen compile error from predicate subquery

2018-05-22 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484225#comment-16484225
 ] 

Xiao Li commented on SPARK-24341:
-

cc [~dkbiswal] Could you take a look at this too?

> Codegen compile error from predicate subquery
> -
>
> Key: SPARK-24341
> URL: https://issues.apache.org/jira/browse/SPARK-24341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Juliusz Sompolski
>Priority: Minor
>
> Ran on master:
> {code}
> drop table if exists juleka;
> drop table if exists julekb;
> create table juleka (a integer, b integer);
> create table julekb (na integer, nb integer);
> insert into juleka values (1,1);
> insert into julekb values (1,1);
> select * from juleka where (a, b) not in (select (na, nb) from julekb);
> {code}
> Results in:
> {code}
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 27, Column 29: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 27, Column 29: Cannot compare types "int" and 
> "org.apache.spark.sql.catalyst.InternalRow"
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
>   at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>   at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
>   at 
> com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2344)
>   at 
> com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2316)
>   at 
> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:3932)
>   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1415)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:92)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:46)
>   at 
> org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:380)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:99)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:97)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203)
>   at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>   at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
>   at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:203)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:202)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:49)
>   at 
> org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:126)
>   at 
> org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:125)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:111)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  

[jira] [Comment Edited] (SPARK-13638) Support for saving with a quote mode

2018-05-22 Thread Umesh K (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484255#comment-16484255
 ] 

Umesh K edited comment on SPARK-13638 at 5/22/18 4:36 PM:
--

[~rxin] Just want to confirm: are we going to have quoteMode in the future, or 
will we always have to use quoteAll?


was (Author: kachau):
Just want to confirm are we ever going to have quoteMode or we always have to 
use quoteAll?

> Support for saving with a quote mode
> 
>
> Key: SPARK-13638
> URL: https://issues.apache.org/jira/browse/SPARK-13638
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Jurriaan Pruis
>Priority: Minor
> Fix For: 2.0.0
>
>
> https://github.com/databricks/spark-csv/pull/254
> tobithiel reported this.
> {quote}
> I'm dealing with some messy csv files and being able to just quote all fields 
> is very useful,
> so that other applications don't misunderstand the file because of some 
> sketchy characters
> {quote}
> When writing there are several quote modes in apache commons csv. (See 
> https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html)
> This might have to be supported.
> However, it looks like the univocity parser used for writing (currently the 
> only supported library) does not support this quote mode. I think we can 
> drop this backwards compatibility if we are not going to add Apache Commons 
> CSV.
> This is a reminder that it might break backwards compatibility for the 
> options, {{quoteMode}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24351) offsetLog/commitLog purge thresholdBatchId should be computed with current committed epoch but not currentBatchId in CP mode

2018-05-22 Thread huangtengfei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangtengfei updated SPARK-24351:
-
Description: 
In structured streaming, there is a conf spark.sql.streaming.minBatchesToRetain 
which is set to specify 'The minimum number of batches that must be retained 
and made recoverable' as described in 
[SQLConf|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L802].
 In continuous processing, the metadata purge is triggered when an epoch is 
committed in 
[ContinuousExecution|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala#L306].
 
 Since currentBatchId increases independently in cp mode, the current committed 
epoch may be far behind currentBatchId if some task hangs for some time. It is 
not safe to discard the metadata with thresholdBatchId computed based on 
currentBatchId because we may clean all the metadata in the checkpoint 
directory.

  was:
In structured streaming, there is a conf spark.sql.streaming.minBatchesToRetain 
which is set to specify 'The minimum number of batches that must be retained 
and made recoverable' as described in 
[SQLConf](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L802).
 In continuous processing, the metadata purge is triggered when an epoch is 
committed in 
[ContinuousExecution](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala#L306).
 
Since currentBatchId increases independently in cp mode, the current committed 
epoch may be far behind currentBatchId if some task hangs for some time. It is 
not safe to discard the metadata with thresholdBatchId computed based on 
currentBatchId because we may clean all the metadata in the checkpoint 
directory.


> offsetLog/commitLog purge thresholdBatchId should be computed with current 
> committed epoch but not currentBatchId in CP mode
> 
>
> Key: SPARK-24351
> URL: https://issues.apache.org/jira/browse/SPARK-24351
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: huangtengfei
>Priority: Major
>
> In structured streaming, there is a conf 
> spark.sql.streaming.minBatchesToRetain which is set to specify 'The minimum 
> number of batches that must be retained and made recoverable' as described in 
> [SQLConf|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L802].
>  In continuous processing, the metadata purge is triggered when an epoch is 
> committed in 
> [ContinuousExecution|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala#L306].
>  
>  Since currentBatchId increases independently in cp mode, the current 
> committed epoch may be far behind currentBatchId if some task hangs for some 
> time. It is not safe to discard the metadata with thresholdBatchId computed 
> based on currentBatchId because we may clean all the metadata in the 
> checkpoint directory.
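
In other words, the purge threshold should be derived from the latest committed 
epoch rather than from currentBatchId. A sketch of the intended computation 
(illustrative names, not the actual patch):
{code:java}
// Keep at least `minBatchesToRetain` committed epochs recoverable; a fast-moving
// currentBatchId must not push the threshold past what has actually been committed.
def purgeThreshold(committedEpoch: Long, minBatchesToRetain: Int): Long =
  committedEpoch - minBatchesToRetain + 1

// e.g. val threshold = purgeThreshold(epoch, conf.minBatchesToRetain)
//      if (threshold > 0) { offsetLog.purge(threshold); commitLog.purge(threshold) }
{code}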



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24353) Add support for pod affinity/anti-affinity

2018-05-22 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-24353:
---

 Summary: Add support for pod affinity/anti-affinity
 Key: SPARK-24353
 URL: https://issues.apache.org/jira/browse/SPARK-24353
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Stavros Kontopoulos


Spark on K8s allows placing driver/executor pods on specific k8s nodes using 
nodeSelector. NodeSelector is a very simple way to constrain pods to nodes with 
particular labels. The 
[affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
 feature, currently in beta, greatly expands the types of constraints you can 
express. The aim here is to bring this support to Spark on K8s. Note that 
nodeSelector will be deprecated in the future.

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24257) LongToUnsafeRowMap calculate the new size may be wrong

2018-05-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24257:

Labels: correctness  (was: )

> LongToUnsafeRowMap calculate the new size may be wrong
> --
>
> Key: SPARK-24257
> URL: https://issues.apache.org/jira/browse/SPARK-24257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: dzcxzl
>Priority: Minor
>  Labels: correctness
>
> In LongToUnsafeRowMap, the new size is calculated simply by multiplying the 
> current size by 2. The newly allocated size may still not be enough to store 
> the data, so some data is lost and the data read back is dirty.
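
A sketch of the hazard being described (illustrative, not the map's actual 
code): doubling once is not guaranteed to make room for the bytes being 
appended, so the growth step has to keep growing until the request fits.
{code:java}
// Grow a buffer so that `needed` extra bytes always fit after `used` bytes.
def grow(page: Array[Byte], used: Int, needed: Int): Array[Byte] = {
  var newSize = math.max(page.length, 1)
  // A single "length * 2" can still be smaller than used + needed for a large
  // append; keep doubling so the write is never silently truncated.
  while (newSize < used + needed) newSize *= 2
  java.util.Arrays.copyOf(page, newSize)
}
{code}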



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24257) LongToUnsafeRowMap calculate the new size may be wrong

2018-05-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24257:

Priority: Blocker  (was: Minor)

> LongToUnsafeRowMap calculate the new size may be wrong
> --
>
> Key: SPARK-24257
> URL: https://issues.apache.org/jira/browse/SPARK-24257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: dzcxzl
>Priority: Blocker
>  Labels: correctness
>
> In LongToUnsafeRowMap, the new size is calculated simply by multiplying the 
> current size by 2. The newly allocated size may still not be enough to store 
> the data, so some data is lost and the data read back is dirty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24352) Flaky test: StandaloneDynamicAllocationSuite

2018-05-22 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-24352:
--

 Summary: Flaky test: StandaloneDynamicAllocationSuite
 Key: SPARK-24352
 URL: https://issues.apache.org/jira/browse/SPARK-24352
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Tests
Affects Versions: 2.3.0
Reporter: Marcelo Vanzin


From jenkins:

[https://amplab.cs.berkeley.edu/jenkins/user/vanzin/my-views/view/Spark/job/spark-branch-2.3-test-maven-hadoop-2.6/384/testReport/junit/org.apache.spark.deploy/StandaloneDynamicAllocationSuite/executor_registration_on_a_blacklisted_host_must_fail/]

 
{noformat}
Error Message
There is already an RpcEndpoint called CoarseGrainedScheduler
Stacktrace
  java.lang.IllegalArgumentException: There is already an RpcEndpoint 
called CoarseGrainedScheduler
  at 
org.apache.spark.rpc.netty.Dispatcher.registerRpcEndpoint(Dispatcher.scala:71)
  at 
org.apache.spark.rpc.netty.NettyRpcEnv.setupEndpoint(NettyRpcEnv.scala:130)
  at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.createDriverEndpointRef(CoarseGrainedSchedulerBackend.scala:396)
  at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.start(CoarseGrainedSchedulerBackend.scala:391)
  at 
org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.start(StandaloneSchedulerBackend.scala:61)
  at 
org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply$mcV$sp(StandaloneDynamicAllocationSuite.scala:512)
  at 
org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply(StandaloneDynamicAllocationSuite.scala:495)
  at 
org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply(StandaloneDynamicAllocationSuite.scala:495)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
  at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
  at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
{noformat}

This actually looks like a previous test is leaving some stuff running and 
making this one fail.
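
A typical guard against that kind of leakage, sketched with ScalaTest 
(illustrative; not necessarily how this suite is structured):
{code:java}
import org.apache.spark.SparkContext
import org.scalatest.{BeforeAndAfterEach, FunSuite}

class ExampleSuite extends FunSuite with BeforeAndAfterEach {
  private var sc: SparkContext = _

  override def afterEach(): Unit = {
    // Stop whatever context the test created so the next test can register its
    // own CoarseGrainedScheduler RPC endpoint without hitting a leftover one.
    try {
      if (sc != null) { sc.stop(); sc = null }
    } finally {
      super.afterEach()
    }
  }
}
{code}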



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24353) Add support for pod affinity/anti-affinity

2018-05-22 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24353:

Description: 
Spark on K8s allows placing driver/executor pods on specific k8s nodes using 
nodeSelector. NodeSelector is a very simple way to constrain pods to nodes with 
particular labels. The 
[affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
 feature, currently in beta, greatly expands the types of constraints you can 
express. The aim here is to bring support for this feature to Spark on K8s. Note 
that nodeSelector will be deprecated in the future.

 

 

  was:
Spark on K8s allows to place driver/executor pods on specific k8s nodes, using 
nodeSelector. NodeSelector is a very simple way to constrain pods to nodes with 
particular labels. The 
[affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
 feature, currently in beta, greatly expands the types of constraints you can 
express. Aim here is to bring support to Spark on K8s. Note here that 
nodeSelector will be deprecated in the future.

 

 


> Add support for pod affinity/anti-affinity
> --
>
> Key: SPARK-24353
> URL: https://issues.apache.org/jira/browse/SPARK-24353
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Spark on K8s allows placing driver/executor pods on specific k8s nodes 
> using nodeSelector. NodeSelector is a very simple way to constrain pods to 
> nodes with particular labels. The 
> [affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
>  feature, currently in beta, greatly expands the types of constraints you can 
> express. The aim here is to bring support for this feature to Spark on K8s. 
> Note that nodeSelector will be deprecated in the future.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13638) Support for saving with a quote mode

2018-05-22 Thread Umesh K (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484255#comment-16484255
 ] 

Umesh K commented on SPARK-13638:
-

Just want to confirm: are we ever going to have quoteMode, or will we always 
have to use quoteAll?

> Support for saving with a quote mode
> 
>
> Key: SPARK-13638
> URL: https://issues.apache.org/jira/browse/SPARK-13638
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Jurriaan Pruis
>Priority: Minor
> Fix For: 2.0.0
>
>
> https://github.com/databricks/spark-csv/pull/254
> tobithiel reported this.
> {quote}
> I'm dealing with some messy csv files and being able to just quote all fields 
> is very useful,
> so that other applications don't misunderstand the file because of some 
> sketchy characters
> {quote}
> When writing there are several quote modes in apache commons csv. (See 
> https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html)
> This might have to be supported.
> However, it looks like the univocity parser used for writing (currently the 
> only supported library) does not support this quote mode. I think we can 
> drop this backwards compatibility if we are not going to add Apache Commons 
> CSV.
> This is a reminder that it might break backwards compatibility for the 
> options, {{quoteMode}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24257) LongToUnsafeRowMap calculate the new size may be wrong

2018-05-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24257:

Target Version/s: 2.3.1  (was: 2.3.2)

> LongToUnsafeRowMap calculate the new size may be wrong
> --
>
> Key: SPARK-24257
> URL: https://issues.apache.org/jira/browse/SPARK-24257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: dzcxzl
>Priority: Blocker
>  Labels: correctness
>
> LongToUnsafeRowMap calculates the new size simply by multiplying the current 
> size by 2. The newly requested size may not be enough to store the data, so 
> some data is lost and the data read back is dirty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24257) LongToUnsafeRowMap calculate the new size may be wrong

2018-05-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24257:

Target Version/s: 2.3.2

> LongToUnsafeRowMap calculate the new size may be wrong
> --
>
> Key: SPARK-24257
> URL: https://issues.apache.org/jira/browse/SPARK-24257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: dzcxzl
>Priority: Blocker
>  Labels: correctness
>
> LongToUnsafeRowMap calculates the new size simply by multiplying the current 
> size by 2. The newly requested size may not be enough to store the data, so 
> some data is lost and the data read back is dirty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24339) spark sql can not prune column in transform/map/reduce query

2018-05-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486561#comment-16486561
 ] 

Hyukjin Kwon commented on SPARK-24339:
--

(Don't set the target versions, which are usually reserved for committers, or the 
fix versions, which are usually set after the issue is actually fixed.)

> spark sql can not prune column in transform/map/reduce query
> 
>
> Key: SPARK-24339
> URL: https://issues.apache.org/jira/browse/SPARK-24339
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1, 2.1.2, 2.2.0, 2.2.1
>Reporter: xdcjie
>Priority: Minor
>  Labels: map, reduce, sql, transform
>
> I was using {{TRANSFORM USING}} with branch-2.1/2.2, and noticed that it will 
> scan all columns of the data for a query like:
> {code:java}
> SELECT TRANSFORM(usid, cch) USING 'python test.py' AS (u1, c1, u2, c2) FROM 
> test_table;{code}
> Its physical plan looks like:
> {code:java}
> == Physical Plan ==
> ScriptTransformation [usid#17, cch#9], python test.py, [u1#784, c1#785, 
> u2#786, c2#787], 
> HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,
> )),List((field.delim,   
> )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
> +- FileScan parquet 
> [sh#0L,clk#1L,chg#2L,qey#3,ship#4,chgname#5,sid#6,bid#7,dis#8L,cch#9,wch#10,wid#11L,arank#12L,rtag#13,iid#14,uid#15,pid#16,usid#17,wdid#18,bid#19,oqleft#20,oqright#21,poqvalue#22,tm#23,...
>  367 more fields] Batched: false, Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/Downloads/part-r-00093-0ef5d59f-2e08-4085-9b46-458a1652932a.g...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct {code}
> In our scenario, the parquet file has 400 columns, so this query takes more time than necessary.
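Until pruning is handled for script transformation, a hedged workaround sketch: 
projecting only the needed columns in a subquery before TRANSFORM should limit 
the scan to those columns (the table and script names are taken from the example 
above; this is not a verified fix for the optimizer itself):
{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("transform-pruning-workaround")
         .enableHiveSupport()   # TRANSFORM ... USING requires Hive support
         .getOrCreate())

# Select only the columns the script needs before applying TRANSFORM, so the
# FileScan reads 2 columns instead of all ~400.
pruned = spark.sql("""
    SELECT TRANSFORM(usid, cch) USING 'python test.py' AS (u1, c1, u2, c2)
    FROM (SELECT usid, cch FROM test_table) t
""")
pruned.explain()
{code}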



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24358) createDataFrame in Python 3 should be able to infer bytes type as Binary type

2018-05-22 Thread Joel Croteau (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Croteau updated SPARK-24358:
-
 Labels: Python3  (was: )
Description: createDataFrame can infer Python 3's bytearray type as a 
Binary. Since bytes is just the immutable, hashable version of this same 
structure, it makes sense for the same thing to apply there.  (was: 
createDataFrame can infer Python's bytearray type as a Binary. Since bytes is 
just the immutable, hashable version of this same structure, it makes sense for 
the same thing to apply there.)
Summary: createDataFrame in Python 3 should be able to infer bytes type 
as Binary type  (was: createDataFrame in Python should be able to infer bytes 
type as Binary type)

> createDataFrame in Python 3 should be able to infer bytes type as Binary type
> -
>
> Key: SPARK-24358
> URL: https://issues.apache.org/jira/browse/SPARK-24358
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joel Croteau
>Priority: Minor
>  Labels: Python3
>
> createDataFrame can infer Python 3's bytearray type as a Binary. Since bytes 
> is just the immutable, hashable version of this same structure, it makes 
> sense for the same thing to apply there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22366) Support ignoreMissingFiles flag parallel to ignoreCorruptFiles

2018-05-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486717#comment-16486717
 ] 

Hyukjin Kwon commented on SPARK-22366:
--

Oops, sorry. I mistakenly edited the JIRA. I reverted it back.

> Support ignoreMissingFiles flag parallel to ignoreCorruptFiles
> --
>
> Key: SPARK-22366
> URL: https://issues.apache.org/jira/browse/SPARK-22366
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Minor
> Fix For: 2.3.0
>
>
> There's an existing flag "spark.sql.files.ignoreCorruptFiles" that will 
> quietly ignore attempted reads from files that have been corrupted, but it 
> still allows the query to fail on missing files. Being able to ignore missing 
> files too is useful in some replication scenarios.
> We should add a "spark.sql.files.ignoreMissingFiles" to fill out the 
> functionality.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22366) Support ignoreMissingFiles flag parallel to ignoreCorruptFiles

2018-05-22 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-22366:
-
Description: 
+underlined text+There's an existing flag "spark.sql.files.ignoreCorruptFiles" 
that will quietly ignore attempted reads from files that have been corrupted, 
but it still allows the query to fail on missing files. Being able to ignore 
missing files too is useful in some replication scenarios.

We should add a "spark.sql.files.ignoreMissingFiles" to fill out the 
functionality.

  was:
There's an existing flag "spark.sql.files.ignoreCorruptFiles" that will quietly 
ignore attempted reads from files that have been corrupted, but it still allows 
the query to fail on missing files. Being able to ignore missing files too is 
useful in some replication scenarios.

We should add a "spark.sql.files.ignoreMissingFiles" to fill out the 
functionality.


> Support ignoreMissingFiles flag parallel to ignoreCorruptFiles
> --
>
> Key: SPARK-22366
> URL: https://issues.apache.org/jira/browse/SPARK-22366
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Minor
> Fix For: 2.3.0
>
>
> +underlined text+There's an existing flag 
> "spark.sql.files.ignoreCorruptFiles" that will quietly ignore attempted reads 
> from files that have been corrupted, but it still allows the query to fail on 
> missing files. Being able to ignore missing files too is useful in some 
> replication scenarios.
> We should add a "spark.sql.files.ignoreMissingFiles" to fill out the 
> functionality.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22366) Support ignoreMissingFiles flag parallel to ignoreCorruptFiles

2018-05-22 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-22366:
-
Description: 
There's an existing flag "spark.sql.files.ignoreCorruptFiles" that will quietly 
ignore attempted reads from files that have been corrupted, but it still allows 
the query to fail on missing files. Being able to ignore missing files too is 
useful in some replication scenarios.

We should add a "spark.sql.files.ignoreMissingFiles" to fill out the 
functionality.

  was:
+underlined text+There's an existing flag "spark.sql.files.ignoreCorruptFiles" 
that will quietly ignore attempted reads from files that have been corrupted, 
but it still allows the query to fail on missing files. Being able to ignore 
missing files too is useful in some replication scenarios.

We should add a "spark.sql.files.ignoreMissingFiles" to fill out the 
functionality.


> Support ignoreMissingFiles flag parallel to ignoreCorruptFiles
> --
>
> Key: SPARK-22366
> URL: https://issues.apache.org/jira/browse/SPARK-22366
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Minor
> Fix For: 2.3.0
>
>
> There's an existing flag "spark.sql.files.ignoreCorruptFiles" that will 
> quietly ignore attempted reads from files that have been corrupted, but it 
> still allows the query to fail on missing files. Being able to ignore missing 
> files too is useful in some replication scenarios.
> We should add a "spark.sql.files.ignoreMissingFiles" to fill out the 
> functionality.
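Both flags are plain SQL configurations (ignoreMissingFiles shipped in 2.3.0 per 
the Fix Version above), so they can be toggled at runtime. A minimal sketch, with 
an illustrative read path:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ignore-files-demo").getOrCreate()

# Skip corrupted files instead of failing the query.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
# Also skip files that were listed but have since disappeared (e.g. during replication).
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

df = spark.read.parquet("/data/events")  # illustrative path
print(df.count())
{code}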



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24342) Large Task prior scheduling to Reduce overall execution time

2018-05-22 Thread gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gao updated SPARK-24342:

Priority: Major  (was: Minor)

> Large Task prior scheduling to Reduce overall execution time
> 
>
> Key: SPARK-24342
> URL: https://issues.apache.org/jira/browse/SPARK-24342
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.3.0
>Reporter: gao
>Priority: Major
> Attachments: tasktimespan.PNG
>
>
> When executing a set of concurrent tasks, scheduling the relatively large 
> (long-running) tasks before the small ones can shorten the overall execution 
> time.
> Therefore, Spark needs a new capability that schedules the large tasks of a 
> task set first and the small tasks afterwards.
>    The time span of a task can be estimated from its historical execution 
> time: a task that took long to execute in the past is likely to take long in 
> the future. Record the last execution time together with the task's key in a 
> log file, load that log file on the next run, and use RangePartitioner-style 
> partitioning to schedule the large tasks first, achieving better concurrent 
> task execution.
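The proposal is essentially longest-processing-time-first ordering based on 
recorded durations. A hedged, standalone sketch of that idea (plain Python, not 
Spark scheduler code; the task names and durations are made up):
{code:python}
def lpt_order(task_durations):
    """Sort task keys so the historically longest tasks are scheduled first."""
    return [k for k, _ in sorted(task_durations.items(),
                                 key=lambda kv: kv[1], reverse=True)]

def makespan(order, durations, workers=2):
    """Greedily assign tasks, in the given order, to the least-loaded worker."""
    loads = [0.0] * workers
    for key in order:
        i = loads.index(min(loads))
        loads[i] += durations[key]
    return max(loads)

# Hypothetical per-task durations recorded from a previous run.
history = {"big_join": 8.0, "t1": 3.0, "t2": 3.0, "t3": 3.0,
           "t4": 3.0, "t5": 3.0, "t6": 3.0}

print(makespan(lpt_order(history)[::-1], history))  # small tasks first -> 17.0
print(makespan(lpt_order(history), history))        # large tasks first -> 14.0
{code}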



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24358) createDataFrame in Python should be able to infer bytes type as Binary type

2018-05-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486569#comment-16486569
 ] 

Hyukjin Kwon commented on SPARK-24358:
--

Do you mean bytes in Python 2? That's an alias for str, isn't it?

> createDataFrame in Python should be able to infer bytes type as Binary type
> ---
>
> Key: SPARK-24358
> URL: https://issues.apache.org/jira/browse/SPARK-24358
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joel Croteau
>Priority: Minor
>
> createDataFrame can infer Python's bytearray type as a Binary. Since bytes is 
> just the immutable, hashable version of this same structure, it makes sense 
> for the same thing to apply there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24358) createDataFrame in Python should be able to infer bytes type as Binary type

2018-05-22 Thread Joel Croteau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486581#comment-16486581
 ] 

Joel Croteau commented on SPARK-24358:
--

No, I mean the bytes type in Python 3. This code:
{code:java}
from pyspark.sql import SparkSession, Row

TEST_DATA = [Row(data=b'Test string')]


def init_session():
builder = SparkSession.builder.appName("Test bytes serialization")
return builder.getOrCreate()


def main():
spark = init_session()
frame = spark.createDataFrame(TEST_DATA)
frame.printSchema()
print(frame.collect())


__name__ == '__main__' and main()
{code}
 Fails under Python 3 with this output:
{noformat}
Traceback (most recent call last):
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1068, in _infer_type
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1094, in _infer_schema
TypeError: Can not infer schema for type: 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jcroteau/is/pel_selection/test_row_pair.py", line 18, in 
    __name__ == '__main__' and main()
  File "/home/jcroteau/is/pel_selection/test_row_pair.py", line 13, in main
    frame = spark.createDataFrame(TEST_DATA)
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 689, in createDataFrame
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 410, in _createFromLocal
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 342, in _inferSchemaFromList
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 342, in 
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1096, in _infer_schema
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1096, in 
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1070, in _infer_type
TypeError: not supported type: 
{noformat}
but if I change the data type to bytearray:
{code}
from pyspark.sql import SparkSession, Row

TEST_DATA = [Row(data=bytearray(b'Test string'))]


def init_session():
builder = SparkSession.builder.appName("Use bytearray instead")
return builder.getOrCreate()


def main():
spark = init_session()
frame = spark.createDataFrame(TEST_DATA)
frame.printSchema()
print(frame.collect())


__name__ == '__main__' and main()

{code}
it runs fine:
{noformat}
root
 |-- data: binary (nullable = true)

[Row(data=bytearray(b'Test string'))]
{noformat}
bytes in Python 3 is just an immutable version of bytearray, so it should infer 
the type as binary and serialize it the same way it does with bytearray.

> createDataFrame in Python should be able to infer bytes type as Binary type
> ---
>
> Key: SPARK-24358
> URL: https://issues.apache.org/jira/browse/SPARK-24358
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joel Croteau
>Priority: Minor
>
> createDataFrame can infer Python's bytearray type as a Binary. Since bytes is 
> just the immutable, hashable version of this same structure, it makes sense 
> for the same thing to apply there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24358) createDataFrame in Python should be able to infer bytes type as Binary type

2018-05-22 Thread Joel Croteau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486581#comment-16486581
 ] 

Joel Croteau edited comment on SPARK-24358 at 5/23/18 1:47 AM:
---

No, I mean the bytes type in Python 3. This code:
{code:java}
from pyspark.sql import SparkSession, Row

TEST_DATA = [Row(data=b'Test string')]


def init_session():
builder = SparkSession.builder.appName("Test bytes serialization")
return builder.getOrCreate()


def main():
spark = init_session()
frame = spark.createDataFrame(TEST_DATA)
frame.printSchema()
print(frame.collect())


__name__ == '__main__' and main()
{code}
 Fails under Python 3 with this output:
{noformat}
Traceback (most recent call last):
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1068, in _infer_type
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1094, in _infer_schema
TypeError: Can not infer schema for type: 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jcroteau/is/pel_selection/test_row_pair.py", line 18, in 
    __name__ == '__main__' and main()
  File "/home/jcroteau/is/pel_selection/test_row_pair.py", line 13, in main
    frame = spark.createDataFrame(TEST_DATA)
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 689, in createDataFrame
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 410, in _createFromLocal
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 342, in _inferSchemaFromList
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 342, in 
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1096, in _infer_schema
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1096, in 
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1070, in _infer_type
TypeError: not supported type: 
{noformat}
but if I change the data type to bytearray:
{code:java}
from pyspark.sql import SparkSession, Row

TEST_DATA = [Row(data=bytearray(b'Test string'))]


def init_session():
builder = SparkSession.builder.appName("Use bytearray instead")
return builder.getOrCreate()


def main():
spark = init_session()
frame = spark.createDataFrame(TEST_DATA)
frame.printSchema()
print(frame.collect())


__name__ == '__main__' and main()

{code}
it runs fine:
{noformat}
root
 |-- data: binary (nullable = true)

[Row(data=bytearray(b'Test string'))]
{noformat}
bytes in Python 3 is just an immutable version of bytearray, so it should infer 
the type as binary and serialize it the same way it does with bytearray.


was (Author: tv4fun):
No, I mean the bytes type in Python 3. This code:
{code:java}
from pyspark.sql import SparkSession, Row

TEST_DATA = [Row(data=b'Test string')]


def init_session():
builder = SparkSession.builder.appName("Test bytes serialization")
return builder.getOrCreate()


def main():
spark = init_session()
frame = spark.createDataFrame(TEST_DATA)
frame.printSchema()
print(frame.collect())


__name__ == '__main__' and main()
{code}
 Fails under Python 3 with this output:
{noformat}
Traceback (most recent call last):
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1068, in _infer_type
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1094, in _infer_schema
TypeError: Can not infer schema for type: 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jcroteau/is/pel_selection/test_row_pair.py", line 18, in 
    __name__ == '__main__' and main()
  File "/home/jcroteau/is/pel_selection/test_row_pair.py", line 13, in main
    frame = spark.createDataFrame(TEST_DATA)
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 689, in createDataFrame
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 410, in _createFromLocal
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 342, in _inferSchemaFromList
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 342, in 
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1096, in _infer_schema
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1096, in 
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1070, in _infer_type
TypeError: not supported type: 
{noformat}
but if 

[jira] [Commented] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer

2018-05-22 Thread Misha Dmitriev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486580#comment-16486580
 ] 

Misha Dmitriev commented on SPARK-24356:


I plan to work on this feature.

> Duplicate strings in File.path managed by FileSegmentManagedBuffer
> --
>
> Key: SPARK-24356
> URL: https://issues.apache.org/jira/browse/SPARK-24356
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
>Reporter: Misha Dmitriev
>Priority: Major
>
> I recently analyzed a heap dump of Yarn Node Manager that was suffering from 
> high GC pressure due to high object churn. Analysis was done with the jxray 
> tool ([www.jxray.com|http://www.jxray.com/]) that checks a heap dump for a 
> number of well-known memory issues. One problem that it found in this dump is 
> 19.5% of memory wasted due to duplicate strings. Of these duplicates, more 
> than a half come from {{FileInputStream.path}} and {{File.path}}. All the 
> {{FileInputStream}} objects that JXRay shows are garbage - looks like they 
> are used for a very short period and then discarded (I guess there is a 
> separate question of whether that's a good pattern). But {{File}} instances 
> are traceable to 
> {{org.apache.spark.network.buffer.FileSegmentManagedBuffer.file}} field. Here 
> is the full reference chain:
>  
> {code:java}
> ↖java.io.File.path
> ↖org.apache.spark.network.buffer.FileSegmentManagedBuffer.file
> ↖{j.u.ArrayList}
> ↖j.u.ArrayList$Itr.this$0
> ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.buffers
> ↖{java.util.concurrent.ConcurrentHashMap}.values
> ↖org.apache.spark.network.server.OneForOneStreamManager.streams
> ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
> ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
> ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance
> {code}
>  
> Values of these {{File.path}}'s and {{FileInputStream.path}}'s look very 
> similar, so I think {{FileInputStream}}s are generated by the 
> {{FileSegmentManagedBuffer}} code. Instances of {{File}}, in turn, likely 
> come from 
> [https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L258-L263]
>  
> To avoid duplicate strings in {{File.path}}'s in this case, it is suggested 
> that in the above code we create a File with a complete, normalized pathname, 
> that has been already interned. This will prevent the code inside 
> {{java.io.File}} from modifying this string, and thus it will use the 
> interned copy, and will pass it to FileInputStream. Essentially the current 
> line
> {code:java}
> return new File(new File(localDir, String.format("%02x", subDirId)), 
> filename);{code}
> should be replaced with something like
> {code:java}
> String pathname = localDir + File.separator + String.format(...) + 
> File.separator + filename;
> pathname = fileSystem.normalize(pathname).intern();
> return new File(pathname);{code}
>  
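A loose Python analogy of the normalize-then-intern idea sketched above (the real 
fix would live in the Java resolver; the directory layout here is made up):
{code:python}
import os
import sys

def get_shuffle_file(local_dir, sub_dir_id, filename):
    # Build the complete, normalized pathname first, then intern it so that
    # repeated lookups of the same shuffle file share a single string object.
    pathname = os.path.join(local_dir, "%02x" % sub_dir_id, filename)
    return sys.intern(os.path.normpath(pathname))

a = get_shuffle_file("/tmp/blockmgr", 3, "shuffle_0_0_0.data")
b = get_shuffle_file("/tmp/blockmgr", 3, "shuffle_0_0_0.data")
print(a is b)  # True: both calls return the same interned string
{code}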



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24357) createDataFrame in Python infers large integers as long type and then fails silently when converting them

2018-05-22 Thread Joel Croteau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486571#comment-16486571
 ] 

Joel Croteau commented on SPARK-24357:
--

Fair enough, here is some code to reproduce it:
{code:python}
from pyspark.sql import SparkSession, Row

TEST_DATA = [Row(data=1 << 65)]


def init_session():
builder = SparkSession.builder.appName("Demonstrate integer overflow")
return builder.getOrCreate()


def main():
spark = init_session()
frame = spark.createDataFrame(TEST_DATA)
frame.printSchema()
print(frame.collect())


__name__ == '__main__' and main()

{code}
This should either infer a type that can hold the 1 << 65 value from TEST_DATA, 
or produce a runtime error about inferring the schema or serializing the data. 
This is the actual output:
{noformat}
root
 |-- data: long (nullable = true)

[Row(data=None)]
{noformat}

> createDataFrame in Python infers large integers as long type and then fails 
> silently when converting them
> -
>
> Key: SPARK-24357
> URL: https://issues.apache.org/jira/browse/SPARK-24357
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joel Croteau
>Priority: Major
>
> When inferring the schema type of an RDD passed to createDataFrame, PySpark 
> SQL will infer any integral type as a LongType, which is a 64-bit integer, 
> without actually checking whether the values will fit into a 64-bit slot. If 
> the values are larger than 64 bits, then when pickled and unpickled in Java, 
> Unpickler will convert them to BigIntegers. When applySchemaToPythonRDD is 
> called, it will ignore the BigInteger type and return Null. This results in 
> any large integers in the resulting DataFrame being silently converted to 
> None. This can create some very surprising and difficult to debug behavior, 
> in particular if you are not aware of this limitation. There should either be 
> a runtime error at some point in this conversion chain, or else _infer_type 
> should infer larger integers as DecimalType with appropriate precision, or as 
> BinaryType. The former would be less convenient, but the latter may be 
> problematic to implement in practice. In any case, we should stop silently 
> converting large integers to None.
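Until the inference behavior changes, a hedged workaround sketch: supplying an 
explicit DecimalType schema (and decimal.Decimal values) keeps the large integer 
intact instead of silently becoming None. The app and column names are illustrative:
{code:python}
from decimal import Decimal

from pyspark.sql import SparkSession
from pyspark.sql.types import DecimalType, StructField, StructType

spark = SparkSession.builder.appName("bigint-workaround").getOrCreate()

schema = StructType([StructField("data", DecimalType(38, 0), True)])
frame = spark.createDataFrame([(Decimal(1 << 65),)], schema=schema)
frame.printSchema()
print(frame.collect())  # [Row(data=Decimal('36893488147419103232'))]
{code}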



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24358) createDataFrame in Python 3 should be able to infer bytes type as Binary type

2018-05-22 Thread Joel Croteau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486594#comment-16486594
 ] 

Joel Croteau commented on SPARK-24358:
--

Done.

> createDataFrame in Python 3 should be able to infer bytes type as Binary type
> -
>
> Key: SPARK-24358
> URL: https://issues.apache.org/jira/browse/SPARK-24358
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joel Croteau
>Priority: Minor
>  Labels: Python3
>
> createDataFrame can infer Python 3's bytearray type as a Binary. Since bytes 
> is just the immutable, hashable version of this same structure, it makes 
> sense for the same thing to apply there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24361) Polish code block manipulation API

2018-05-22 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-24361:
---

 Summary: Polish code block manipulation API
 Key: SPARK-24361
 URL: https://issues.apache.org/jira/browse/SPARK-24361
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Liang-Chi Hsieh


The current code block manipulation API is immature and hacky. We should have a 
formal API to manipulate code blocks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24342) Large Task prior scheduling to Reduce overall execution time

2018-05-22 Thread gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gao updated SPARK-24342:

Component/s: (was: Optimizer)
 Spark Core

> Large Task prior scheduling to Reduce overall execution time
> 
>
> Key: SPARK-24342
> URL: https://issues.apache.org/jira/browse/SPARK-24342
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: gao
>Priority: Major
> Attachments: tasktimespan.PNG
>
>
> When executing a set of concurrent tasks, scheduling the relatively large 
> (long-running) tasks before the small ones can shorten the overall execution 
> time.
> Therefore, Spark needs a new capability that schedules the large tasks of a 
> task set first and the small tasks afterwards.
>    The time span of a task can be estimated from its historical execution 
> time: a task that took long to execute in the past is likely to take long in 
> the future. Record the last execution time together with the task's key in a 
> log file, load that log file on the next run, and use RangePartitioner-style 
> partitioning to schedule the large tasks first, achieving better concurrent 
> task execution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-24349) obtainDelegationTokens() exits JVM if Driver use JDBC instead of using metastore

2018-05-22 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin closed SPARK-24349.
--

> obtainDelegationTokens() exits JVM if Driver use JDBC instead of using 
> metastore 
> -
>
> Key: SPARK-24349
> URL: https://issues.apache.org/jira/browse/SPARK-24349
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Lantao Jin
>Priority: Major
>
> In [SPARK-23639|https://issues.apache.org/jira/browse/SPARK-23639], using 
> --proxy-user to impersonate invokes obtainDelegationTokens(). However, if the 
> current settings connect to the DB directly via JDBC instead of via RPC to the 
> metastore, it fails with
> {code}
> WARN HiveConf: HiveConf of name hive.server2.enable.impersonation does not 
> exist
> Exception in thread "main" java.lang.IllegalArgumentException: requirement 
> failed: Hive metastore uri undefined
> at scala.Predef$.require(Predef.scala:224)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveCredentialProvider.obtainCredentials(HiveCredentialProvider.scala:73)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:56)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:288)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:137)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:169)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:167)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:167)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 18/05/22 05:24:16 INFO ShutdownHookManager: Shutdown hook called
> 18/05/22 05:24:16 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-b63ad788-1a47-4326-9972-c4fde1dc19c3
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24349) obtainDelegationTokens() exits JVM if Driver use JDBC instead of using metastore

2018-05-22 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin resolved SPARK-24349.

Resolution: Not A Problem

delegationTokensRequired has been checked in SparkSQLCLIDriver.scala

> obtainDelegationTokens() exits JVM if Driver use JDBC instead of using 
> metastore 
> -
>
> Key: SPARK-24349
> URL: https://issues.apache.org/jira/browse/SPARK-24349
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Lantao Jin
>Priority: Major
>
> In [SPARK-23639|https://issues.apache.org/jira/browse/SPARK-23639], using 
> --proxy-user to impersonate invokes obtainDelegationTokens(). However, if the 
> current settings connect to the DB directly via JDBC instead of via RPC to the 
> metastore, it fails with
> {code}
> WARN HiveConf: HiveConf of name hive.server2.enable.impersonation does not 
> exist
> Exception in thread "main" java.lang.IllegalArgumentException: requirement 
> failed: Hive metastore uri undefined
> at scala.Predef$.require(Predef.scala:224)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveCredentialProvider.obtainCredentials(HiveCredentialProvider.scala:73)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:56)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:288)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:137)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:169)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:167)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:167)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 18/05/22 05:24:16 INFO ShutdownHookManager: Shutdown hook called
> 18/05/22 05:24:16 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-b63ad788-1a47-4326-9972-c4fde1dc19c3
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22055) Port release scripts

2018-05-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486728#comment-16486728
 ] 

Felix Cheung commented on SPARK-22055:
--

interesting - I'd definitely be happy to help.

do you have it scripted to inject the signing key into the docker image?

 

> Port release scripts
> 
>
> Key: SPARK-22055
> URL: https://issues.apache.org/jira/browse/SPARK-22055
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: holdenk
>Priority: Blocker
>
> The current Jenkins jobs are generated from scripts in a private repo. We 
> should port these to enable changes like SPARK-22054 .



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24324) UserDefinedFunction mixes column labels

2018-05-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486565#comment-16486565
 ] 

Hyukjin Kwon commented on SPARK-24324:
--

Ah, I meant that a shorter reproducer would make it easier for others to take a look.

> UserDefinedFunction mixes column labels
> ---
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>  
> Java:
> {noformat}
> $ java -version 
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.4 LTS
> Release:  16.04
> Codename: xenial
> {noformat}
>Reporter: Cristian Consonni
>Priority: Major
>
> I am working on Wikipedia page views (see [task T188041 on Wikimedia's 
> Phabricator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's 
> say that these are the data:
> {noformat}
>
> {noformat}
> For each combination of (lang, page, day(timestamp)) I need to transform  the 
>  views for each hour:
> {noformat}
> 00:00 -> A
> 01:00 -> B
> ...
> {noformat}
> and concatenate the number of views for that hour.  So, if a page got 5 views 
> at 00:00 and 7 views at 01:00 it would become:
> {noformat}
> A5B7
> {noformat}
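A minimal plain-Python sketch of the intended encoding described above, independent 
of the UDF in question (the function name is made up):
{code:python}
def encode_hours(hour_views):
    """Map hour 0-23 to a letter (00:00 -> A, 01:00 -> B, ...) and append its views."""
    return "".join("%s%d" % (chr(ord("A") + hour), views)
                   for hour, views in sorted(hour_views.items()))

print(encode_hours({0: 5, 1: 7}))  # A5B7
{code}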
>  
> I have written a UDF called {code:python}concat_hours{code}
> However, the function is mixing up the columns and I am not sure what is going 
> on. Below is a minimal complete example that reproduces the issue on my system 
> (the details of my environment are above).
> {code:python}
> #!/usr/bin/env python3
> # coding: utf-8
> input_data = b"""en Albert_Camus 20071210-00 150
> en Albert_Camus 20071210-01 148
> en Albert_Camus 20071210-02 197
> en Albert_Camus 20071211-20 145
> en Albert_Camus 20071211-21 131
> en Albert_Camus 20071211-22 154
> en Albert_Camus 20071211-230001 142
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-040001 1
> en Albert_Caquot 20071211-06 1
> en Albert_Caquot 20071211-08 1
> en Albert_Caquot 20071211-15 3
> en Albert_Caquot 20071211-21 1"""
> import tempfile
> fp = tempfile.NamedTemporaryFile()
> fp.write(input_data)
> fp.seek(0)
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql.types import StructType, StructField
> from pyspark.sql.types import StringType, IntegerType, TimestampType
> from pyspark.sql import functions
> sc = pyspark.SparkContext(appName="udf_example")
> sqlctx = pyspark.SQLContext(sc)
> schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("timestamp", TimestampType(), False),
>  StructField("views", IntegerType(), False)])
> df = sqlctx.read.csv(fp.name,
>  header=False,
>  schema=schema,
>  timestampFormat="MMdd-HHmmss",
>  sep=' ')
> df.count()
> df.dtypes
> df.show()
> new_schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("day", StringType(), False),
>  StructField("enc", StringType(), False)])
> from 
