[jira] [Updated] (SPARK-23295) Exclude warning messages when generating versions in make-distribution.sh
[ https://issues.apache.org/jira/browse/SPARK-23295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-23295: - Description: When we specify a wrong profile while making a Spark distribution, such as -Phadoop1000, we get an oddly named package like: spark-[WARNING] The requested profile "hadoop1000" could not be activated because it does not exist.-bin-hadoop-2.7.tgz which should actually be `"spark-$VERSION-bin-$NAME.tgz"` was: When we specify a wrong profile while making a Spark distribution, such as `-Phadoop1000`, we get an oddly named package like `spark-[WARNING] The requested profile "hadoop1000" could not be activated because it does not exist.-bin-hadoop-2.7.tgz`, which should actually be `"spark-$VERSION-bin-$NAME.tgz"` > Exclude warning messages when generating versions in make-distribution.sh > > > Key: SPARK-23295 > URL: https://issues.apache.org/jira/browse/SPARK-23295 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.2.1 >Reporter: Kent Yao >Priority: Major > > When we specify a wrong profile while making a Spark distribution, such as > -Phadoop1000, we get an oddly named package like: > spark-[WARNING] The requested profile "hadoop1000" could not be activated > because it does not exist.-bin-hadoop-2.7.tgz > which should actually be `"spark-$VERSION-bin-$NAME.tgz"` -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23295) Exclude warning messages when generating versions in make-distribution.sh
[ https://issues.apache.org/jira/browse/SPARK-23295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23295: Assignee: Apache Spark > Exclude warning messages when generating versions in make-distribution.sh > > > Key: SPARK-23295 > URL: https://issues.apache.org/jira/browse/SPARK-23295 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.2.1 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > When we specify a wrong profile while making a Spark distribution, such as > `-Phadoop1000`, we get an oddly named package like `spark-[WARNING] The > requested profile "hadoop1000" could not be activated because it does not > exist.-bin-hadoop-2.7.tgz`, which should actually be > `"spark-$VERSION-bin-$NAME.tgz"` -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23295) Exclude warning messages when generating versions in make-distribution.sh
[ https://issues.apache.org/jira/browse/SPARK-23295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348146#comment-16348146 ] Apache Spark commented on SPARK-23295: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/20469 > Exclude warning messages when generating versions in make-distribution.sh > > > Key: SPARK-23295 > URL: https://issues.apache.org/jira/browse/SPARK-23295 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.2.1 >Reporter: Kent Yao >Priority: Major > > When we specify a wrong profile while making a Spark distribution, such as > `-Phadoop1000`, we get an oddly named package like `spark-[WARNING] The > requested profile "hadoop1000" could not be activated because it does not > exist.-bin-hadoop-2.7.tgz`, which should actually be > `"spark-$VERSION-bin-$NAME.tgz"` -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23295) Exclude warning messages when generating versions in make-distribution.sh
[ https://issues.apache.org/jira/browse/SPARK-23295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23295: Assignee: (was: Apache Spark) > Exclude warning messages when generating versions in make-distribution.sh > > > Key: SPARK-23295 > URL: https://issues.apache.org/jira/browse/SPARK-23295 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.2.1 >Reporter: Kent Yao >Priority: Major > > When we specify a wrong profile while making a Spark distribution, such as > `-Phadoop1000`, we get an oddly named package like `spark-[WARNING] The > requested profile "hadoop1000" could not be activated because it does not > exist.-bin-hadoop-2.7.tgz`, which should actually be > `"spark-$VERSION-bin-$NAME.tgz"` -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23284) Document the behavior of several ColumnVector get APIs when accessing a null slot
[ https://issues.apache.org/jira/browse/SPARK-23284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348145#comment-16348145 ] Wenchen Fan commented on SPARK-23284: - Since this defines the "return null" behavior for `ColumnVector` implementations, it's good to get this in before the 2.3 release. But technically Spark won't call `ColumnVector.getXXX` if that slot is null, so it's OK to leave it to 2.4. Thus I'm not going to mark this as a blocker. cc [~sameerag] > Document the behavior of several ColumnVector get APIs when accessing a null slot > > > Key: SPARK-23284 > URL: https://issues.apache.org/jira/browse/SPARK-23284 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > We should clearly document the behavior of some ColumnVector get APIs, such as > getBinary, getStruct, getArray, etc., when accessing a null slot. Those APIs > should return null if the slot is null. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
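As an illustration of the contract being documented, here is a minimal Scala sketch of the caller-side pattern, assuming the public org.apache.spark.sql.vectorized.ColumnVector API (the readBinary helper is hypothetical): a caller either checks isNullAt first or treats a null return from a get API as a null slot.
{code:scala}
import org.apache.spark.sql.vectorized.ColumnVector

// Defensive read: check isNullAt before calling a get API, or rely on the
// documented behavior that getBinary returns null for a null slot.
def readBinary(col: ColumnVector, rowId: Int): Option[Array[Byte]] =
  if (col.isNullAt(rowId)) None else Option(col.getBinary(rowId))
{code}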
[jira] [Created] (SPARK-23295) Exclude warning messages when generating versions in make-distribution.sh
Kent Yao created SPARK-23295: Summary: Exclude warning messages when generating versions in make-distribution.sh Key: SPARK-23295 URL: https://issues.apache.org/jira/browse/SPARK-23295 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.2.1 Reporter: Kent Yao When we specify a wrong profile while making a Spark distribution, such as `-Phadoop1000`, we get an oddly named package like `spark-[WARNING] The requested profile "hadoop1000" could not be activated because it does not exist.-bin-hadoop-2.7.tgz`, which should actually be `"spark-$VERSION-bin-$NAME.tgz"` -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23284) Document the behavior of several ColumnVector get APIs when accessing a null slot
[ https://issues.apache.org/jira/browse/SPARK-23284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-23284: Target Version/s: 2.3.0 > Document the behavior of several ColumnVector get APIs when accessing a null slot > > > Key: SPARK-23284 > URL: https://issues.apache.org/jira/browse/SPARK-23284 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > We should clearly document the behavior of some ColumnVector get APIs, such as > getBinary, getStruct, getArray, etc., when accessing a null slot. Those APIs > should return null if the slot is null. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23284) Document the behavior of several ColumnVector get APIs when accessing a null slot
[ https://issues.apache.org/jira/browse/SPARK-23284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-23284: Affects Version/s: (was: 2.4.0) 2.3.0 > Document the behavior of several ColumnVector get APIs when accessing a null slot > > > Key: SPARK-23284 > URL: https://issues.apache.org/jira/browse/SPARK-23284 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > We should clearly document the behavior of some ColumnVector get APIs, such as > getBinary, getStruct, getArray, etc., when accessing a null slot. Those APIs > should return null if the slot is null. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23202) Add new API in DataSourceWriter: onDataWriterCommit
[ https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-23202: --- Summary: Add new API in DataSourceWriter: onDataWriterCommit (was: Break down DataSourceV2Writer.commit into two phases) > Add new API in DataSourceWriter: onDataWriterCommit > --- > > Key: SPARK-23202 > URL: https://issues.apache.org/jira/browse/SPARK-23202 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Gengliang Wang >Priority: Major > > Currently, the API DataSourceV2Writer#commit(WriterCommitMessage[]) commits a > writing job with a list of commit messages. > It makes sense in some scenarios, e.g. MicroBatchExecution. > However, on receiving a commit message, the driver can start processing > messages (e.g. persisting messages into files) before all the messages are > collected. > The proposal is to break down DataSourceV2Writer.commit into two phases: > # add(WriterCommitMessage message): Handles a commit message produced by > {@link DataWriter#commit()}. > # commit(): Commits the writing job. > This should make the API more flexible, and more reasonable for implementing > some data sources. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23202) Add new API in DataSourceWriter: onDataWriterCommit
[ https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-23202: --- Description: The current DataSourceWriter API makes it hard to implement {{onTaskCommit(taskCommit: TaskCommitMessage)}} in {{FileCommitProtocol}}. In general, on receiving a commit message, the driver can start processing messages (e.g. persisting messages into files) before all the messages are collected. The proposal is to add a new API: {{add(WriterCommitMessage message)}}: Handles a commit message on receiving it from a successful data writer. This should make the whole DataSourceWriter API compatible with {{FileCommitProtocol}}, and more flexible. There was another, more radical attempt in [#20386|https://github.com/apache/spark/pull/20386]. Creating a new API as in [#20454|https://github.com/apache/spark/pull/20454] is more reasonable. was: The current DataSourceWriter API makes it hard to implement {{onTaskCommit(taskCommit: TaskCommitMessage)}} in {{FileCommitProtocol}}. In general, on receiving a commit message, the driver can start processing messages (e.g. persisting messages into files) before all the messages are collected. The proposal is to add a new API: {{add(WriterCommitMessage message)}}: Handles a commit message on receiving it from a successful data writer. This should make the whole DataSourceWriter API compatible with {{FileCommitProtocol}}, and more flexible. There was another, more radical attempt in [#20386|https://github.com/apache/spark/pull/20386]. Creating a new API is more reasonable. > Add new API in DataSourceWriter: onDataWriterCommit > --- > > Key: SPARK-23202 > URL: https://issues.apache.org/jira/browse/SPARK-23202 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Gengliang Wang >Priority: Major > > The current DataSourceWriter API makes it hard to implement > {{onTaskCommit(taskCommit: TaskCommitMessage)}} in {{FileCommitProtocol}}. > In general, on receiving a commit message, the driver can start processing > messages (e.g. persisting messages into files) before all the messages are > collected. > The proposal is to add a new API: > {{add(WriterCommitMessage message)}}: Handles a commit message on receiving it > from a successful data writer. > This should make the whole DataSourceWriter API compatible with > {{FileCommitProtocol}}, and more flexible. > There was another, more radical attempt in > [#20386|https://github.com/apache/spark/pull/20386]. Creating a new API as in > [#20454|https://github.com/apache/spark/pull/20454] is more reasonable. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23202) Add new API in DataSourceWriter: onDataWriterCommit
[ https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-23202: --- Description: The current DataSourceWriter API makes it hard to implement {{onTaskCommit(taskCommit: TaskCommitMessage)}} in {{FileCommitProtocol}}. In general, on receiving a commit message, the driver can start processing messages (e.g. persisting messages into files) before all the messages are collected. The proposal is to add a new API: {{add(WriterCommitMessage message)}}: Handles a commit message on receiving it from a successful data writer. This should make the whole DataSourceWriter API compatible with {{FileCommitProtocol}}, and more flexible. There was another, more radical attempt in [#20386|https://github.com/apache/spark/pull/20386]. Creating a new API is more reasonable. was: Currently, the API DataSourceV2Writer#commit(WriterCommitMessage[]) commits a writing job with a list of commit messages. It makes sense in some scenarios, e.g. MicroBatchExecution. However, on receiving a commit message, the driver can start processing messages (e.g. persisting messages into files) before all the messages are collected. The proposal is to break down DataSourceV2Writer.commit into two phases: # add(WriterCommitMessage message): Handles a commit message produced by {@link DataWriter#commit()}. # commit(): Commits the writing job. This should make the API more flexible, and more reasonable for implementing some data sources. > Add new API in DataSourceWriter: onDataWriterCommit > --- > > Key: SPARK-23202 > URL: https://issues.apache.org/jira/browse/SPARK-23202 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Gengliang Wang >Priority: Major > > The current DataSourceWriter API makes it hard to implement > {{onTaskCommit(taskCommit: TaskCommitMessage)}} in {{FileCommitProtocol}}. > In general, on receiving a commit message, the driver can start processing > messages (e.g. persisting messages into files) before all the messages are > collected. > The proposal is to add a new API: > {{add(WriterCommitMessage message)}}: Handles a commit message on receiving it > from a successful data writer. > This should make the whole DataSourceWriter API compatible with > {{FileCommitProtocol}}, and more flexible. > There was another, more radical attempt in > [#20386|https://github.com/apache/spark/pull/20386]. Creating a new API is > more reasonable. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
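To make the shape of the proposal concrete, a minimal Scala sketch of the two-phase flow (the trait name here is illustrative, not the final API; WriterCommitMessage is the existing v2 writer message type, and the method eventually added was named onDataWriterCommit):
{code:scala}
import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage

trait TwoPhaseWriteJob {
  // Phase 1: called once per successful data writer, as each commit
  // message reaches the driver, so messages can be processed incrementally.
  def onDataWriterCommit(message: WriterCommitMessage): Unit
  // Phase 2: commits the whole writing job after all messages are handled.
  def commit(messages: Array[WriterCommitMessage]): Unit
}
{code}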
[jira] [Updated] (SPARK-23202) Break down DataSourceV2Writer.commit into two phases
[ https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23202: Priority: Major (was: Blocker) > Break down DataSourceV2Writer.commit into two phases > --- > > Key: SPARK-23202 > URL: https://issues.apache.org/jira/browse/SPARK-23202 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Gengliang Wang >Priority: Major > > Currently, the API DataSourceV2Writer#commit(WriterCommitMessage[]) commits a > writing job with a list of commit messages. > It makes sense in some scenarios, e.g. MicroBatchExecution. > However, on receiving a commit message, the driver can start processing > messages (e.g. persisting messages into files) before all the messages are > collected. > The proposal is to break down DataSourceV2Writer.commit into two phases: > # add(WriterCommitMessage message): Handles a commit message produced by > {@link DataWriter#commit()}. > # commit(): Commits the writing job. > This should make the API more flexible, and more reasonable for implementing > some data sources. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23202) Break down DataSourceV2Writer.commit into two phases
[ https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23202: Target Version/s: 2.4.0 (was: 2.3.0) > Break down DataSourceV2Writer.commit into two phases > --- > > Key: SPARK-23202 > URL: https://issues.apache.org/jira/browse/SPARK-23202 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Gengliang Wang >Priority: Major > > Currently, the API DataSourceV2Writer#commit(WriterCommitMessage[]) commits a > writing job with a list of commit messages. > It makes sense in some scenarios, e.g. MicroBatchExecution. > However, on receiving a commit message, the driver can start processing > messages (e.g. persisting messages into files) before all the messages are > collected. > The proposal is to break down DataSourceV2Writer.commit into two phases: > # add(WriterCommitMessage message): Handles a commit message produced by > {@link DataWriter#commit()}. > # commit(): Commits the writing job. > This should make the API more flexible, and more reasonable for implementing > some data sources. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23280) add map type support to ColumnVector
[ https://issues.apache.org/jira/browse/SPARK-23280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348109#comment-16348109 ] Apache Spark commented on SPARK-23280: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/20468 > add map type support to ColumnVector > > > Key: SPARK-23280 > URL: https://issues.apache.org/jira/browse/SPARK-23280 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10884) Support prediction on single instance for regression and classification related models
[ https://issues.apache.org/jira/browse/SPARK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348108#comment-16348108 ] Weichen Xu commented on SPARK-10884: [~mingma] I hope so, but the code is now frozen and the release is in QA. Sorry about that! > Support prediction on single instance for regression and classification > related models > -- > > Key: SPARK-10884 > URL: https://issues.apache.org/jira/browse/SPARK-10884 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Major > Labels: 2.2.0 > > Support prediction on single instance for regression and classification > related models (i.e., PredictionModel, ClassificationModel and their > subclasses). > Add corresponding test cases. > See parent issue for more details. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10884) Support prediction on single instance for regression and classification related models
[ https://issues.apache.org/jira/browse/SPARK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348094#comment-16348094 ] Ming Ma commented on SPARK-10884: - Cool. Could we get it in for 2.3? This will bring Spark one step closer to providing real-time prediction, and getting it into 2.3 will make it available to more applications sooner. Also, the patch looks pretty straightforward and the risk seems pretty low. > Support prediction on single instance for regression and classification > related models > -- > > Key: SPARK-10884 > URL: https://issues.apache.org/jira/browse/SPARK-10884 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Major > Labels: 2.2.0 > > Support prediction on single instance for regression and classification > related models (i.e., PredictionModel, ClassificationModel and their > subclasses). > Add corresponding test cases. > See parent issue for more details. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
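For context, a minimal sketch of what single-instance prediction looks like for callers, assuming a fitted LogisticRegressionModel and the public predict-on-Vector method this ticket proposes to expose (today the equivalent requires building a one-row DataFrame):
{code:scala}
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.linalg.Vectors

// Score a single feature vector directly, without a DataFrame round trip.
def scoreOne(model: LogisticRegressionModel): Double =
  model.predict(Vectors.dense(0.5, 1.2, -0.3))
{code}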
[jira] [Updated] (SPARK-23294) Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated
[ https://issues.apache.org/jira/browse/SPARK-23294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravinder Matte updated SPARK-23294: --- Labels: ConsoleSink RateSource backpressure maxRate (was: ConsoleSink RateSource) > Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated > --- > > Key: SPARK-23294 > URL: https://issues.apache.org/jira/browse/SPARK-23294 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1 >Reporter: Ravinder Matte >Assignee: Takeshi Yamamuro >Priority: Minor > Labels: ConsoleSink, RateSource, backpressure, maxRate > > Using the following configs while building the SparkSession: > ("spark.streaming.backpressure.enabled", "true") > ("spark.streaming.receiver.maxRate", "100") > ("spark.streaming.backpressure.initialRate", "100") > > Source: Rate source with the following options: > rowsPerSecond=10 > Sink: Console sink. > > I expect the processing rate to be capped at 100 rows per second, but maxRate > is ignored and the streaming job processes at a rate of 10. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23294) Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated
[ https://issues.apache.org/jira/browse/SPARK-23294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravinder Matte updated SPARK-23294: --- Description: Using the following configs while building the SparkSession: ("spark.streaming.backpressure.enabled", "true") ("spark.streaming.receiver.maxRate", "100") ("spark.streaming.backpressure.initialRate", "100") Source: Rate source with the following options: rowsPerSecond=10 Sink: Console sink. I expect the processing rate to be capped at 100 rows per second, but maxRate is ignored and the streaming job processes at a rate of 10. was: I am calling spark-submit passing maxRate; I have a single Kinesis receiver, and batches of 1s: spark-submit --conf spark.streaming.receiver.maxRate=10 However, a single batch can greatly exceed the established maxRate, e.g. I'm getting 300 records. It looks like Kinesis is completely ignoring the spark.streaming.receiver.maxRate configuration. If you look inside KinesisReceiver.onStart, you see: val kinesisClientLibConfiguration = new KinesisClientLibConfiguration(checkpointAppName, streamName, awsCredProvider, workerId) .withKinesisEndpoint(endpointUrl) .withInitialPositionInStream(initialPositionInStream) .withTaskBackoffTimeMillis(500) .withRegionName(regionName) This constructor ends up calling another constructor which has a lot of default values for the configuration. One of those values is DEFAULT_MAX_RECORDS, which is constantly set to 10,000 records. > Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated > --- > > Key: SPARK-23294 > URL: https://issues.apache.org/jira/browse/SPARK-23294 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1 >Reporter: Ravinder Matte >Assignee: Takeshi Yamamuro >Priority: Minor > Labels: ConsoleSink, RateSource > > Using the following configs while building the SparkSession: > ("spark.streaming.backpressure.enabled", "true") > ("spark.streaming.receiver.maxRate", "100") > ("spark.streaming.backpressure.initialRate", "100") > > Source: Rate source with the following options: > rowsPerSecond=10 > Sink: Console sink. > > I expect the processing rate to be capped at 100 rows per second, but maxRate > is ignored and the streaming job processes at a rate of 10. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
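For reference, a minimal runnable sketch of the setup described above (Structured Streaming rate source feeding the console sink). Note that the spark.streaming.* keys quoted in the report configure the old DStream receiver path; for the rate source, throughput is governed by rowsPerSecond itself:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("rate-to-console").getOrCreate()

// The rate source emits rowsPerSecond rows per second;
// spark.streaming.receiver.maxRate does not apply to this source.
val rate = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
rate.writeStream.format("console").start().awaitTermination()
{code}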
[jira] [Updated] (SPARK-23200) Reset configuration when restarting from checkpoints
[ https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-23200: Fix Version/s: (was: 2.4.0) > Reset configuration when restarting from checkpoints > > > Key: SPARK-23200 > URL: https://issues.apache.org/jira/browse/SPARK-23200 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Major > > Streaming workloads and restarting from checkpoints may need additional > changes, i.e. resetting properties - see > https://github.com/apache-spark-on-k8s/spark/pull/516 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23294) Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated
[ https://issues.apache.org/jira/browse/SPARK-23294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravinder Matte updated SPARK-23294: --- Labels: ConsoleSink RateSource (was: kinesis) > Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated > --- > > Key: SPARK-23294 > URL: https://issues.apache.org/jira/browse/SPARK-23294 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1 >Reporter: Ravinder Matte >Assignee: Takeshi Yamamuro >Priority: Minor > Labels: ConsoleSink, RateSource > > I am calling spark-submit passing maxRate; I have a single Kinesis receiver, > and batches of 1s: > spark-submit --conf spark.streaming.receiver.maxRate=10 > However, a single batch can greatly exceed the established maxRate, e.g. I'm > getting 300 records. > It looks like Kinesis is completely ignoring the > spark.streaming.receiver.maxRate configuration. > If you look inside KinesisReceiver.onStart, you see: > val kinesisClientLibConfiguration = > new KinesisClientLibConfiguration(checkpointAppName, streamName, > awsCredProvider, workerId) > .withKinesisEndpoint(endpointUrl) > .withInitialPositionInStream(initialPositionInStream) > .withTaskBackoffTimeMillis(500) > .withRegionName(regionName) > This constructor ends up calling another constructor which has a lot of > default values for the configuration. One of those values is > DEFAULT_MAX_RECORDS, which is constantly set to 10,000 records. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23294) Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated
[ https://issues.apache.org/jira/browse/SPARK-23294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravinder Matte updated SPARK-23294: --- Fix Version/s: (was: 2.2.0) > Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated > --- > > Key: SPARK-23294 > URL: https://issues.apache.org/jira/browse/SPARK-23294 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1 >Reporter: Ravinder Matte >Assignee: Takeshi Yamamuro >Priority: Minor > Labels: ConsoleSink, RateSource > > I am calling spark-submit passing maxRate; I have a single Kinesis receiver, > and batches of 1s: > spark-submit --conf spark.streaming.receiver.maxRate=10 > However, a single batch can greatly exceed the established maxRate, e.g. I'm > getting 300 records. > It looks like Kinesis is completely ignoring the > spark.streaming.receiver.maxRate configuration. > If you look inside KinesisReceiver.onStart, you see: > val kinesisClientLibConfiguration = > new KinesisClientLibConfiguration(checkpointAppName, streamName, > awsCredProvider, workerId) > .withKinesisEndpoint(endpointUrl) > .withInitialPositionInStream(initialPositionInStream) > .withTaskBackoffTimeMillis(500) > .withRegionName(regionName) > This constructor ends up calling another constructor which has a lot of > default values for the configuration. One of those values is > DEFAULT_MAX_RECORDS, which is constantly set to 10,000 records. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-23200) Reset configuration when restarting from checkpoints
[ https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao reopened SPARK-23200: - Assignee: (was: Santiago Saavedra) > Reset configuration when restarting from checkpoints > > > Key: SPARK-23200 > URL: https://issues.apache.org/jira/browse/SPARK-23200 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Major > > Streaming workloads and restarting from checkpoints may need additional > changes, i.e. resetting properties - see > https://github.com/apache-spark-on-k8s/spark/pull/516 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
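For context, the restart-from-checkpoint path in question follows the standard getOrCreate pattern sketched below (paths and names are illustrative); on restart, the SparkConf stored in the checkpoint is replayed, which is why cluster-specific properties may need to be reset:
{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("/tmp/checkpoints")  // illustrative checkpoint directory
  ssc
}

// If a checkpoint exists, the context (and its stored configuration) is
// restored from it; otherwise createContext() builds a fresh one.
val ssc = StreamingContext.getOrCreate("/tmp/checkpoints", createContext _)
{code}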
[jira] [Updated] (SPARK-23294) Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated
[ https://issues.apache.org/jira/browse/SPARK-23294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravinder Matte updated SPARK-23294: --- Component/s: (was: DStreams) Structured Streaming > Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated > --- > > Key: SPARK-23294 > URL: https://issues.apache.org/jira/browse/SPARK-23294 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1 >Reporter: Ravinder Matte >Assignee: Takeshi Yamamuro >Priority: Minor > Labels: ConsoleSink, RateSource > > I am calling spark-submit passing maxRate; I have a single Kinesis receiver, > and batches of 1s: > spark-submit --conf spark.streaming.receiver.maxRate=10 > However, a single batch can greatly exceed the established maxRate, e.g. I'm > getting 300 records. > It looks like Kinesis is completely ignoring the > spark.streaming.receiver.maxRate configuration. > If you look inside KinesisReceiver.onStart, you see: > val kinesisClientLibConfiguration = > new KinesisClientLibConfiguration(checkpointAppName, streamName, > awsCredProvider, workerId) > .withKinesisEndpoint(endpointUrl) > .withInitialPositionInStream(initialPositionInStream) > .withTaskBackoffTimeMillis(500) > .withRegionName(regionName) > This constructor ends up calling another constructor which has a lot of > default values for the configuration. One of those values is > DEFAULT_MAX_RECORDS, which is constantly set to 10,000 records. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23294) Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated
[ https://issues.apache.org/jira/browse/SPARK-23294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravinder Matte updated SPARK-23294: --- Affects Version/s: (was: 2.0.2) 2.2.1 > Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated > --- > > Key: SPARK-23294 > URL: https://issues.apache.org/jira/browse/SPARK-23294 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.1 >Reporter: Ravinder Matte >Assignee: Takeshi Yamamuro >Priority: Minor > Labels: kinesis > Fix For: 2.2.0 > > > I am calling spark-submit passing maxRate; I have a single Kinesis receiver, > and batches of 1s: > spark-submit --conf spark.streaming.receiver.maxRate=10 > However, a single batch can greatly exceed the established maxRate, e.g. I'm > getting 300 records. > It looks like Kinesis is completely ignoring the > spark.streaming.receiver.maxRate configuration. > If you look inside KinesisReceiver.onStart, you see: > val kinesisClientLibConfiguration = > new KinesisClientLibConfiguration(checkpointAppName, streamName, > awsCredProvider, workerId) > .withKinesisEndpoint(endpointUrl) > .withInitialPositionInStream(initialPositionInStream) > .withTaskBackoffTimeMillis(500) > .withRegionName(regionName) > This constructor ends up calling another constructor which has a lot of > default values for the configuration. One of those values is > DEFAULT_MAX_RECORDS, which is constantly set to 10,000 records. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23294) Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated
Ravinder Matte created SPARK-23294: -- Summary: Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated Key: SPARK-23294 URL: https://issues.apache.org/jira/browse/SPARK-23294 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 2.0.2 Reporter: Ravinder Matte Assignee: Takeshi Yamamuro Fix For: 2.2.0 I am calling spark-submit passing maxRate; I have a single Kinesis receiver, and batches of 1s: spark-submit --conf spark.streaming.receiver.maxRate=10 However, a single batch can greatly exceed the established maxRate, e.g. I'm getting 300 records. It looks like Kinesis is completely ignoring the spark.streaming.receiver.maxRate configuration. If you look inside KinesisReceiver.onStart, you see: val kinesisClientLibConfiguration = new KinesisClientLibConfiguration(checkpointAppName, streamName, awsCredProvider, workerId) .withKinesisEndpoint(endpointUrl) .withInitialPositionInStream(initialPositionInStream) .withTaskBackoffTimeMillis(500) .withRegionName(regionName) This constructor ends up calling another constructor which has a lot of default values for the configuration. One of those values is DEFAULT_MAX_RECORDS, which is constantly set to 10,000 records. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
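A hedged sketch of the direction the original Kinesis report points at: KinesisClientLibConfiguration exposes withMaxRecords, so the receiver could thread a configured cap into the KCL instead of inheriting the 10,000-record default. The maxRecordsPerFetch value below is an assumed name, and the other identifiers come from the KinesisReceiver snippet quoted above:
{code:scala}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration

val maxRecordsPerFetch = 100  // assumed: derived from spark.streaming.receiver.maxRate
val kinesisClientLibConfiguration =
  new KinesisClientLibConfiguration(checkpointAppName, streamName, awsCredProvider, workerId)
    .withKinesisEndpoint(endpointUrl)
    .withInitialPositionInStream(initialPositionInStream)
    .withTaskBackoffTimeMillis(500)
    .withRegionName(regionName)
    .withMaxRecords(maxRecordsPerFetch)  // override the 10,000-record default
{code}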
[jira] [Updated] (SPARK-23291) SparkR: substr: In a SparkR dataframe, the starting and ending position arguments of "substr" give a wrong result when the position is greater than 1
[ https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-23291: - Shepherd: Felix Cheung (was: Hossein Falaki) > SparkR: substr: In a SparkR dataframe, the starting and ending position > arguments of "substr" give a wrong result when the position is greater > than 1 > -- > > Key: SPARK-23291 > URL: https://issues.apache.org/jira/browse/SPARK-23291 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.1 >Reporter: Narendra >Priority: Major > > Defect description: > - > For example, an input string "2017-12-01" is read into a SparkR dataframe > "df" with column name "col1". > The target is to create a new column named "col2" with the value "12", > which is inside the string. "12" can be extracted with "starting position" > "6" and "ending position" "7" > (the starting position of the first character is considered "1"). > But the code that currently needs to be written is: > > df <- withColumn(df, "col2", substr(df$col1, 7, 8)) > Observe that the first argument of the "substr" API, which indicates the > 'starting position', is given as "7". > Also observe that the second argument of the "substr" API, which indicates > the 'ending position', is given as "8". > I.e., the number that has to be given to indicate a position is > the "actual position + 1". > Expected behavior: > > The code that needs to be written is: > > df <- withColumn(df, "col2", substr(df$col1, 6, 7)) > Note: > --- > This defect is observed only when the starting position is greater than > 1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
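Under the hood, SparkR's substr delegates to the JVM Column.substr, whose second argument is a length rather than an end position, which is a plausible source of the off-by-one described above. For reference, the Scala semantics (1-based start, then length):
{code:scala}
import org.apache.spark.sql.functions.col

// Extract "12" from "2017-12-01": 1-based start position 6, length 2.
val month = col("col1").substr(6, 2)
{code}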
[jira] [Commented] (SPARK-22274) User-defined aggregation functions with pandas udf
[ https://issues.apache.org/jira/browse/SPARK-22274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348033#comment-16348033 ] Apache Spark commented on SPARK-22274: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/20467 > User-defined aggregation functions with pandas udf > -- > > Key: SPARK-22274 > URL: https://issues.apache.org/jira/browse/SPARK-22274 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Li Jin >Assignee: Li Jin >Priority: Major > Fix For: 2.4.0 > > > This function doesn't implement partial aggregation and shuffles all data. A > UDAF that supports partial aggregation is not covered by this Jira. > Example: > {code:java} > @pandas_udf(DoubleType()) > def mean(v): >     return v.mean() > df.groupby('id').apply(mean(df.v1), mean(df.v2)) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23293) data source v2 self join fails
[ https://issues.apache.org/jira/browse/SPARK-23293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348020#comment-16348020 ] Apache Spark commented on SPARK-23293: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/20466 > data source v2 self join fails > -- > > Key: SPARK-23293 > URL: https://issues.apache.org/jira/browse/SPARK-23293 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23293) data source v2 self join fails
[ https://issues.apache.org/jira/browse/SPARK-23293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23293: Assignee: Wenchen Fan (was: Apache Spark) > data source v2 self join fails > -- > > Key: SPARK-23293 > URL: https://issues.apache.org/jira/browse/SPARK-23293 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23293) data source v2 self join fails
[ https://issues.apache.org/jira/browse/SPARK-23293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23293: Assignee: Apache Spark (was: Wenchen Fan) > data source v2 self join fails > -- > > Key: SPARK-23293 > URL: https://issues.apache.org/jira/browse/SPARK-23293 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23293) data source v2 self join fails
Wenchen Fan created SPARK-23293: --- Summary: data source v2 self join fails Key: SPARK-23293 URL: https://issues.apache.org/jira/browse/SPARK-23293 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
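The ticket carries no description, so for context, a minimal repro sketch of the failing shape, assuming a SparkSession spark, a placeholder v2 source class, and a column named i; one plausible cause is that both sides of the self join exposed the same output attribute IDs:
{code:scala}
// Placeholder DataSourceV2 reader; any v2 source exhibits the shape.
val df = spark.read.format("com.example.SimpleDataSourceV2").load()

// Self join over the same v2 relation, the case this ticket addresses.
df.join(df, "i").show()
{code}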
[jira] [Resolved] (SPARK-23188) Make vectorized columnar reader batch size configurable
[ https://issues.apache.org/jira/browse/SPARK-23188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23188. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20361 [https://github.com/apache/spark/pull/20361] > Make vectorized columnar reader batch size configurable > -- > > Key: SPARK-23188 > URL: https://issues.apache.org/jira/browse/SPARK-23188 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
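For anyone looking for the resulting knobs, a usage sketch assuming a SparkSession spark; the config key names below are the ones the linked PR appears to introduce for the Parquet and ORC readers, so treat them as an assumption and verify against your Spark build:
{code:scala}
// Rows per vectorized batch: smaller batches lower peak memory for wide
// schemas at some throughput cost.
spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", 2048)
spark.conf.set("spark.sql.orc.columnarReaderBatchSize", 2048)
{code}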
[jira] [Assigned] (SPARK-23188) Make vectorized columnar reader batch size configurable
[ https://issues.apache.org/jira/browse/SPARK-23188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23188: --- Assignee: Jiang Xingbo > Make vectorized columnar reader batch size configurable > -- > > Key: SPARK-23188 > URL: https://issues.apache.org/jira/browse/SPARK-23188 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23292) python tests related to pandas are skipped
[ https://issues.apache.org/jira/browse/SPARK-23292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348008#comment-16348008 ] Apache Spark commented on SPARK-23292: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/20465 > python tests related to pandas are skipped > -- > > Key: SPARK-23292 > URL: https://issues.apache.org/jira/browse/SPARK-23292 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 >Reporter: Yin Huai >Priority: Blocker > > I was running python tests and found that > [pyspark.sql.tests.GroupbyAggPandasUDFTests.test_unsupported_types|https://github.com/apache/spark/blob/52e00f70663a87b5837235bdf72a3e6f84e11411/python/pyspark/sql/tests.py#L4528-L4548] > does not run with Python 2 because the test uses "assertRaisesRegex" > (supported by Python 3) instead of "assertRaisesRegexp" (supported by Python > 2). However, Spark's Jenkins does not fail because of this issue (see the run > history > [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/]). > After looking into this issue, [it seems the test script skips pandas-related > tests if pandas is not > installed|https://github.com/apache/spark/blob/2ac895be909de7e58e1051dc2a1bba98a25bf4be/python/pyspark/sql/tests.py#L51-L63], > which means that Jenkins does not have pandas installed. > > Since pyarrow-related tests have the same skipping logic, we will need to > check whether Jenkins has pyarrow installed correctly as well. > > Since features using pandas and pyarrow are in 2.3, we should fix the test > issue and make sure all tests pass before we make the release. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23292) python tests related to pandas are skipped
[ https://issues.apache.org/jira/browse/SPARK-23292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23292: Assignee: (was: Apache Spark) > python tests related to pandas are skipped > -- > > Key: SPARK-23292 > URL: https://issues.apache.org/jira/browse/SPARK-23292 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 >Reporter: Yin Huai >Priority: Blocker > > I was running python tests and found that > [pyspark.sql.tests.GroupbyAggPandasUDFTests.test_unsupported_types|https://github.com/apache/spark/blob/52e00f70663a87b5837235bdf72a3e6f84e11411/python/pyspark/sql/tests.py#L4528-L4548] > does not run with Python 2 because the test uses "assertRaisesRegex" > (supported by Python 3) instead of "assertRaisesRegexp" (supported by Python > 2). However, Spark's Jenkins does not fail because of this issue (see the run > history > [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/]). > After looking into this issue, [it seems the test script skips pandas-related > tests if pandas is not > installed|https://github.com/apache/spark/blob/2ac895be909de7e58e1051dc2a1bba98a25bf4be/python/pyspark/sql/tests.py#L51-L63], > which means that Jenkins does not have pandas installed. > > Since pyarrow-related tests have the same skipping logic, we will need to > check whether Jenkins has pyarrow installed correctly as well. > > Since features using pandas and pyarrow are in 2.3, we should fix the test > issue and make sure all tests pass before we make the release. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23292) python tests related to pandas are skipped
[ https://issues.apache.org/jira/browse/SPARK-23292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23292: Assignee: Apache Spark > python tests related to pandas are skipped > -- > > Key: SPARK-23292 > URL: https://issues.apache.org/jira/browse/SPARK-23292 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Blocker > > I was running python tests and found that > [pyspark.sql.tests.GroupbyAggPandasUDFTests.test_unsupported_types|https://github.com/apache/spark/blob/52e00f70663a87b5837235bdf72a3e6f84e11411/python/pyspark/sql/tests.py#L4528-L4548] > does not run with Python 2 because the test uses "assertRaisesRegex" > (supported by Python 3) instead of "assertRaisesRegexp" (supported by Python > 2). However, Spark's Jenkins does not fail because of this issue (see the run > history > [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/]). > After looking into this issue, [it seems the test script skips pandas-related > tests if pandas is not > installed|https://github.com/apache/spark/blob/2ac895be909de7e58e1051dc2a1bba98a25bf4be/python/pyspark/sql/tests.py#L51-L63], > which means that Jenkins does not have pandas installed. > > Since pyarrow-related tests have the same skipping logic, we will need to > check whether Jenkins has pyarrow installed correctly as well. > > Since features using pandas and pyarrow are in 2.3, we should fix the test > issue and make sure all tests pass before we make the release. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10884) Support prediction on single instance for regression and classification related models
[ https://issues.apache.org/jira/browse/SPARK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348009#comment-16348009 ] Weichen Xu commented on SPARK-10884: [~mingma] Sure, but it needs to wait until Spark 2.3 is released. [~yanboliang] > Support prediction on single instance for regression and classification > related models > -- > > Key: SPARK-10884 > URL: https://issues.apache.org/jira/browse/SPARK-10884 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Major > Labels: 2.2.0 > > Support prediction on single instance for regression and classification > related models (i.e., PredictionModel, ClassificationModel and their > subclasses). > Add corresponding test cases. > See parent issue for more details. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23281) Query produces results in incorrect order when a composite order by clause refers to both original columns and aliases
[ https://issues.apache.org/jira/browse/SPARK-23281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-23281: --- Assignee: Dilip Biswal > Query produces results in incorrect order when a composite order by clause > refers to both original columns and aliases > -- > > Key: SPARK-23281 > URL: https://issues.apache.org/jira/browse/SPARK-23281 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dilip Biswal >Assignee: Dilip Biswal >Priority: Major > Fix For: 2.2.2, 2.3.0 > > > Here is the test snippet. > {code} > scala> Seq[(Integer, Integer)]( > | (1, 1), > | (1, 3), > | (2, 3), > | (3, 3), > | (4, null), > | (5, null) > | ).toDF("key", "value").createOrReplaceTempView("src") > scala> sql( > | """ > | |SELECT MAX(value) as value, key as col2 > | |FROM src > | |GROUP BY key > | |ORDER BY value desc, key > | """.stripMargin).show > +-----+----+ > |value|col2| > +-----+----+ > |    3|   3| > |    3|   2| > |    3|   1| > | null|   5| > | null|   4| > +-----+----+ > {code} > Here is the explain output: > {code} > == Parsed Logical Plan == > 'Sort ['value DESC NULLS LAST, 'key ASC NULLS FIRST], true > +- 'Aggregate ['key], ['MAX('value) AS value#9, 'key AS col2#10] >+- 'UnresolvedRelation `src` > == Analyzed Logical Plan == > value: int, col2: int > Project [value#9, col2#10] > +- Sort [value#9 DESC NULLS LAST, col2#10 DESC NULLS LAST], true >+- Aggregate [key#5], [max(value#6) AS value#9, key#5 AS col2#10] > +- SubqueryAlias src > +- Project [_1#2 AS key#5, _2#3 AS value#6] > +- LocalRelation [_1#2, _2#3] > {code} > The sort direction should be ascending for the 2nd column. Instead, it's being > changed > to descending in Analyzer.resolveAggregateFunctions. > The above test case models TPCDS-Q71, and thus we have the same issue in Q71 as > well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23281) Query produces results in incorrect order when a composite order by clause refers to both original columns and aliases
[ https://issues.apache.org/jira/browse/SPARK-23281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23281. - Resolution: Fixed Fix Version/s: 2.3.0 2.2.2 > Query produces results in incorrect order when a composite order by clause > refers to both original columns and aliases > -- > > Key: SPARK-23281 > URL: https://issues.apache.org/jira/browse/SPARK-23281 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dilip Biswal >Priority: Major > Fix For: 2.2.2, 2.3.0 > > > Here is the test snippet. > {code} > scala> Seq[(Integer, Integer)]( > | (1, 1), > | (1, 3), > | (2, 3), > | (3, 3), > | (4, null), > | (5, null) > | ).toDF("key", "value").createOrReplaceTempView("src") > scala> sql( > | """ > | |SELECT MAX(value) as value, key as col2 > | |FROM src > | |GROUP BY key > | |ORDER BY value desc, key > | """.stripMargin).show > +-----+----+ > |value|col2| > +-----+----+ > |    3|   3| > |    3|   2| > |    3|   1| > | null|   5| > | null|   4| > +-----+----+ > {code} > Here is the explain output: > {code} > == Parsed Logical Plan == > 'Sort ['value DESC NULLS LAST, 'key ASC NULLS FIRST], true > +- 'Aggregate ['key], ['MAX('value) AS value#9, 'key AS col2#10] >+- 'UnresolvedRelation `src` > == Analyzed Logical Plan == > value: int, col2: int > Project [value#9, col2#10] > +- Sort [value#9 DESC NULLS LAST, col2#10 DESC NULLS LAST], true >+- Aggregate [key#5], [max(value#6) AS value#9, key#5 AS col2#10] > +- SubqueryAlias src > +- Project [_1#2 AS key#5, _2#3 AS value#6] > +- LocalRelation [_1#2, _2#3] > {code} > The sort direction should be ascending for the 2nd column. Instead, it's being > changed > to descending in Analyzer.resolveAggregateFunctions. > The above test case models TPCDS-Q71, and thus we have the same issue in Q71 as > well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21396) Spark Hive Thriftserver doesn't return UDT field
[ https://issues.apache.org/jira/browse/SPARK-21396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-21396. - Resolution: Fixed Fix Version/s: 2.3.0 > Spark Hive Thriftserver doesn't return UDT field > > > Key: SPARK-21396 > URL: https://issues.apache.org/jira/browse/SPARK-21396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Haopu Wang >Assignee: Ken Tore Tallakstad >Priority: Major > Labels: Hive, ThriftServer2, user-defined-type > Fix For: 2.3.0 > > > I want to query a table with a MLLib Vector field and get below exception. > Can Spark Hive Thriftserver be enhanced to return UDT field? > == > 2017-07-13 13:14:25,435 WARN > [org.apache.hive.service.cli.thrift.ThriftCLIService] > (HiveServer2-Handler-Pool: Thread-18537;) Error fetching results: > java.lang.RuntimeException: scala.MatchError: > org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 (of class > org.apache.spark.ml.linalg.VectorUDT) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83) > at > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36) > at > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59) > at com.sun.proxy.$Proxy29.fetchResults(Unknown Source) > at > org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: scala.MatchError: org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 > (of class org.apache.spark.ml.linalg.VectorUDT) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.addNonNullColumnValue(SparkExecuteStatementOperation.scala:80) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(SparkExecuteStatementOperation.scala:144) > at > org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:220) > at > org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:685) > at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78) > ... 
18 more -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
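The scala.MatchError above originates in SparkExecuteStatementOperation.addNonNullColumnValue, where a match over Catalyst data types has no case for user-defined types such as VectorUDT. A minimal sketch of the kind of fallback the fix needs, assuming UDT values may simply be rendered as strings (an illustration, not the merged patch):
{code}
import org.apache.spark.sql.types._

// Sketch: when converting a column value for the Thrift protocol, fall back to
// a string rendering for UDT columns instead of letting the type match fail.
// Note: UserDefinedType is not public API in Spark 2.x, so a match like this
// has to live inside Spark's own code.
def toThriftValue(value: Any, dataType: DataType): Any = dataType match {
  case IntegerType => value.asInstanceOf[Int]
  case DoubleType  => value.asInstanceOf[Double]
  case StringType  => value.asInstanceOf[String]
  case _: UserDefinedType[_] => value.toString // e.g. an MLlib Vector column
  case _ => value.toString
}
{code}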
[jira] [Assigned] (SPARK-21396) Spark Hive Thriftserver doesn't return UDT field
[ https://issues.apache.org/jira/browse/SPARK-21396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-21396: --- Assignee: Ken Tore Tallakstad > Spark Hive Thriftserver doesn't return UDT field > > > Key: SPARK-21396 > URL: https://issues.apache.org/jira/browse/SPARK-21396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Haopu Wang >Assignee: Ken Tore Tallakstad >Priority: Major > Labels: Hive, ThriftServer2, user-defined-type > Fix For: 2.3.0 > > > I want to query a table with a MLLib Vector field and get below exception. > Can Spark Hive Thriftserver be enhanced to return UDT field? > == > 2017-07-13 13:14:25,435 WARN > [org.apache.hive.service.cli.thrift.ThriftCLIService] > (HiveServer2-Handler-Pool: Thread-18537;) Error fetching results: > java.lang.RuntimeException: scala.MatchError: > org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 (of class > org.apache.spark.ml.linalg.VectorUDT) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83) > at > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36) > at > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59) > at com.sun.proxy.$Proxy29.fetchResults(Unknown Source) > at > org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: scala.MatchError: org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 > (of class org.apache.spark.ml.linalg.VectorUDT) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.addNonNullColumnValue(SparkExecuteStatementOperation.scala:80) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(SparkExecuteStatementOperation.scala:144) > at > org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:220) > at > org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:685) > at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78) > ... 
18 more -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23268) Reorganize packages in data source V2
[ https://issues.apache.org/jira/browse/SPARK-23268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23268. - Resolution: Fixed Assignee: Gengliang Wang Fix Version/s: 2.3.0 > Reorganize packages in data source V2 > - > > Key: SPARK-23268 > URL: https://issues.apache.org/jira/browse/SPARK-23268 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.3.0 > > > 1. create a new package for partitioning/distribution related classes > 2. move streaming related class to package > org.apache.spark.sql.sources.v2.reader/writer.streaming, instead of > org.apache.spark.sql.sources.v2.streaming.reader/writer -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
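To make the second point concrete, the move changes import paths along these lines (ContinuousReader is used here only as an example of an affected streaming class; the exact set of moved classes depends on the final PR):
{code}
// before the reorganization
import org.apache.spark.sql.sources.v2.streaming.reader.ContinuousReader
// after the reorganization
import org.apache.spark.sql.sources.v2.reader.streaming.ContinuousReader
{code}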
[jira] [Created] (SPARK-23292) python tests related to pandas are skipped
Yin Huai created SPARK-23292: Summary: python tests related to pandas are skipped Key: SPARK-23292 URL: https://issues.apache.org/jira/browse/SPARK-23292 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.3.0 Reporter: Yin Huai I was running the Python tests and found that [pyspark.sql.tests.GroupbyAggPandasUDFTests.test_unsupported_types|https://github.com/apache/spark/blob/52e00f70663a87b5837235bdf72a3e6f84e11411/python/pyspark/sql/tests.py#L4528-L4548] does not run with Python 2 because the test uses "assertRaisesRegex" (supported by Python 3) instead of "assertRaisesRegexp" (supported by Python 2). However, the Spark Jenkins build does not fail because of this issue (see the run history [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/]). After looking into this, [it appears the test script skips pandas-related tests when pandas is not installed|https://github.com/apache/spark/blob/2ac895be909de7e58e1051dc2a1bba98a25bf4be/python/pyspark/sql/tests.py#L51-L63], which means that Jenkins does not have pandas installed. Since the pyarrow-related tests have the same skipping logic, we will also need to check whether Jenkins has pyarrow installed correctly. Since the features using pandas and pyarrow are in 2.3, we should fix the test issue and make sure all tests pass before we make the release. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23247) combines Unsafe operations and statistics operations in Scan Data Source
[ https://issues.apache.org/jira/browse/SPARK-23247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23247. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20415 [https://github.com/apache/spark/pull/20415]
> combines Unsafe operations and statistics operations in Scan Data Source
> Key: SPARK-23247
> URL: https://issues.apache.org/jira/browse/SPARK-23247
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: caoxuewen
> Assignee: caoxuewen
> Priority: Major
> Fix For: 2.4.0
>
> Currently, when scanning a data source, the execution plan first applies the unsafe projection to each row and then traverses the data a second time to count the rows. In terms of performance, the second pass is unnecessary; this PR combines the two operations and counts the rows while performing the unsafe projection.
> *Before the change:*
> {code}
> val unsafeRow = rdd.mapPartitionsWithIndexInternal { (index, iter) =>
>   val proj = UnsafeProjection.create(schema)
>   proj.initialize(index)
>   iter.map(proj)
> }
> val numOutputRows = longMetric("numOutputRows")
> unsafeRow.map { r =>
>   numOutputRows += 1
>   r
> }
> {code}
> *After the change:*
> {code}
> val numOutputRows = longMetric("numOutputRows")
> rdd.mapPartitionsWithIndexInternal { (index, iter) =>
>   val proj = UnsafeProjection.create(schema)
>   proj.initialize(index)
>   iter.map { r =>
>     numOutputRows += 1
>     proj(r)
>   }
> }
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23247) combines Unsafe operations and statistics operations in Scan Data Source
[ https://issues.apache.org/jira/browse/SPARK-23247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23247: --- Assignee: caoxuewen
> combines Unsafe operations and statistics operations in Scan Data Source
> Key: SPARK-23247
> URL: https://issues.apache.org/jira/browse/SPARK-23247
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: caoxuewen
> Assignee: caoxuewen
> Priority: Major
>
> Currently, when scanning a data source, the execution plan first applies the unsafe projection to each row and then traverses the data a second time to count the rows. In terms of performance, the second pass is unnecessary; this PR combines the two operations and counts the rows while performing the unsafe projection.
> *Before the change:*
> {code}
> val unsafeRow = rdd.mapPartitionsWithIndexInternal { (index, iter) =>
>   val proj = UnsafeProjection.create(schema)
>   proj.initialize(index)
>   iter.map(proj)
> }
> val numOutputRows = longMetric("numOutputRows")
> unsafeRow.map { r =>
>   numOutputRows += 1
>   r
> }
> {code}
> *After the change:*
> {code}
> val numOutputRows = longMetric("numOutputRows")
> rdd.mapPartitionsWithIndexInternal { (index, iter) =>
>   val proj = UnsafeProjection.create(schema)
>   proj.initialize(index)
>   iter.map { r =>
>     numOutputRows += 1
>     proj(r)
>   }
> }
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347982#comment-16347982 ] Gaurav Garg commented on SPARK-18016: - [~kiszk], I have Spark 2.2.0 environment, will have to co-ordinate admin team for update, but I have tried with updated jar dependencies in my code. Is that not fine ? > Code Generation: Constant Pool Past Limit for Wide/Nested Dataset > - > > Key: SPARK-18016 > URL: https://issues.apache.org/jira/browse/SPARK-18016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Aleksander Eskilson >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.3.0 > > Attachments: 910825_9.zip > > > When attempting to encode collections of large Java objects to Datasets > having very wide or deeply nested schemas, code generation can fail, yielding: > {code} > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for > class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection > has grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) > at > org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) > at > org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) > at > org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309) > at
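For anyone trying to reproduce this class of failure, a very wide schema is the easiest trigger. A rough sketch (the column count is illustrative; the exact threshold depends on the Spark version and the shape of the generated code):
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().master("local[*]").appName("wide-schema").getOrCreate()

// Build a DataFrame with thousands of columns; on affected versions the
// generated SpecificUnsafeProjection can grow past the 64K constant pool limit.
val base = spark.range(10).toDF("c0")
val wide = (1 to 4000).foldLeft(base)((df, i) => df.withColumn(s"c$i", lit(i)))
wide.collect()
{code}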
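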
[jira] [Resolved] (SPARK-23280) add map type support to ColumnVector
[ https://issues.apache.org/jira/browse/SPARK-23280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23280. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20450 [https://github.com/apache/spark/pull/20450] > add map type support to ColumnVector > > > Key: SPARK-23280 > URL: https://issues.apache.org/jira/browse/SPARK-23280 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1
[ https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347967#comment-16347967 ] Apache Spark commented on SPARK-23291: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/20464
> SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1
> Key: SPARK-23291
> URL: https://issues.apache.org/jira/browse/SPARK-23291
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 2.2.1
> Reporter: Narendra
> Priority: Major
>
> Defect Description:
> For example, an input string "2017-12-01" is read into a SparkR dataframe "df" with column name "col1".
> The target is to create a new column named "col2" with the value "12", which is inside the string. "12" can be extracted with "starting position" "6" and "ending position" "7"
> (the starting position of the first character is considered to be "1").
> But the code that currently has to be written is:
> df <- withColumn(df, "col2", substr(df$col1, 7, 8))
> Observe that the first argument of the "substr" API, which indicates the 'starting position', is given as "7".
> Also observe that the second argument, which indicates the 'ending position', is given as "8".
> I.e., the number passed to indicate a position has to be the "actual position + 1".
> Expected behavior:
> The code that should be needed is:
> df <- withColumn(df, "col2", substr(df$col1, 6, 7))
> Note:
> This defect is observed only when the starting position is greater than 1.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1
[ https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23291: Assignee: (was: Apache Spark)
> SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1
> Key: SPARK-23291
> URL: https://issues.apache.org/jira/browse/SPARK-23291
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 2.2.1
> Reporter: Narendra
> Priority: Major
>
> Defect Description:
> For example, an input string "2017-12-01" is read into a SparkR dataframe "df" with column name "col1".
> The target is to create a new column named "col2" with the value "12", which is inside the string. "12" can be extracted with "starting position" "6" and "ending position" "7"
> (the starting position of the first character is considered to be "1").
> But the code that currently has to be written is:
> df <- withColumn(df, "col2", substr(df$col1, 7, 8))
> Observe that the first argument of the "substr" API, which indicates the 'starting position', is given as "7".
> Also observe that the second argument, which indicates the 'ending position', is given as "8".
> I.e., the number passed to indicate a position has to be the "actual position + 1".
> Expected behavior:
> The code that should be needed is:
> df <- withColumn(df, "col2", substr(df$col1, 6, 7))
> Note:
> This defect is observed only when the starting position is greater than 1.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1
[ https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23291: Assignee: Apache Spark
> SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1
> Key: SPARK-23291
> URL: https://issues.apache.org/jira/browse/SPARK-23291
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 2.2.1
> Reporter: Narendra
> Assignee: Apache Spark
> Priority: Major
>
> Defect Description:
> For example, an input string "2017-12-01" is read into a SparkR dataframe "df" with column name "col1".
> The target is to create a new column named "col2" with the value "12", which is inside the string. "12" can be extracted with "starting position" "6" and "ending position" "7"
> (the starting position of the first character is considered to be "1").
> But the code that currently has to be written is:
> df <- withColumn(df, "col2", substr(df$col1, 7, 8))
> Observe that the first argument of the "substr" API, which indicates the 'starting position', is given as "7".
> Also observe that the second argument, which indicates the 'ending position', is given as "8".
> I.e., the number passed to indicate a position has to be the "actual position + 1".
> Expected behavior:
> The code that should be needed is:
> df <- withColumn(df, "col2", substr(df$col1, 6, 7))
> Note:
> This defect is observed only when the starting position is greater than 1.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version
[ https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347922#comment-16347922 ] Dongjoon Hyun commented on SPARK-14492: --- Hi, [~smilegator]. According to the doc and code, do we officially support old HMS like 0.14.0? Today, I tried `--conf spark.hadoop.hive.metastore.uris=thrift://xxx:9083 --conf spark.sql.hive.metastore.version=0.14.0 --conf spark.sql.hive.metastore.jars=/xxx/yyy/*`, but I met the following errors. {code} 18/02/01 02:07:00 WARN hive.metastore: set_ugi() not successful, Likely cause: new client talking to old server. Continuing without it. ... org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129) {code} > Spark SQL 1.6.0 does not work with external Hive metastore version lower than > 1.2.0; its not backwards compatible with earlier version > -- > > Key: SPARK-14492 > URL: https://issues.apache.org/jira/browse/SPARK-14492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunil Rangwani >Priority: Critical > > Spark SQL when configured with a Hive version lower than 1.2.0 throws a > java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME > because this field was introduced in Hive 1.2.0 so its not possible to use > Hive metastore version lower than 1.2.0 with Spark. The details of the Hive > changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 > {code:java} > Exception in thread "main" java.lang.NoSuchFieldError: > METASTORE_CLIENT_SOCKET_LIFETIME > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:271) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
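For reference, the intended way to talk to an older metastore is not to compile against older Hive classes but to pin the metastore client version and jars, as in the sketch below (the jar path is a placeholder, and whether versions as old as 0.14.0 actually work is exactly what the comment above is questioning):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("old-metastore")
  // version must be in the range Spark documents as supported
  .config("spark.sql.hive.metastore.version", "1.2.0")
  // placeholder path to the matching Hive client jars
  .config("spark.sql.hive.metastore.jars", "/path/to/hive/lib/*")
  .enableHiveSupport()
  .getOrCreate()
{code}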
[jira] [Reopened] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-23157: -- > withColumn fails for a column that is a result of mapped DataSet > > > Key: SPARK-23157 > URL: https://issues.apache.org/jira/browse/SPARK-23157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Tomasz Bartczak >Priority: Minor > Fix For: 2.3.0 > > > Having > {code:java} > case class R(id: String) > val ds = spark.createDataset(Seq(R("1"))) > {code} > This works: > {code} > scala> ds.withColumn("n", ds.col("id")) > res16: org.apache.spark.sql.DataFrame = [id: string, n: string] > {code} > but when we map over ds it fails: > {code} > scala> ds.withColumn("n", ds.map(a => a).col("id")) > org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing > from id#4 in operator !Project [id#4, id#55 AS n#57];; > !Project [id#4, id#55 AS n#57] > +- LocalRelation [id#4] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1150) > at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-23157: Assignee: Henry Robinson > withColumn fails for a column that is a result of mapped DataSet > > > Key: SPARK-23157 > URL: https://issues.apache.org/jira/browse/SPARK-23157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Tomasz Bartczak >Assignee: Henry Robinson >Priority: Minor > Fix For: 2.3.0 > > > Having > {code:java} > case class R(id: String) > val ds = spark.createDataset(Seq(R("1"))) > {code} > This works: > {code} > scala> ds.withColumn("n", ds.col("id")) > res16: org.apache.spark.sql.DataFrame = [id: string, n: string] > {code} > but when we map over ds it fails: > {code} > scala> ds.withColumn("n", ds.map(a => a).col("id")) > org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing > from id#4 in operator !Project [id#4, id#55 AS n#57];; > !Project [id#4, id#55 AS n#57] > +- LocalRelation [id#4] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1150) > at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23157. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20443 [https://github.com/apache/spark/pull/20443] > withColumn fails for a column that is a result of mapped DataSet > > > Key: SPARK-23157 > URL: https://issues.apache.org/jira/browse/SPARK-23157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Tomasz Bartczak >Assignee: Henry Robinson >Priority: Minor > Fix For: 2.3.0 > > > Having > {code:java} > case class R(id: String) > val ds = spark.createDataset(Seq(R("1"))) > {code} > This works: > {code} > scala> ds.withColumn("n", ds.col("id")) > res16: org.apache.spark.sql.DataFrame = [id: string, n: string] > {code} > but when we map over ds it fails: > {code} > scala> ds.withColumn("n", ds.map(a => a).col("id")) > org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing > from id#4 in operator !Project [id#4, id#55 AS n#57];; > !Project [id#4, id#55 AS n#57] > +- LocalRelation [id#4] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1150) > at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
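For users hitting this before picking up the fix, a workaround consistent with the analysis error above is to resolve the column against the same Dataset that the map produced, rather than against the original one:
{code}
import spark.implicits._ // spark: an existing SparkSession, as in spark-shell

case class R(id: String)
val ds = spark.createDataset(Seq(R("1")))

// Resolving "id" on the mapped Dataset itself keeps the attribute and the
// plan consistent, so the analyzer can resolve it:
val mapped = ds.map(a => a)
mapped.withColumn("n", mapped.col("id"))
{code}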
[jira] [Commented] (SPARK-10884) Support prediction on single instance for regression and classification related models
[ https://issues.apache.org/jira/browse/SPARK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347881#comment-16347881 ] Ming Ma commented on SPARK-10884: - Thanks [~WeichenXu123] and [~yanboliang]. While the long-term goal is to support "Pipeline for single instance" functionality, this specific patch is still quite useful. Any chance we can get it into the master branch soon? > Support prediction on single instance for regression and classification > related models > -- > > Key: SPARK-10884 > URL: https://issues.apache.org/jira/browse/SPARK-10884 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Major > Labels: 2.2.0 > > Support prediction on single instance for regression and classification > related models (i.e., PredictionModel, ClassificationModel and their sub > classes). > Add corresponding test cases. > See parent issue for more details. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
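For context, the patch under discussion lets a fitted model score a single feature vector directly, without wrapping it in a one-row DataFrame. Roughly the shape of the API being discussed (the exact method visibility and signature depend on the final patch):
{code}
import org.apache.spark.ml.linalg.Vectors

// model: a fitted PredictionModel, e.g. a LogisticRegressionModel
// Hypothetical single-instance call, bypassing transform() on a DataFrame:
val prediction: Double = model.predict(Vectors.dense(0.5, 1.2, -0.3))
{code}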
[jira] [Created] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1
Narendra created SPARK-23291: Summary: SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1 Key: SPARK-23291 URL: https://issues.apache.org/jira/browse/SPARK-23291 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.2.1 Reporter: Narendra Defect Description: For example, an input string "2017-12-01" is read into a SparkR dataframe "df" with column name "col1". The target is to create a new column named "col2" with the value "12", which is inside the string. "12" can be extracted with "starting position" "6" and "ending position" "7" (the starting position of the first character is considered to be "1"). But the code that currently has to be written is: df <- withColumn(df, "col2", substr(df$col1, 7, 8)) Observe that the first argument of the "substr" API, which indicates the 'starting position', is given as "7". Also observe that the second argument, which indicates the 'ending position', is given as "8". I.e., the number passed to indicate a position has to be the "actual position + 1". Expected behavior: The code that should be needed is: df <- withColumn(df, "col2", substr(df$col1, 6, 7)) Note: This defect is observed only when the starting position is greater than 1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23251) ClassNotFoundException: scala.Any when there's a missing implicit Map encoder
[ https://issues.apache.org/jira/browse/SPARK-23251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347808#comment-16347808 ] Michal Šenkýř commented on SPARK-23251: --- Yes, this does seem like the Map encoder is not checking whether appropriate encoders exist for the key and value types. I think I couldn't get the compiler to resolve it if I added the appropriate typeclass checks and wanted to support subclasses of the collection type at the same time. I will check in the following few days whether that was the case and try to figure out some alternative. > ClassNotFoundException: scala.Any when there's a missing implicit Map encoder > - > > Key: SPARK-23251 > URL: https://issues.apache.org/jira/browse/SPARK-23251 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: mac os high sierra, centos 7 >Reporter: Bruce Robbins >Priority: Minor > > In branch-2.2, when you attempt to use row.getValuesMap[Any] without an > implicit Map encoder, you get a nice descriptive compile-time error: > {noformat} > scala> df.map(row => row.getValuesMap[Any](List("stationName", > "year"))).collect > :26: error: Unable to find encoder for type stored in a Dataset. > Primitive types (Int, String, etc) and Product types (case classes) are > supported by importing spark.implicits._ Support for serializing other types > will be added in future releases. > df.map(row => row.getValuesMap[Any](List("stationName", > "year"))).collect > ^ > scala> implicit val mapEncoder = > org.apache.spark.sql.Encoders.kryo[Map[String, Any]] > mapEncoder: org.apache.spark.sql.Encoder[Map[String,Any]] = class[value[0]: > binary] > scala> df.map(row => row.getValuesMap[Any](List("stationName", > "year"))).collect > res1: Array[Map[String,Any]] = Array(Map(stationName -> 007026 9, year -> > 2014), Map(stationName -> 007026 9, year -> 2014), Map(stationName -> > 007026 9, year -> 2014), > etc... 
> {noformat} > > On the latest master and also on branch-2.3, the transformation compiles (at > least on spark-shell), but throws a ClassNotFoundException: > > {noformat} > scala> df.map(row => row.getValuesMap[Any](List("stationName", > "year"))).collect > java.lang.ClassNotFoundException: scala.Any > at > scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > scala.reflect.runtime.JavaMirrors$JavaMirror.javaClass(JavaMirrors.scala:555) > at > scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1211) > at > scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1203) > at > scala.reflect.runtime.TwoWayCaches$TwoWayCache$$anonfun$toJava$1.apply(TwoWayCaches.scala:49) > at scala.reflect.runtime.Gil$class.gilSynchronized(Gil.scala:19) > at scala.reflect.runtime.JavaUniverse.gilSynchronized(JavaUniverse.scala:16) > at > scala.reflect.runtime.TwoWayCaches$TwoWayCache.toJava(TwoWayCaches.scala:44) > at > scala.reflect.runtime.JavaMirrors$JavaMirror.classToJava(JavaMirrors.scala:1203) > at > scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:194) > at > scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:54) > at > org.apache.spark.sql.catalyst.ScalaReflection$.getClassFromType(ScalaReflection.scala:700) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor$1.apply(ScalaReflection.scala:84) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor$1.apply(ScalaReflection.scala:65) > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) > at > org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor(ScalaReflection.scala:64) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:512) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445) > at >
[jira] [Created] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe
Andre Menck created SPARK-23290: --- Summary: inadvertent change in handling of DateType when converting to pandas dataframe Key: SPARK-23290 URL: https://issues.apache.org/jira/browse/SPARK-23290 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.0 Reporter: Andre Menck In [this PR|https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968] there was a change in how `DateType` is being returned to users (line 1968 in dataframe.py). This can cause client code to fail, as in the following example from a python terminal:
{code:python}
>>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
>>> pdf.dtypes
date    object
num      int64
dtype: object
>>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date())
0    2015-01-01
Name: date, dtype: object
>>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
>>> pdf.dtypes
date    object
num      int64
dtype: object
>>> pdf['date'] = pd.to_datetime(pdf['date'])
>>> pdf.dtypes
date    datetime64[ns]
num              int64
dtype: object
>>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/amenck/anaconda2/lib/python2.7/site-packages/pandas/core/series.py", line 2355, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/src/inference.pyx", line 1574, in pandas._libs.lib.map_infer
  File "<stdin>", line 1, in <lambda>
TypeError: strptime() argument 1 must be string, not Timestamp
{code}
Above we show both the old behavior (returning an "object" col) and the new behavior (returning a datetime column). Since there may be user code relying on the old behavior, I'd suggest reverting this specific part of this change. Also note that the NOTE on the docstring for the "_to_corrected_pandas_type" seems to be off, referring to the old behavior and not the current one. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19209) "No suitable driver" on first try
[ https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347700#comment-16347700 ] Tony Xu edited comment on SPARK-19209 at 1/31/18 10:26 PM: --- This seems to be a forgotten issue but I'm still experiencing it in Spark 2.2.1 Could this issue be related to the Driver itself? For example, I tried using the MySQL JDBC driver and that seems to work fine on the first try. However, when I try using Snowflake's JDBC driver, I run into this exact issue. I'm not sure what the difference between these two drivers are but it might be worth digging into was (Author: txu0393): This seems like a forgotten issue but I'm still experiencing it in Spark 2.2.1 Could this issue be related to the Driver itself? For example, I tried using the MySQL JDBC driver and that seems to work fine on the first try. However, when I try using Snowflake's JDBC driver, I run into this exact issue. I'm not sure what the difference between these two drivers are but it might be worth digging into > "No suitable driver" on first try > - > > Key: SPARK-19209 > URL: https://issues.apache.org/jira/browse/SPARK-19209 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Daniel Darabos >Priority: Critical > > This is a regression from Spark 2.0.2. Observe! > {code} > $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar > --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar > [...] > scala> spark.read.format("jdbc").option("url", > "jdbc:sqlite:").option("dbtable", "x").load > java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such > table: x) > {code} > This is the "good" exception. Now with Spark 2.1.0: > {code} > $ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar > --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar > [...] > scala> spark.read.format("jdbc").option("url", > "jdbc:sqlite:").option("dbtable", "x").load > java.sql.SQLException: No suitable driver > at java.sql.DriverManager.getDriver(DriverManager.java:315) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125) > ... 48 elided > scala> spark.read.format("jdbc").option("url", > "jdbc:sqlite:").option("dbtable", "x").load > java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such > table: x) > {code} > Simply re-executing the same command a second time "fixes" the {{No suitable > driver}} error. > My guess is this is fallout from https://github.com/apache/spark/pull/15292 > which changed the JDBC driver management code. But this code is so hard to > understand for me, I could be totally wrong. > This is nothing more than a nuisance for {{spark-shell}} usage, but it is > more painful to work around for applications. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19209) "No suitable driver" on first try
[ https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347700#comment-16347700 ] Tony Xu commented on SPARK-19209: - This seems like a forgotten issue but I'm still experiencing it in Spark 2.2.1 Could this issue be related to the Driver itself? For example, I tried using the MySQL JDBC driver and that seems to work fine on the first try. However, when I try using Snowflake's JDBC driver, I run into this exact issue. I'm not sure what the difference between these two drivers are but it might be worth digging into > "No suitable driver" on first try > - > > Key: SPARK-19209 > URL: https://issues.apache.org/jira/browse/SPARK-19209 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Daniel Darabos >Priority: Critical > > This is a regression from Spark 2.0.2. Observe! > {code} > $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar > --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar > [...] > scala> spark.read.format("jdbc").option("url", > "jdbc:sqlite:").option("dbtable", "x").load > java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such > table: x) > {code} > This is the "good" exception. Now with Spark 2.1.0: > {code} > $ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar > --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar > [...] > scala> spark.read.format("jdbc").option("url", > "jdbc:sqlite:").option("dbtable", "x").load > java.sql.SQLException: No suitable driver > at java.sql.DriverManager.getDriver(DriverManager.java:315) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125) > ... 48 elided > scala> spark.read.format("jdbc").option("url", > "jdbc:sqlite:").option("dbtable", "x").load > java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such > table: x) > {code} > Simply re-executing the same command a second time "fixes" the {{No suitable > driver}} error. > My guess is this is fallout from https://github.com/apache/spark/pull/15292 > which changed the JDBC driver management code. But this code is so hard to > understand for me, I could be totally wrong. > This is nothing more than a nuisance for {{spark-shell}} usage, but it is > more painful to work around for applications. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
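A workaround that can sidestep the first-try DriverManager lookup is to name the driver class explicitly via the JDBC source's driver option (shown here for the SQLite example above):
{code}
spark.read.format("jdbc")
  .option("url", "jdbc:sqlite:")
  .option("driver", "org.sqlite.JDBC") // avoids relying on DriverManager.getDriver
  .option("dbtable", "x")
  .load()
{code}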
[jira] [Commented] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
[ https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347684#comment-16347684 ] Apache Spark commented on SPARK-23020: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/20462 > Re-enable Flaky Test: > org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher > > > Key: SPARK-23020 > URL: https://issues.apache.org/jira/browse/SPARK-23020 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.0 >Reporter: Sameer Agarwal >Assignee: Marcelo Vanzin >Priority: Blocker > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23289) OneForOneBlockFetcher.DownloadCallback.onData may write just a part of data
[ https://issues.apache.org/jira/browse/SPARK-23289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23289: Assignee: Apache Spark (was: Shixiong Zhu) > OneForOneBlockFetcher.DownloadCallback.onData may write just a part of data > --- > > Key: SPARK-23289 > URL: https://issues.apache.org/jira/browse/SPARK-23289 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23289) OneForOneBlockFetcher.DownloadCallback.onData may write just a part of data
[ https://issues.apache.org/jira/browse/SPARK-23289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347679#comment-16347679 ] Apache Spark commented on SPARK-23289: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/20461 > OneForOneBlockFetcher.DownloadCallback.onData may write just a part of data > --- > > Key: SPARK-23289 > URL: https://issues.apache.org/jira/browse/SPARK-23289 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23289) OneForOneBlockFetcher.DownloadCallback.onData may write just a part of data
[ https://issues.apache.org/jira/browse/SPARK-23289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23289: Assignee: Shixiong Zhu (was: Apache Spark) > OneForOneBlockFetcher.DownloadCallback.onData may write just a part of data > --- > > Key: SPARK-23289 > URL: https://issues.apache.org/jira/browse/SPARK-23289 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
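The ticket carries no description, but the title points at a standard NIO pitfall: a single WritableByteChannel.write call may consume only part of a ByteBuffer, so a download callback must loop until the buffer is drained. A generic sketch of the correct pattern (not the actual Spark patch):
{code}
import java.nio.ByteBuffer
import java.nio.channels.WritableByteChannel

// write() may return having written fewer bytes than buf.remaining(),
// so keep writing until nothing remains.
def writeFully(channel: WritableByteChannel, buf: ByteBuffer): Unit = {
  while (buf.hasRemaining) {
    channel.write(buf)
  }
}
{code}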
[jira] [Updated] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Edwina Lu updated SPARK-23206: -- Attachment: StageTab.png
> Additional Memory Tuning Metrics
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.2.1
> Reporter: Edwina Lu
> Priority: Major
> Attachments: ExecutorsTab.png, MemoryTuningMetricsDesignDoc.pdf, StageTab.png
>
> At LinkedIn, we have multiple clusters, running thousands of Spark applications, and these numbers are growing rapidly. We need to ensure that these Spark applications are well tuned – cluster resources, including memory, should be used efficiently so that the cluster can support running more applications concurrently, and applications should run quickly and reliably.
> Currently there is limited visibility into how much memory executors are using, and users are guessing numbers for executor and driver memory sizing. These estimates are often much larger than needed, leading to memory wastage. Examining the metrics for one cluster for a month, the average percentage of used executor memory (max JVM used memory across executors / spark.executor.memory) is 35%, leading to an average of 591GB unused memory per application (number of executors * (spark.executor.memory - max JVM used memory)). Spark has multiple memory regions (user memory, execution memory, storage memory, and overhead memory), and to understand how memory is being used and fine-tune allocation between regions, it would be useful to have information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and different memory regions, the following additional memory metrics can be tracked for each executor and driver:
> * JVM used memory: the JVM heap size for the executor/driver.
> * Execution memory: memory used for computation in shuffles, joins, sorts and aggregations.
> * Storage memory: memory used for caching and propagating internal data across the cluster.
> * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and also per stage. This information can be shown in the Spark UI and the REST APIs. Information for peak JVM used memory can help with determining appropriate values for spark.executor.memory and spark.driver.memory, and information about the unified memory region can help with determining appropriate values for spark.memory.fraction and spark.memory.storageFraction. Stage memory information can help identify which stages are most memory-intensive, and users can look into the relevant code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, execution memory and storage memory to the heartbeat. SparkListeners are modified to collect the new metrics for the executors, stages and Spark history log. Only interesting values (peak values per stage per executor) are recorded in the Spark history log, to minimize the amount of additional logging.
> We have attached our design documentation with this ticket and would like to receive feedback from the community on this proposal.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
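A minimal sketch of the collection approach the proposal outlines, assuming a hypothetical heartbeat payload (the MemoryHeartbeat and PeakTracker names are illustrative, not part of the proposal's code): only peak values are kept per executor, and unified memory is derived as execution plus storage.
{code:scala}
import scala.collection.mutable

// Hypothetical heartbeat payload carrying the metrics named in the proposal.
case class MemoryHeartbeat(execId: String, jvmUsed: Long, execMem: Long, storageMem: Long)

// Keep only peak values per executor, mirroring "only interesting values
// (peak values per stage per executor) are recorded".
class PeakTracker {
  private val peaks = mutable.Map.empty[String, MemoryHeartbeat]

  def update(hb: MemoryHeartbeat): Unit = {
    val prev = peaks.getOrElse(hb.execId, MemoryHeartbeat(hb.execId, 0L, 0L, 0L))
    peaks(hb.execId) = MemoryHeartbeat(
      hb.execId,
      math.max(prev.jvmUsed, hb.jvmUsed),
      math.max(prev.execMem, hb.execMem),
      math.max(prev.storageMem, hb.storageMem))
  }

  // Unified memory = execution + storage, as defined in the description.
  def unifiedPeak(execId: String): Option[Long] =
    peaks.get(execId).map(p => p.execMem + p.storageMem)
}
{code}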
[jira] [Updated] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Edwina Lu updated SPARK-23206: -- Attachment: (was: StageTab.png) > Additional Memory Tuning Metrics > > > Key: SPARK-23206 > URL: https://issues.apache.org/jira/browse/SPARK-23206 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Edwina Lu >Priority: Major > Attachments: ExecutorsTab.png, MemoryTuningMetricsDesignDoc.pdf > > > At LinkedIn, we have multiple clusters, running thousands of Spark > applications, and these numbers are growing rapidly. We need to ensure that > these Spark applications are well tuned – cluster resources, including > memory, should be used efficiently so that the cluster can support running > more applications concurrently, and applications should run quickly and > reliably. > Currently there is limited visibility into how much memory executors are > using, and users are guessing numbers for executor and driver memory sizing. > These estimates are often much larger than needed, leading to memory wastage. > Examining the metrics for one cluster for a month, the average percentage of > used executor memory (max JVM used memory across executors / > spark.executor.memory) is 35%, leading to an average of 591GB unused memory > per application (number of executors * (spark.executor.memory - max JVM used > memory)). Spark has multiple memory regions (user memory, execution memory, > storage memory, and overhead memory), and to understand how memory is being > used and fine-tune allocation between regions, it would be useful to have > information about how much memory is being used for the different regions. > To improve visibility into memory usage for the driver and executors and > different memory regions, the following additional memory metrics can be > tracked for each executor and driver: > * JVM used memory: the JVM heap size for the executor/driver. > * Execution memory: memory used for computation in shuffles, joins, sorts > and aggregations. > * Storage memory: memory used for caching and propagating internal data across > the cluster. > * Unified memory: sum of execution and storage memory. > The peak values for each memory metric can be tracked for each executor, and > also per stage. This information can be shown in the Spark UI and the REST > APIs. Information for peak JVM used memory can help with determining > appropriate values for spark.executor.memory and spark.driver.memory, and > information about the unified memory region can help with determining > appropriate values for spark.memory.fraction and > spark.memory.storageFraction. Stage memory information can help identify > which stages are most memory intensive, and users can look into the relevant > code to determine if it can be optimized. > The memory metrics can be gathered by adding the current JVM used memory, > execution memory and storage memory to the heartbeat. SparkListeners are > modified to collect the new metrics for the executors, stages and Spark > history log. Only interesting values (peak values per stage per executor) are > recorded in the Spark history log, to minimize the amount of additional > logging. > We have attached our design documentation with this ticket and would like to > receive feedback from the community for this proposal.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23289) OneForOneBlockFetcher.DownloadCallback.onData may write just a part of data
Shixiong Zhu created SPARK-23289: Summary: OneForOneBlockFetcher.DownloadCallback.onData may write just a part of data Key: SPARK-23289 URL: https://issues.apache.org/jira/browse/SPARK-23289 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.1, 2.2.0, 2.3.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
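For context on the bug class named in the summary: a single WritableByteChannel.write() call is allowed to consume only part of its buffer, so callers must loop until the buffer is drained. A sketch of that general pattern (illustrative only, not the actual patch):
{code:scala}
import java.nio.ByteBuffer
import java.nio.channels.WritableByteChannel

// write() may return after writing only some of the remaining bytes;
// looping until hasRemaining is false guarantees the full buffer is written.
def writeFully(channel: WritableByteChannel, buf: ByteBuffer): Unit = {
  while (buf.hasRemaining) {
    channel.write(buf)
  }
}
{code}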
[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347658#comment-16347658 ] Yanbo Liang commented on SPARK-23107: - [~sameerag] Yes, the fix should be an API scope change, but I think we can get them merged in one or two days. When do you plan to cut the next RC? Thanks. > ML, Graph 2.3 QA: API: New Scala APIs, docs > --- > > Key: SPARK-23107 > URL: https://issues.apache.org/jira/browse/SPARK-23107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Blocker > > Audit new public Scala APIs added to MLlib & GraphX. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please create JIRAs and link them to this issue. > For *user guide issues* link the new JIRAs to the relevant user guide QA > issue (SPARK-23111 for {{2.3}}) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23285) Allow spark.executor.cores to be fractional
[ https://issues.apache.org/jira/browse/SPARK-23285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23285: Assignee: (was: Apache Spark) > Allow spark.executor.cores to be fractional > --- > > Key: SPARK-23285 > URL: https://issues.apache.org/jira/browse/SPARK-23285 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Scheduler, Spark Submit >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Minor > > There is a strong check for an integral number of cores per executor in > [SparkSubmitArguments.scala#L270-L272|https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L270-L272]. > Given we're reusing that property in K8s, does it make sense to relax it? > > K8s treats CPU as a "compressible resource" and can actually assign millicpus > to individual containers. Also to be noted - spark.driver.cores has no such > check in place. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23285) Allow spark.executor.cores to be fractional
[ https://issues.apache.org/jira/browse/SPARK-23285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23285: Assignee: Apache Spark > Allow spark.executor.cores to be fractional > --- > > Key: SPARK-23285 > URL: https://issues.apache.org/jira/browse/SPARK-23285 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Scheduler, Spark Submit >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Assignee: Apache Spark >Priority: Minor > > There is a strong check for an integral number of cores per executor in > [SparkSubmitArguments.scala#L270-L272|https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L270-L272]. > Given we're reusing that property in K8s, does it make sense to relax it? > > K8s treats CPU as a "compressible resource" and can actually assign millicpus > to individual containers. Also to be noted - spark.driver.cores has no such > check in place. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23285) Allow spark.executor.cores to be fractional
[ https://issues.apache.org/jira/browse/SPARK-23285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347651#comment-16347651 ] Apache Spark commented on SPARK-23285: -- User 'liyinan926' has created a pull request for this issue: https://github.com/apache/spark/pull/20460 > Allow spark.executor.cores to be fractional > --- > > Key: SPARK-23285 > URL: https://issues.apache.org/jira/browse/SPARK-23285 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Scheduler, Spark Submit >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Minor > > There is a strong check for an integral number of cores per executor in > [SparkSubmitArguments.scala#L270-L272|https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L270-L272]. > Given we're reusing that property in K8s, does it make sense to relax it? > > K8s treats CPU as a "compressible resource" and can actually assign millicpus > to individual containers. Also to be noted - spark.driver.cores has no such > check in place. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23107: Assignee: Apache Spark (was: Yanbo Liang) > ML, Graph 2.3 QA: API: New Scala APIs, docs > --- > > Key: SPARK-23107 > URL: https://issues.apache.org/jira/browse/SPARK-23107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Blocker > > Audit new public Scala APIs added to MLlib & GraphX. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please create JIRAs and link them to this issue. > For *user guide issues* link the new JIRAs to the relevant user guide QA > issue (SPARK-23111 for {{2.3}}) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23107: Assignee: Yanbo Liang (was: Apache Spark) > ML, Graph 2.3 QA: API: New Scala APIs, docs > --- > > Key: SPARK-23107 > URL: https://issues.apache.org/jira/browse/SPARK-23107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Blocker > > Audit new public Scala APIs added to MLlib & GraphX. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please create JIRAs and link them to this issue. > For *user guide issues* link the new JIRAs to the relevant user guide QA > issue (SPARK-23111 for {{2.3}}) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347592#comment-16347592 ] Apache Spark commented on SPARK-23107: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/20459 > ML, Graph 2.3 QA: API: New Scala APIs, docs > --- > > Key: SPARK-23107 > URL: https://issues.apache.org/jira/browse/SPARK-23107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Blocker > > Audit new public Scala APIs added to MLlib & GraphX. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please create JIRAs and link them to this issue. > For *user guide issues* link the new JIRAs to the relevant user guide QA > issue (SPARK-23111 for {{2.3}}) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347576#comment-16347576 ] Sameer Agarwal commented on SPARK-23107: [~yanboliang] other than adding docs, are you considering any pending API changes that should block the next RC? > ML, Graph 2.3 QA: API: New Scala APIs, docs > --- > > Key: SPARK-23107 > URL: https://issues.apache.org/jira/browse/SPARK-23107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Blocker > > Audit new public Scala APIs added to MLlib & GraphX. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please create JIRAs and link them to this issue. > For *user guide issues* link the new JIRAs to the relevant user guide QA > issue (SPARK-23111 for {{2.3}}) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21525) ReceiverSupervisorImpl seems to ignore the error code when writing to the WAL
[ https://issues.apache.org/jira/browse/SPARK-21525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-21525. Resolution: Fixed Fix Version/s: 2.4.0 > ReceiverSupervisorImpl seems to ignore the error code when writing to the WAL > - > > Key: SPARK-21525 > URL: https://issues.apache.org/jira/browse/SPARK-21525 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Mark Grover >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 2.4.0 > > > {{AddBlock}} returns an error code related to whether writing the block to > the WAL was successful or not. In cases where a WAL may be unavailable > temporarily, the write would fail but it seems like we are not using the > return code (see > [here|https://github.com/apache/spark/blob/ba8912e5f3d5c5a366cb3d1f6be91f2471d048d2/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisorImpl.scala#L162]). > For example, when using the Flume Receiver, we should be sending a nack back > to Flume if the block wasn't written to the WAL. I haven't gone through the > full code path yet but at least from looking at the ReceiverSupervisorImpl, > it doesn't seem like that return code is being used. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21525) ReceiverSupervisorImpl seems to ignore the error code when writing to the WAL
[ https://issues.apache.org/jira/browse/SPARK-21525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-21525: -- Assignee: Marcelo Vanzin > ReceiverSupervisorImpl seems to ignore the error code when writing to the WAL > - > > Key: SPARK-21525 > URL: https://issues.apache.org/jira/browse/SPARK-21525 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Mark Grover >Assignee: Marcelo Vanzin >Priority: Major > > {{AddBlock}} returns an error code related to whether writing the block to > the WAL was successful or not. In cases where a WAL may be unavailable > temporarily, the write would fail but it seems like we are not using the > return code (see > [here|https://github.com/apache/spark/blob/ba8912e5f3d5c5a366cb3d1f6be91f2471d048d2/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisorImpl.scala#L162]). > For example, when using the Flume Receiver, we should be sending a nack back > to Flume if the block wasn't written to the WAL. I haven't gone through the > full code path yet but at least from looking at the ReceiverSupervisorImpl, > it doesn't seem like that return code is being used. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
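A sketch of the shape of the fix discussed above (names simplified; this is not the committed patch): the Boolean result of the WAL write is inspected and surfaced instead of being discarded, so a receiver such as the Flume one can nack the batch.
{code:scala}
import org.apache.spark.SparkException

// writeToWal stands in for the AddBlock round trip; the point is only that
// its Boolean result must be checked rather than dropped.
def storeBlockChecked(writeToWal: () => Boolean): Unit = {
  if (!writeToWal()) {
    // Propagate the failure so the receiver can report it upstream.
    throw new SparkException("Failed to write the block to the WAL")
  }
}
{code}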
[jira] [Commented] (SPARK-23285) Allow spark.executor.cores to be fractional
[ https://issues.apache.org/jira/browse/SPARK-23285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347484#comment-16347484 ] Yinan Li commented on SPARK-23285: -- Another option is to bypass that check for Kubernetes mode. This minimizes the code changes. Thoughts? > Allow spark.executor.cores to be fractional > --- > > Key: SPARK-23285 > URL: https://issues.apache.org/jira/browse/SPARK-23285 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Scheduler, Spark Submit >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Minor > > There is a strong check for an integral number of cores per executor in > [SparkSubmitArguments.scala#L270-L272|https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L270-L272]. > Given we're reusing that property in K8s, does it make sense to relax it? > > K8s treats CPU as a "compressible resource" and can actually assign millicpus > to individual containers. Also to be noted - spark.driver.cores has no such > check in place. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
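A sketch of the relaxation being discussed (illustrative only; the isKubernetes helper is a hypothetical stand-in, not existing code): accept fractional values when the master is Kubernetes, and keep the integral check otherwise.
{code:scala}
import scala.util.Try

// Hypothetical relaxed validation: K8s can honor fractional CPU requests
// (millicpus), so only non-K8s masters keep the integer-only restriction.
def validateExecutorCores(executorCores: String, master: String): Unit = {
  def isKubernetes(m: String): Boolean = m.startsWith("k8s://")
  val parsed: Option[Double] =
    if (isKubernetes(master)) Try(executorCores.toDouble).toOption
    else Try(executorCores.toInt.toDouble).toOption
  require(parsed.exists(_ > 0),
    s"spark.executor.cores must be a positive " +
      s"${if (isKubernetes(master)) "number" else "integer"}, got: $executorCores")
}
{code}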
[jira] [Commented] (SPARK-23274) ReplaceExceptWithFilter fails on dataframes filtered on same column
[ https://issues.apache.org/jira/browse/SPARK-23274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347480#comment-16347480 ] Andrew Ash commented on SPARK-23274: Many thanks for the fast fix [~smilegator]! > ReplaceExceptWithFilter fails on dataframes filtered on same column > --- > > Key: SPARK-23274 > URL: https://issues.apache.org/jira/browse/SPARK-23274 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Onur Satici >Assignee: Xiao Li >Priority: Blocker > Fix For: 2.3.0 > > > Currently affects: > {code:java} > $ git tag --contains 01f6ba0e7a > v2.3.0-rc1 > v2.3.0-rc2 > {code} > Steps to reproduce: > {code:java} > $ cat test.csv > a,b > 1,2 > 1,3 > 2,2 > 2,4 > {code} > {code:java} > val df = spark.read.format("csv").option("header", "true").load("test.csv") > val df1 = df.filter($"a" === 1) > val df2 = df.filter($"a" === 2) > df1.select("b").except(df2.select("b")).show > {code} > results in: > {code:java} > java.util.NoSuchElementException: key not found: a > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at scala.collection.AbstractMap.apply(Map.scala:59) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$.org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition(ReplaceExceptWithFilter.scala:60) > at > 
org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:50) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:48) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at >
[jira] [Resolved] (SPARK-20826) Support compression/decompression of ColumnVector in generated code
[ https://issues.apache.org/jira/browse/SPARK-20826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-20826. -- Resolution: Duplicate > Support compression/decompression of ColumnVector in generated code > --- > > Key: SPARK-20826 > URL: https://issues.apache.org/jira/browse/SPARK-20826 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347405#comment-16347405 ] Weichen Xu commented on SPARK-23110: [~yanboliang] OK. > ML 2.3 QA: API: Java compatibility, docs > > > Key: SPARK-23110 > URL: https://issues.apache.org/jira/browse/SPARK-23110 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Blocker > Attachments: 1_process_script.sh, added_ml_class, > different_methods_in_ML.diff > > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). > * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are not great tools. In the past, this task has been done by: > ** Generating API docs > ** Building JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so we can make this > task easier in the future! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20823) Generate code to build table cache using ColumnarBatch and to get value from ColumnVector for other types
[ https://issues.apache.org/jira/browse/SPARK-20823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-20823. -- Resolution: Won't Fix > Generate code to build table cache using ColumnarBatch and to get value from > ColumnVector for other types > - > > Key: SPARK-20823 > URL: https://issues.apache.org/jira/browse/SPARK-20823 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20821) Add compression/decompression of data to ColumnVector for other data types
[ https://issues.apache.org/jira/browse/SPARK-20821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-20821. -- Resolution: Won't Fix > Add compression/decompression of data to ColumnVector for other data types > -- > > Key: SPARK-20821 > URL: https://issues.apache.org/jira/browse/SPARK-20821 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20820) Add compression/decompression of data to ColumnVector for other compression schemes
[ https://issues.apache.org/jira/browse/SPARK-20820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-20820. -- Resolution: Duplicate > Add compression/decompression of data to ColumnVector for other compression > schemes > --- > > Key: SPARK-20820 > URL: https://issues.apache.org/jira/browse/SPARK-20820 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20824) Generate code to get value from table cache with wider column in ColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-20824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-20824. -- Resolution: Won't Fix > Generate code to get value from table cache with wider column in ColumnarBatch > -- > > Key: SPARK-20824 > URL: https://issues.apache.org/jira/browse/SPARK-20824 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20825) Generate code to get value from table cache with wider column in ColumnarBatch for other data types
[ https://issues.apache.org/jira/browse/SPARK-20825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-20825. -- Resolution: Not A Problem > Generate code to get value from table cache with wider column in > ColumnarBatch for other data types > --- > > Key: SPARK-20825 > URL: https://issues.apache.org/jira/browse/SPARK-20825 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22747) Shorten lifetime of global variables used in HashAggregateExec
[ https://issues.apache.org/jira/browse/SPARK-22747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-22747. -- Resolution: Won't Fix > Shorten lifetime of global variables used in HashAggregateExec > -- > > Key: SPARK-22747 > URL: https://issues.apache.org/jira/browse/SPARK-22747 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > Generated code in {{HashAggregateExec}} uses global mutable variables that > are passed to successor operations through the {{consume()}} method. It may cause > the issue described in SPARK-22668. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)
[ https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21657: Component/s: (was: Spark Core) > Spark has exponential time complexity to explode(array of structs) > -- > > Key: SPARK-21657 > URL: https://issues.apache.org/jira/browse/SPARK-21657 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0 >Reporter: Ruslan Dautkhanov >Assignee: Ohad Raviv >Priority: Major > Labels: cache, caching, collections, nested_types, performance, > pyspark, sparksql, sql > Fix For: 2.3.0 > > Attachments: ExponentialTimeGrowth.PNG, > nested-data-generator-and-test.py > > > It can take up to half a day to explode a modest-sized nested collection > (0.5m) on recent Xeon processors. > See the attached pyspark script that reproduces this problem. > {code} > cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + > table_name).cache() > print cached_df.count() > {code} > This script generates a number of tables with the same total number of > records across all nested collections (see the `scaling` variable in the loops). > The `scaling` variable scales up how many nested elements are in each record, but > scales down the number of records in the table by the same factor, so the total number > of records stays the same. > Time grows exponentially (notice the log-10 vertical axis scale): > !ExponentialTimeGrowth.PNG! > At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to > explode the nested collections (!) of 8k records. > After 1000 elements in a nested collection, time grows exponentially. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347385#comment-16347385 ] Apache Spark commented on SPARK-23110: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/20457 > ML 2.3 QA: API: Java compatibility, docs > > > Key: SPARK-23110 > URL: https://issues.apache.org/jira/browse/SPARK-23110 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Blocker > Attachments: 1_process_script.sh, added_ml_class, > different_methods_in_ML.diff > > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). > * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are not great tools. In the past, this task has been done by: > ** Generating API docs > ** Building JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so we can make this > task easier in the future! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22036) BigDecimal multiplication sometimes returns null
[ https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22036: Component/s: (was: Spark Core) SQL > BigDecimal multiplication sometimes returns null > > > Key: SPARK-22036 > URL: https://issues.apache.org/jira/browse/SPARK-22036 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Olivier Blanvillain >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.0 > > > The multiplication of two BigDecimal numbers sometimes returns null. Here is > a minimal reproduction: > {code:java} > object Main extends App { > import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.sql.SparkSession > val conf = new > SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", > "false") > val spark = > SparkSession.builder().config(conf).appName("REPL").getOrCreate() > import spark.implicits._ > implicit val sqlContext = spark.sqlContext > case class X2(a: BigDecimal, b: BigDecimal) > val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), > BigDecimal(-1000.1)))) > val result = ds.select(ds("a") * ds("b")).collect.head > println(result) // [null] > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
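For readers hitting this: the null comes from decimal precision promotion rather than from the multiplication itself. Scala BigDecimal maps to DecimalType(38, 18) by default, and multiplying two such columns requests precision p1 + p2 + 1 and scale s1 + s2 (77 and 36 here), which must be capped at the 38-digit maximum; on 2.2.x a product that no longer fits the capped type is returned as null. A quick way to see the promoted type (a sketch, assuming a running SparkSession named spark):
{code:scala}
import spark.implicits._

val df = Seq((BigDecimal(-0.1267333984375), BigDecimal(-1000.1))).toDF("a", "b")
df.select(df("a") * df("b")).printSchema()
// On 2.2.x this prints a capped type along the lines of:
// root
//  |-- (a * b): decimal(38,36) (nullable = true)
// 126.75... needs 3 integer digits, but 38 - 36 leaves room for only 2,
// so the exact value cannot be represented and the result becomes null.
{code}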
[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347386#comment-16347386 ] Yanbo Liang commented on SPARK-23107: - [~mlnick] Sorry for the late response, I have been really busy recently. This task is almost finished; I will submit the PR today or tomorrow. Thanks. > ML, Graph 2.3 QA: API: New Scala APIs, docs > --- > > Key: SPARK-23107 > URL: https://issues.apache.org/jira/browse/SPARK-23107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Blocker > > Audit new public Scala APIs added to MLlib & GraphX. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please create JIRAs and link them to this issue. > For *user guide issues* link the new JIRAs to the relevant user guide QA > issue (SPARK-23111 for {{2.3}}) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23110: Assignee: Apache Spark (was: Weichen Xu) > ML 2.3 QA: API: Java compatibility, docs > > > Key: SPARK-23110 > URL: https://issues.apache.org/jira/browse/SPARK-23110 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Blocker > Attachments: 1_process_script.sh, added_ml_class, > different_methods_in_ML.diff > > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). > * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are not great tools. In the past, this task has been done by: > ** Generating API docs > ** Building JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so we can make this > task easier in the future! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23110: Assignee: Weichen Xu (was: Apache Spark) > ML 2.3 QA: API: Java compatibility, docs > > > Key: SPARK-23110 > URL: https://issues.apache.org/jira/browse/SPARK-23110 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Blocker > Attachments: 1_process_script.sh, added_ml_class, > different_methods_in_ML.diff > > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). > * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are not great tools. In the past, this task has been done by: > ** Generating API docs > ** Building JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so we can make this > task easier in the future! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347381#comment-16347381 ] Yanbo Liang commented on SPARK-23110: - [~mlnick] Yes, we should make it {{private[ml]}}. I can fix it in SPARK-23107, thanks for picking it up. > ML 2.3 QA: API: Java compatibility, docs > > > Key: SPARK-23110 > URL: https://issues.apache.org/jira/browse/SPARK-23110 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Blocker > Attachments: 1_process_script.sh, added_ml_class, > different_methods_in_ML.diff > > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). > * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are not great tools. In the past, this task has been done by: > ** Generating API docs > ** Building JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so we can make this > task easier in the future! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22130) UTF8String.trim() inefficiently scans all white-space string twice.
[ https://issues.apache.org/jira/browse/SPARK-22130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22130: Component/s: (was: Spark Core) SQL > UTF8String.trim() inefficiently scans all white-space string twice. > --- > > Key: SPARK-22130 > URL: https://issues.apache.org/jira/browse/SPARK-22130 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Fix For: 2.3.0 > > > {{UTF8String.trim()}} scans a string consisting only of white space (e.g. {{" > "}}) twice, which is inefficient. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
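A sketch of the single-pass idea behind the fix (operating on plain Strings here for brevity; UTF8String applies the same logic over its bytes): advance a start pointer and retreat an end pointer until they meet, so an all-white-space string is scanned once rather than twice.
{code:scala}
def trimSpaces(s: String): String = {
  var start = 0
  var end = s.length - 1
  // Find the first non-space character.
  while (start <= end && s.charAt(start) == ' ') start += 1
  // Walk back from the end, but never past `start`, so an all-space
  // string is not scanned a second time.
  while (end > start && s.charAt(end) == ' ') end -= 1
  if (start > end) "" else s.substring(start, end + 1)
}
{code}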
[jira] [Updated] (SPARK-22247) Hive partition filter very slow
[ https://issues.apache.org/jira/browse/SPARK-22247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22247: Component/s: (was: Spark Core) > Hive partition filter very slow > --- > > Key: SPARK-22247 > URL: https://issues.apache.org/jira/browse/SPARK-22247 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.2, 2.1.1 >Reporter: Patrick Duin >Priority: Minor > Fix For: 2.3.0 > > > I found an issue where filtering partitions using a dataframe results in very > bad performance. > To reproduce: > Create a Hive table with a lot of partitions and write a Spark query on that > table that filters based on the partition column. > In my use case I've got a table with about 30k partitions. > I filter the partitions using some scala via spark-shell: > {{table.filter("partition=x or partition=y")}} > This results in a Hive Thrift API call: {{#get_partitions('db', 'table', > -1)}}, which is very slow (minutes) and loads all metastore partitions into memory. > Doing a simpler filter: > {{table.filter("partition=x")}} > results in a Hive Thrift API call: {{#get_partitions_by_filter('db', 'table', > 'partition = "x"', -1)}}, which is very fast (seconds) and only fetches > partition X into memory. > If possible, Spark should translate the filter into the more performant Thrift > call, or fall back to a more scalable solution where it filters the partitions > without having to load them all into memory first (for instance fetching > the partitions in batches). > I've posted my original question on > [SO|https://stackoverflow.com/questions/46152526/how-should-i-configure-spark-to-correctly-prune-hive-metastore-partitions] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
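A related knob worth checking on the affected versions (behavior differs across 2.0.x and 2.1.x, so treat this as a hint rather than a fix): when metastore partition pruning is enabled, Spark tries to push partition predicates down as a metastore filter string, and predicates it cannot convert, such as some disjunctions on these versions, fall back to the slow full listing described above.
{code:scala}
// Assumes `spark` and `table` as in the report above.
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")

val fast = table.filter("partition = 'x'") // convertible: get_partitions_by_filter
val slow = table.filter("partition = 'x' or partition = 'y'") // may fall back to get_partitions
{code}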
[jira] [Updated] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error
[ https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22291: Component/s: (was: Spark Core) > Postgresql UUID[] to Cassandra: Conversion Error > > > Key: SPARK-22291 > URL: https://issues.apache.org/jira/browse/SPARK-22291 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Debian Linux, Scala 2.11, Spark 2.2.0, PostgreSQL 9.6, > Cassandra 3 >Reporter: Fabio J. Walter >Assignee: Jen-Ming Chung >Priority: Major > Labels: patch, postgresql, sql > Fix For: 2.2.1, 2.3.0 > > Attachments: > org_apache_spark_sql_execution_datasources_jdbc_JdbcUtil.png > > > My job reads data from a PostgreSQL table that contains a user_ids column of > uuid[] type, and I get the error below when trying to save the data > to Cassandra. > However, creating this same table on Cassandra, with user_ids as a > list, works fine. > I can't change the type on the source table, because I'm reading data from a > legacy system. > I've been looking at the point printed in the log, in class > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.scala > Stacktrace on Spark: > {noformat} > Caused by: java.lang.ClassCastException: [Ljava.util.UUID; cannot be cast to > [Ljava.lang.String; > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:443) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:442) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$nullSafeConvert(JdbcUtils.scala:482) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:470) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:469) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:133) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038) > at > 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at
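A common workaround for this class of mismatch (a sketch, not the committed fix; the connection details and the legacy_table name are placeholders): let PostgreSQL cast the column so the JDBC driver hands Spark a text[] instead of a uuid[].
{code:scala}
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("user", "spark")
  .option("password", "***")
  // Push the uuid[] -> text[] conversion into PostgreSQL itself.
  .option("dbtable", "(SELECT user_ids::text[] AS user_ids FROM legacy_table) t")
  .load()
{code}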
[jira] [Resolved] (SPARK-22935) Dataset with Java Beans for java.sql.Date throws CompileException
[ https://issues.apache.org/jira/browse/SPARK-22935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-22935. -- Resolution: Invalid > Dataset with Java Beans for java.sql.Date throws CompileException > - > > Key: SPARK-22935 > URL: https://issues.apache.org/jira/browse/SPARK-22935 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > The following code can throw an exception with or without whole-stage codegen. > {code} > public void SPARK22935() { > Dataset<CDR> cdr = spark > .read() > .format("csv") > .option("header", "true") > .option("inferSchema", "true") > .option("delimiter", ";") > .csv("CDR_SAMPLE.csv") > .as(Encoders.bean(CDR.class)); > Dataset<CDR> ds = cdr.filter((FilterFunction<CDR>) x -> (x.timestamp != > null)); > long c = ds.count(); > cdr.show(2); > ds.show(2); > System.out.println("cnt=" + c); > } > // CDR.java > public class CDR implements java.io.Serializable { > public java.sql.Date timestamp; > public java.sql.Date getTimestamp() { return this.timestamp; } > public void setTimestamp(java.sql.Date timestamp) { this.timestamp = > timestamp; } > } > // CDR_SAMPLE.csv > timestamp > 2017-10-29T02:37:07.815Z > 2017-10-29T02:38:07.815Z > {code} > result > {code} > 12:17:10.352 ERROR > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 61, Column 70: No applicable constructor/method found > for actual parameters "long"; candidates are: "public static java.sql.Date > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)" > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 61, Column 70: No applicable constructor/method found for actual parameters > "long"; candidates are: "public static java.sql.Date > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)" > at > org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821) > ... > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
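A note on why this was likely closed as Invalid (the diagnosis here is inferred from the error message, so treat it as a sketch): with inferSchema the ISO-8601 values are read as TimestampType, whose internal form is a long of microseconds, while the bean declares java.sql.Date, whose converter toJavaDate expects an int of days, hence the "No applicable constructor/method found for actual parameters long" failure. Declaring the bean field as java.sql.Timestamp lines the types up; the inference itself can be checked like this (assuming a running SparkSession named spark and the CSV from the report):
{code:scala}
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ";")
  .csv("CDR_SAMPLE.csv")
df.printSchema()
// root
//  |-- timestamp: timestamp (nullable = true)   <- TimestampType, not DateType
{code}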
[jira] [Commented] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347369#comment-16347369 ] Weichen Xu commented on SPARK-23110: +1. We should make it `private[ml]`. > ML 2.3 QA: API: Java compatibility, docs > > > Key: SPARK-23110 > URL: https://issues.apache.org/jira/browse/SPARK-23110 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Blocker > Attachments: 1_process_script.sh, added_ml_class, > different_methods_in_ML.diff > > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). > * If needed for complex issues, create small Java unit tests which execute > each method; see the sketch after this message. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are no great tools. In the past, this task has been done by: > ** Generating API docs > ** Building the JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so, so that we can make this task easier in the future! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
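As the checklist recommends, one lightweight check is a Java source file that calls spark.ml APIs directly: compiling it with javac is itself the test, since a signature that only resolves through Scala-specific types fails at compile time. A minimal sketch, assuming LogisticRegression as a representative estimator; the training Dataset would come from a test fixture.
{code}
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class JavaCompatCheck {
  // If a spark.ml method here leaked a Scala-only type (e.g. a Scala
  // collection or an inner object type), this file would not compile.
  public static LogisticRegressionModel check(Dataset<Row> training) {
    LogisticRegression lr = new LogisticRegression()
        .setMaxIter(10)
        .setRegParam(0.01);
    return lr.fit(training);
  }
}
{code}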
[jira] [Updated] (SPARK-19236) Add createOrReplaceGlobalTempView
[ https://issues.apache.org/jira/browse/SPARK-19236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-19236: Component/s: (was: Spark Core) SQL > Add createOrReplaceGlobalTempView > - > > Key: SPARK-19236 > URL: https://issues.apache.org/jira/browse/SPARK-19236 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Arman Yazdani >Assignee: Arman Yazdani >Priority: Minor > Fix For: 2.2.0, 2.3.0 > > > There are 3 methods for saving temp views: > createTempView > createOrReplaceTempView > createGlobalTempView > but there isn't a: > createOrReplaceGlobalTempView -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
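With the fix in 2.2.0/2.3.0, usage mirrors the existing temp-view counterpart; global temp views are resolved under the global_temp database. A brief sketch, where the view name "people" is hypothetical.
{code}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GlobalViewExample {
  public static void register(SparkSession spark, Dataset<Row> df) {
    // Replaces an existing global view of the same name instead of
    // failing the way createGlobalTempView does on a duplicate.
    df.createOrReplaceGlobalTempView("people");
    // Global temp views live in the cross-session global_temp database.
    spark.sql("SELECT * FROM global_temp.people").show();
  }
}
{code}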