[jira] [Updated] (SPARK-23295) Exclude warning messages when generating versions in make-distribution.sh
[ https://issues.apache.org/jira/browse/SPARK-23295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-23295: - Description: When we specify a wrong profile while making a Spark distribution, such as -Phadoop1000, we get an oddly named package like: spark-[WARNING] The requested profile "hadoop1000" could not be activated because it does not exist.-bin-hadoop-2.7.tgz which should actually be `"spark-$VERSION-bin-$NAME.tgz"` was: When we specify a wrong profile while making a Spark distribution, such as `-Phadoop1000`, we get an oddly named package like `spark-[WARNING] The requested profile "hadoop1000" could not be activated because it does not exist.-bin-hadoop-2.7.tgz`, which should actually be `"spark-$VERSION-bin-$NAME.tgz"` > Exclude warning messages when generating versions in make-distribution.sh > > > Key: SPARK-23295 > URL: https://issues.apache.org/jira/browse/SPARK-23295 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.2.1 >Reporter: Kent Yao >Priority: Major > > When we specify a wrong profile while making a Spark distribution, such as > -Phadoop1000, we get an oddly named package like: > spark-[WARNING] The requested profile "hadoop1000" could not be activated > because it does not exist.-bin-hadoop-2.7.tgz > which should actually be `"spark-$VERSION-bin-$NAME.tgz"` -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23295) Exclude warning messages when generating versions in make-distribution.sh
[ https://issues.apache.org/jira/browse/SPARK-23295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23295: Assignee: Apache Spark > Exclude warning messages when generating versions in make-distribution.sh > > > Key: SPARK-23295 > URL: https://issues.apache.org/jira/browse/SPARK-23295 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.2.1 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > When we specify a wrong profile while making a Spark distribution, such as > `-Phadoop1000`, we get an oddly named package like `spark-[WARNING] The > requested profile "hadoop1000" could not be activated because it does not > exist.-bin-hadoop-2.7.tgz`, which should actually be > `"spark-$VERSION-bin-$NAME.tgz"` -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23295) Exclude warning messages when generating versions in make-distribution.sh
[ https://issues.apache.org/jira/browse/SPARK-23295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348146#comment-16348146 ] Apache Spark commented on SPARK-23295: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/20469 > Exclude warning messages when generating versions in make-distribution.sh > > > Key: SPARK-23295 > URL: https://issues.apache.org/jira/browse/SPARK-23295 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.2.1 >Reporter: Kent Yao >Priority: Major > > When we specify a wrong profile while making a Spark distribution, such as > `-Phadoop1000`, we get an oddly named package like `spark-[WARNING] The > requested profile "hadoop1000" could not be activated because it does not > exist.-bin-hadoop-2.7.tgz`, which should actually be > `"spark-$VERSION-bin-$NAME.tgz"` -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23295) Exclude warning messages when generating versions in make-distribution.sh
[ https://issues.apache.org/jira/browse/SPARK-23295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23295: Assignee: (was: Apache Spark) > Exclude warning messages when generating versions in make-distribution.sh > > > Key: SPARK-23295 > URL: https://issues.apache.org/jira/browse/SPARK-23295 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.2.1 >Reporter: Kent Yao >Priority: Major > > When we specify a wrong profile while making a Spark distribution, such as > `-Phadoop1000`, we get an oddly named package like `spark-[WARNING] The > requested profile "hadoop1000" could not be activated because it does not > exist.-bin-hadoop-2.7.tgz`, which should actually be > `"spark-$VERSION-bin-$NAME.tgz"` -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23284) Document the behavior of several ColumnVector get APIs when accessing a null slot
[ https://issues.apache.org/jira/browse/SPARK-23284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348145#comment-16348145 ] Wenchen Fan commented on SPARK-23284: - Since this defines the "return null" behavior for `ColumnVector` implementations, it's good to get this in before the 2.3 release. But technically Spark won't call `ColumnVector.getXXX` if that slot is null, so it's OK to leave it to 2.4. Thus I'm not going to mark this as a blocker. cc [~sameerag] > Document the behavior of several ColumnVector get APIs when accessing a null slot > > > Key: SPARK-23284 > URL: https://issues.apache.org/jira/browse/SPARK-23284 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > We should clearly document the behavior of some ColumnVector get APIs, such as > getBinary, getStruct, getArray, etc., when accessing a null slot. Those APIs > should return null if the slot is null. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
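As an illustration of the contract being documented, here is a minimal Scala sketch of the caller-side pattern, assuming the public org.apache.spark.sql.vectorized.ColumnVector API (the readBinary helper is hypothetical): a caller either checks isNullAt first or treats a null return from a get API as a null slot.
{code:scala}
import org.apache.spark.sql.vectorized.ColumnVector

// Defensive read: check isNullAt before calling a get API, or rely on the
// documented behavior that getBinary returns null for a null slot.
def readBinary(col: ColumnVector, rowId: Int): Option[Array[Byte]] =
  if (col.isNullAt(rowId)) None else Option(col.getBinary(rowId))
{code}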
[jira] [Created] (SPARK-23295) Exclude warning messages when generating versions in make-distribution.sh
Kent Yao created SPARK-23295: Summary: Exclude warning messages when generating versions in make-distribution.sh Key: SPARK-23295 URL: https://issues.apache.org/jira/browse/SPARK-23295 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.2.1 Reporter: Kent Yao When we specify a wrong profile while making a Spark distribution, such as `-Phadoop1000`, we get an oddly named package like `spark-[WARNING] The requested profile "hadoop1000" could not be activated because it does not exist.-bin-hadoop-2.7.tgz`, which should actually be `"spark-$VERSION-bin-$NAME.tgz"` -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23284) Document the behavior of several ColumnVector get APIs when accessing a null slot
[ https://issues.apache.org/jira/browse/SPARK-23284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-23284: Target Version/s: 2.3.0 > Document the behavior of several ColumnVector get APIs when accessing a null slot > > > Key: SPARK-23284 > URL: https://issues.apache.org/jira/browse/SPARK-23284 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > We should clearly document the behavior of some ColumnVector get APIs, such as > getBinary, getStruct, getArray, etc., when accessing a null slot. Those APIs > should return null if the slot is null. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23284) Document the behavior of several ColumnVector get APIs when accessing a null slot
[ https://issues.apache.org/jira/browse/SPARK-23284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-23284: Affects Version/s: (was: 2.4.0) 2.3.0 > Document the behavior of several ColumnVector get APIs when accessing a null slot > > > Key: SPARK-23284 > URL: https://issues.apache.org/jira/browse/SPARK-23284 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > We should clearly document the behavior of some ColumnVector get APIs, such as > getBinary, getStruct, getArray, etc., when accessing a null slot. Those APIs > should return null if the slot is null. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23202) Add new API in DataSourceWriter: onDataWriterCommit
[ https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-23202: --- Summary: Add new API in DataSourceWriter: onDataWriterCommit (was: Break down DataSourceV2Writer.commit into two phases) > Add new API in DataSourceWriter: onDataWriterCommit > --- > > Key: SPARK-23202 > URL: https://issues.apache.org/jira/browse/SPARK-23202 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Gengliang Wang >Priority: Major > > Currently, the API DataSourceV2Writer#commit(WriterCommitMessage[]) commits a > writing job with a list of commit messages. > It makes sense in some scenarios, e.g. MicroBatchExecution. > However, on receiving a commit message, the driver can start processing > messages (e.g. persisting messages into files) before all the messages are > collected. > The proposal is to break down DataSourceV2Writer.commit into two phases: > # add(WriterCommitMessage message): Handles a commit message produced by > {@link DataWriter#commit()}. > # commit(): Commits the writing job. > This should make the API more flexible, and more reasonable for implementing > some data sources. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23202) Add new API in DataSourceWriter: onDataWriterCommit
[ https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-23202: --- Description: The current DataSourceWriter API makes it hard to implement {{onTaskCommit(taskCommit: TaskCommitMessage)}} in {{FileCommitProtocol}}. In general, on receiving a commit message, the driver can start processing messages (e.g. persisting messages into files) before all the messages are collected. The proposal is to add a new API: {{add(WriterCommitMessage message)}}: Handles a commit message on receiving it from a successful data writer. This should make the whole DataSourceWriter API compatible with {{FileCommitProtocol}}, and more flexible. There was another, more radical attempt in [#20386|https://github.com/apache/spark/pull/20386]. Creating a new API as in [#20454|https://github.com/apache/spark/pull/20454] is more reasonable. was: The current DataSourceWriter API makes it hard to implement {{onTaskCommit(taskCommit: TaskCommitMessage)}} in {{FileCommitProtocol}}. In general, on receiving a commit message, the driver can start processing messages (e.g. persisting messages into files) before all the messages are collected. The proposal is to add a new API: {{add(WriterCommitMessage message)}}: Handles a commit message on receiving it from a successful data writer. This should make the whole DataSourceWriter API compatible with {{FileCommitProtocol}}, and more flexible. There was another, more radical attempt in [#20386|https://github.com/apache/spark/pull/20386]. Creating a new API is more reasonable. > Add new API in DataSourceWriter: onDataWriterCommit > --- > > Key: SPARK-23202 > URL: https://issues.apache.org/jira/browse/SPARK-23202 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Gengliang Wang >Priority: Major > > The current DataSourceWriter API makes it hard to implement > {{onTaskCommit(taskCommit: TaskCommitMessage)}} in {{FileCommitProtocol}}. > In general, on receiving a commit message, the driver can start processing > messages (e.g. persisting messages into files) before all the messages are > collected. > The proposal is to add a new API: > {{add(WriterCommitMessage message)}}: Handles a commit message on receiving it > from a successful data writer. > This should make the whole DataSourceWriter API compatible with > {{FileCommitProtocol}}, and more flexible. > There was another, more radical attempt in > [#20386|https://github.com/apache/spark/pull/20386]. Creating a new API as in > [#20454|https://github.com/apache/spark/pull/20454] is more reasonable. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23202) Add new API in DataSourceWriter: onDataWriterCommit
[ https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-23202: --- Description: The current DataSourceWriter API makes it hard to implement {{onTaskCommit(taskCommit: TaskCommitMessage)}} in {{FileCommitProtocol}}. In general, on receiving a commit message, the driver can start processing messages (e.g. persisting messages into files) before all the messages are collected. The proposal is to add a new API: {{add(WriterCommitMessage message)}}: Handles a commit message on receiving it from a successful data writer. This should make the whole DataSourceWriter API compatible with {{FileCommitProtocol}}, and more flexible. There was another, more radical attempt in [#20386|https://github.com/apache/spark/pull/20386]. Creating a new API is more reasonable. was: Currently, the API DataSourceV2Writer#commit(WriterCommitMessage[]) commits a writing job with a list of commit messages. It makes sense in some scenarios, e.g. MicroBatchExecution. However, on receiving a commit message, the driver can start processing messages (e.g. persisting messages into files) before all the messages are collected. The proposal is to break down DataSourceV2Writer.commit into two phases: # add(WriterCommitMessage message): Handles a commit message produced by {@link DataWriter#commit()}. # commit(): Commits the writing job. This should make the API more flexible, and more reasonable for implementing some data sources. > Add new API in DataSourceWriter: onDataWriterCommit > --- > > Key: SPARK-23202 > URL: https://issues.apache.org/jira/browse/SPARK-23202 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Gengliang Wang >Priority: Major > > The current DataSourceWriter API makes it hard to implement > {{onTaskCommit(taskCommit: TaskCommitMessage)}} in {{FileCommitProtocol}}. > In general, on receiving a commit message, the driver can start processing > messages (e.g. persisting messages into files) before all the messages are > collected. > The proposal is to add a new API: > {{add(WriterCommitMessage message)}}: Handles a commit message on receiving it > from a successful data writer. > This should make the whole DataSourceWriter API compatible with > {{FileCommitProtocol}}, and more flexible. > There was another, more radical attempt in > [#20386|https://github.com/apache/spark/pull/20386]. Creating a new API is > more reasonable. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
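To make the shape of the proposal concrete, a minimal Scala sketch of the two-phase flow (the trait name here is illustrative, not the final API; WriterCommitMessage is the existing v2 writer message type, and the method eventually added was named onDataWriterCommit):
{code:scala}
import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage

trait TwoPhaseWriteJob {
  // Phase 1: called once per successful data writer, as each commit
  // message reaches the driver, so messages can be processed incrementally.
  def onDataWriterCommit(message: WriterCommitMessage): Unit
  // Phase 2: commits the whole writing job after all messages are handled.
  def commit(messages: Array[WriterCommitMessage]): Unit
}
{code}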
[jira] [Updated] (SPARK-23202) Break down DataSourceV2Writer.commit into two phases
[ https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23202: Priority: Major (was: Blocker) > Break down DataSourceV2Writer.commit into two phases > --- > > Key: SPARK-23202 > URL: https://issues.apache.org/jira/browse/SPARK-23202 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Gengliang Wang >Priority: Major > > Currently, the API DataSourceV2Writer#commit(WriterCommitMessage[]) commits a > writing job with a list of commit messages. > It makes sense in some scenarios, e.g. MicroBatchExecution. > However, on receiving a commit message, the driver can start processing > messages (e.g. persisting messages into files) before all the messages are > collected. > The proposal is to break down DataSourceV2Writer.commit into two phases: > # add(WriterCommitMessage message): Handles a commit message produced by > {@link DataWriter#commit()}. > # commit(): Commits the writing job. > This should make the API more flexible, and more reasonable for implementing > some data sources. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23202) Break down DataSourceV2Writer.commit into two phases
[ https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23202: Target Version/s: 2.4.0 (was: 2.3.0) > Break down DataSourceV2Writer.commit into two phases > --- > > Key: SPARK-23202 > URL: https://issues.apache.org/jira/browse/SPARK-23202 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Gengliang Wang >Priority: Major > > Currently, the API DataSourceV2Writer#commit(WriterCommitMessage[]) commits a > writing job with a list of commit messages. > It makes sense in some scenarios, e.g. MicroBatchExecution. > However, on receiving a commit message, the driver can start processing > messages (e.g. persisting messages into files) before all the messages are > collected. > The proposal is to break down DataSourceV2Writer.commit into two phases: > # add(WriterCommitMessage message): Handles a commit message produced by > {@link DataWriter#commit()}. > # commit(): Commits the writing job. > This should make the API more flexible, and more reasonable for implementing > some data sources. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23280) add map type support to ColumnVector
[ https://issues.apache.org/jira/browse/SPARK-23280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348109#comment-16348109 ] Apache Spark commented on SPARK-23280: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/20468 > add map type support to ColumnVector > > > Key: SPARK-23280 > URL: https://issues.apache.org/jira/browse/SPARK-23280 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10884) Support prediction on single instance for regression and classification related models
[ https://issues.apache.org/jira/browse/SPARK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348108#comment-16348108 ] Weichen Xu commented on SPARK-10884: [~mingma] I hope so, but the code is now frozen and the release is in QA. Sorry about that! > Support prediction on single instance for regression and classification > related models > -- > > Key: SPARK-10884 > URL: https://issues.apache.org/jira/browse/SPARK-10884 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Major > Labels: 2.2.0 > > Support prediction on single instance for regression and classification > related models (i.e., PredictionModel, ClassificationModel and their > subclasses). > Add corresponding test cases. > See parent issue for more details. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10884) Support prediction on single instance for regression and classification related models
[ https://issues.apache.org/jira/browse/SPARK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348094#comment-16348094 ] Ming Ma commented on SPARK-10884: - Cool. Could we get it in for 2.3? This will bring Spark one step closer to providing real-time prediction, and getting it into 2.3 will make it available to more applications sooner. Also, the patch looks pretty straightforward and the risk seems pretty low. > Support prediction on single instance for regression and classification > related models > -- > > Key: SPARK-10884 > URL: https://issues.apache.org/jira/browse/SPARK-10884 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Major > Labels: 2.2.0 > > Support prediction on single instance for regression and classification > related models (i.e., PredictionModel, ClassificationModel and their > subclasses). > Add corresponding test cases. > See parent issue for more details. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
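For context, a minimal sketch of what single-instance prediction looks like for callers, assuming a fitted LogisticRegressionModel and the public predict-on-Vector method this ticket proposes to expose (today the equivalent requires building a one-row DataFrame):
{code:scala}
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.linalg.Vectors

// Score a single feature vector directly, without a DataFrame round trip.
def scoreOne(model: LogisticRegressionModel): Double =
  model.predict(Vectors.dense(0.5, 1.2, -0.3))
{code}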
[jira] [Updated] (SPARK-23294) Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated
[ https://issues.apache.org/jira/browse/SPARK-23294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravinder Matte updated SPARK-23294: --- Labels: ConsoleSink RateSource backpressure maxRate (was: ConsoleSink RateSource) > Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated > --- > > Key: SPARK-23294 > URL: https://issues.apache.org/jira/browse/SPARK-23294 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1 >Reporter: Ravinder Matte >Assignee: Takeshi Yamamuro >Priority: Minor > Labels: ConsoleSink, RateSource, backpressure, maxRate > > Using the following configs while building the SparkSession: > ("spark.streaming.backpressure.enabled", "true") > ("spark.streaming.receiver.maxRate", "100") > ("spark.streaming.backpressure.initialRate", "100") > > Source: Rate source with the following options: > rowsPerSecond=10 > Sink: Console sink. > > I expect the processing rate to be capped at 100 rows per second, but maxRate > is ignored and the streaming job processes at a rate of 10. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23294) Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated
[ https://issues.apache.org/jira/browse/SPARK-23294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravinder Matte updated SPARK-23294: --- Description: Using the following configs while building the SparkSession: ("spark.streaming.backpressure.enabled", "true") ("spark.streaming.receiver.maxRate", "100") ("spark.streaming.backpressure.initialRate", "100") Source: Rate source with the following options: rowsPerSecond=10 Sink: Console sink. I expect the processing rate to be capped at 100 rows per second, but maxRate is ignored and the streaming job processes at a rate of 10. was: I am calling spark-submit passing maxRate; I have a single Kinesis receiver, and batches of 1s: spark-submit --conf spark.streaming.receiver.maxRate=10 However, a single batch can greatly exceed the established maxRate, e.g. I'm getting 300 records. It looks like Kinesis is completely ignoring the spark.streaming.receiver.maxRate configuration. If you look inside KinesisReceiver.onStart, you see: val kinesisClientLibConfiguration = new KinesisClientLibConfiguration(checkpointAppName, streamName, awsCredProvider, workerId) .withKinesisEndpoint(endpointUrl) .withInitialPositionInStream(initialPositionInStream) .withTaskBackoffTimeMillis(500) .withRegionName(regionName) This constructor ends up calling another constructor which has a lot of default values for the configuration. One of those values is DEFAULT_MAX_RECORDS, which is constantly set to 10,000 records. > Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated > --- > > Key: SPARK-23294 > URL: https://issues.apache.org/jira/browse/SPARK-23294 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1 >Reporter: Ravinder Matte >Assignee: Takeshi Yamamuro >Priority: Minor > Labels: ConsoleSink, RateSource > > Using the following configs while building the SparkSession: > ("spark.streaming.backpressure.enabled", "true") > ("spark.streaming.receiver.maxRate", "100") > ("spark.streaming.backpressure.initialRate", "100") > > Source: Rate source with the following options: > rowsPerSecond=10 > Sink: Console sink. > > I expect the processing rate to be capped at 100 rows per second, but maxRate > is ignored and the streaming job processes at a rate of 10. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
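For reference, a minimal runnable sketch of the setup described above (Structured Streaming rate source feeding the console sink). Note that the spark.streaming.* keys quoted in the report configure the old DStream receiver path; for the rate source, throughput is governed by rowsPerSecond itself:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("rate-to-console").getOrCreate()

// The rate source emits rowsPerSecond rows per second;
// spark.streaming.receiver.maxRate does not apply to this source.
val rate = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
rate.writeStream.format("console").start().awaitTermination()
{code}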
[jira] [Updated] (SPARK-23200) Reset configuration when restarting from checkpoints
[ https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-23200: Fix Version/s: (was: 2.4.0) > Reset configuration when restarting from checkpoints > > > Key: SPARK-23200 > URL: https://issues.apache.org/jira/browse/SPARK-23200 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Major > > Streaming workloads and restarting from checkpoints may need additional > changes, i.e. resetting properties - see > https://github.com/apache-spark-on-k8s/spark/pull/516 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23294) Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated
[ https://issues.apache.org/jira/browse/SPARK-23294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravinder Matte updated SPARK-23294: --- Labels: ConsoleSink RateSource (was: kinesis) > Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated > --- > > Key: SPARK-23294 > URL: https://issues.apache.org/jira/browse/SPARK-23294 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1 >Reporter: Ravinder Matte >Assignee: Takeshi Yamamuro >Priority: Minor > Labels: ConsoleSink, RateSource > > I am calling spark-submit passing maxRate; I have a single Kinesis receiver, > and batches of 1s: > spark-submit --conf spark.streaming.receiver.maxRate=10 > However, a single batch can greatly exceed the established maxRate, e.g. I'm > getting 300 records. > It looks like Kinesis is completely ignoring the > spark.streaming.receiver.maxRate configuration. > If you look inside KinesisReceiver.onStart, you see: > val kinesisClientLibConfiguration = > new KinesisClientLibConfiguration(checkpointAppName, streamName, > awsCredProvider, workerId) > .withKinesisEndpoint(endpointUrl) > .withInitialPositionInStream(initialPositionInStream) > .withTaskBackoffTimeMillis(500) > .withRegionName(regionName) > This constructor ends up calling another constructor which has a lot of > default values for the configuration. One of those values is > DEFAULT_MAX_RECORDS, which is constantly set to 10,000 records. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23294) Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated
[ https://issues.apache.org/jira/browse/SPARK-23294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravinder Matte updated SPARK-23294: --- Fix Version/s: (was: 2.2.0) > Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated > --- > > Key: SPARK-23294 > URL: https://issues.apache.org/jira/browse/SPARK-23294 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1 >Reporter: Ravinder Matte >Assignee: Takeshi Yamamuro >Priority: Minor > Labels: ConsoleSink, RateSource > > I am calling spark-submit passing maxRate; I have a single Kinesis receiver, > and batches of 1s: > spark-submit --conf spark.streaming.receiver.maxRate=10 > However, a single batch can greatly exceed the established maxRate, e.g. I'm > getting 300 records. > It looks like Kinesis is completely ignoring the > spark.streaming.receiver.maxRate configuration. > If you look inside KinesisReceiver.onStart, you see: > val kinesisClientLibConfiguration = > new KinesisClientLibConfiguration(checkpointAppName, streamName, > awsCredProvider, workerId) > .withKinesisEndpoint(endpointUrl) > .withInitialPositionInStream(initialPositionInStream) > .withTaskBackoffTimeMillis(500) > .withRegionName(regionName) > This constructor ends up calling another constructor which has a lot of > default values for the configuration. One of those values is > DEFAULT_MAX_RECORDS, which is constantly set to 10,000 records. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-23200) Reset configuration when restarting from checkpoints
[ https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao reopened SPARK-23200: - Assignee: (was: Santiago Saavedra) > Reset configuration when restarting from checkpoints > > > Key: SPARK-23200 > URL: https://issues.apache.org/jira/browse/SPARK-23200 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Major > > Streaming workloads and restarting from checkpoints may need additional > changes, i.e. resetting properties - see > https://github.com/apache-spark-on-k8s/spark/pull/516 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
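For context, the restart-from-checkpoint path in question follows the standard getOrCreate pattern sketched below (paths and names are illustrative); on restart, the SparkConf stored in the checkpoint is replayed, which is why cluster-specific properties may need to be reset:
{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("/tmp/checkpoints")  // illustrative checkpoint directory
  ssc
}

// If a checkpoint exists, the context (and its stored configuration) is
// restored from it; otherwise createContext() builds a fresh one.
val ssc = StreamingContext.getOrCreate("/tmp/checkpoints", createContext _)
{code}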
[jira] [Updated] (SPARK-23294) Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated
[ https://issues.apache.org/jira/browse/SPARK-23294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravinder Matte updated SPARK-23294: --- Component/s: (was: DStreams) Structured Streaming > Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated > --- > > Key: SPARK-23294 > URL: https://issues.apache.org/jira/browse/SPARK-23294 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1 >Reporter: Ravinder Matte >Assignee: Takeshi Yamamuro >Priority: Minor > Labels: ConsoleSink, RateSource > > I am calling spark-submit passing maxRate; I have a single Kinesis receiver, > and batches of 1s: > spark-submit --conf spark.streaming.receiver.maxRate=10 > However, a single batch can greatly exceed the established maxRate, e.g. I'm > getting 300 records. > It looks like Kinesis is completely ignoring the > spark.streaming.receiver.maxRate configuration. > If you look inside KinesisReceiver.onStart, you see: > val kinesisClientLibConfiguration = > new KinesisClientLibConfiguration(checkpointAppName, streamName, > awsCredProvider, workerId) > .withKinesisEndpoint(endpointUrl) > .withInitialPositionInStream(initialPositionInStream) > .withTaskBackoffTimeMillis(500) > .withRegionName(regionName) > This constructor ends up calling another constructor which has a lot of > default values for the configuration. One of those values is > DEFAULT_MAX_RECORDS, which is constantly set to 10,000 records. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23294) Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated
[ https://issues.apache.org/jira/browse/SPARK-23294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravinder Matte updated SPARK-23294: --- Affects Version/s: (was: 2.0.2) 2.2.1 > Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated > --- > > Key: SPARK-23294 > URL: https://issues.apache.org/jira/browse/SPARK-23294 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.1 >Reporter: Ravinder Matte >Assignee: Takeshi Yamamuro >Priority: Minor > Labels: kinesis > Fix For: 2.2.0 > > > I am calling spark-submit passing maxRate; I have a single Kinesis receiver, > and batches of 1s: > spark-submit --conf spark.streaming.receiver.maxRate=10 > However, a single batch can greatly exceed the established maxRate, e.g. I'm > getting 300 records. > It looks like Kinesis is completely ignoring the > spark.streaming.receiver.maxRate configuration. > If you look inside KinesisReceiver.onStart, you see: > val kinesisClientLibConfiguration = > new KinesisClientLibConfiguration(checkpointAppName, streamName, > awsCredProvider, workerId) > .withKinesisEndpoint(endpointUrl) > .withInitialPositionInStream(initialPositionInStream) > .withTaskBackoffTimeMillis(500) > .withRegionName(regionName) > This constructor ends up calling another constructor which has a lot of > default values for the configuration. One of those values is > DEFAULT_MAX_RECORDS, which is constantly set to 10,000 records. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23294) Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated
Ravinder Matte created SPARK-23294: -- Summary: Spark Streaming + Rate source + Console Sink : Receiver MaxRate is violated Key: SPARK-23294 URL: https://issues.apache.org/jira/browse/SPARK-23294 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 2.0.2 Reporter: Ravinder Matte Assignee: Takeshi Yamamuro Fix For: 2.2.0 I am calling spark-submit passing maxRate; I have a single Kinesis receiver, and batches of 1s: spark-submit --conf spark.streaming.receiver.maxRate=10 However, a single batch can greatly exceed the established maxRate, e.g. I'm getting 300 records. It looks like Kinesis is completely ignoring the spark.streaming.receiver.maxRate configuration. If you look inside KinesisReceiver.onStart, you see: val kinesisClientLibConfiguration = new KinesisClientLibConfiguration(checkpointAppName, streamName, awsCredProvider, workerId) .withKinesisEndpoint(endpointUrl) .withInitialPositionInStream(initialPositionInStream) .withTaskBackoffTimeMillis(500) .withRegionName(regionName) This constructor ends up calling another constructor which has a lot of default values for the configuration. One of those values is DEFAULT_MAX_RECORDS, which is constantly set to 10,000 records. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
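A hedged sketch of the direction the original Kinesis report points at: KinesisClientLibConfiguration exposes withMaxRecords, so the receiver could thread a configured cap into the KCL instead of inheriting the 10,000-record default. The maxRecordsPerFetch value below is an assumed name, and the other identifiers come from the KinesisReceiver snippet quoted above:
{code:scala}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration

val maxRecordsPerFetch = 100  // assumed: derived from spark.streaming.receiver.maxRate
val kinesisClientLibConfiguration =
  new KinesisClientLibConfiguration(checkpointAppName, streamName, awsCredProvider, workerId)
    .withKinesisEndpoint(endpointUrl)
    .withInitialPositionInStream(initialPositionInStream)
    .withTaskBackoffTimeMillis(500)
    .withRegionName(regionName)
    .withMaxRecords(maxRecordsPerFetch)  // override the 10,000-record default
{code}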
[jira] [Updated] (SPARK-23291) SparkR: substr: In a SparkR dataframe, the starting and ending position arguments of "substr" give a wrong result when the position is greater than 1
[ https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-23291: - Shepherd: Felix Cheung (was: Hossein Falaki) > SparkR: substr: In a SparkR dataframe, the starting and ending position > arguments of "substr" give a wrong result when the position is greater > than 1 > -- > > Key: SPARK-23291 > URL: https://issues.apache.org/jira/browse/SPARK-23291 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.1 >Reporter: Narendra >Priority: Major > > Defect description: > - > For example, an input string "2017-12-01" is read into a SparkR dataframe > "df" with column name "col1". > The target is to create a new column named "col2" with the value "12", > which is inside the string. "12" can be extracted with "starting position" > "6" and "ending position" "7" > (the starting position of the first character is considered "1"). > But the code that currently needs to be written is: > > df <- withColumn(df, "col2", substr(df$col1, 7, 8)) > Observe that the first argument of the "substr" API, which indicates the > 'starting position', is given as "7". > Also observe that the second argument of the "substr" API, which indicates > the 'ending position', is given as "8". > I.e., the number that has to be given to indicate a position is > the "actual position + 1". > Expected behavior: > > The code that needs to be written is: > > df <- withColumn(df, "col2", substr(df$col1, 6, 7)) > Note: > --- > This defect is observed only when the starting position is greater than > 1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
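Under the hood, SparkR's substr delegates to the JVM Column.substr, whose second argument is a length rather than an end position, which is a plausible source of the off-by-one described above. For reference, the Scala semantics (1-based start, then length):
{code:scala}
import org.apache.spark.sql.functions.col

// Extract "12" from "2017-12-01": 1-based start position 6, length 2.
val month = col("col1").substr(6, 2)
{code}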
[jira] [Commented] (SPARK-22274) User-defined aggregation functions with pandas udf
[ https://issues.apache.org/jira/browse/SPARK-22274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348033#comment-16348033 ] Apache Spark commented on SPARK-22274: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/20467 > User-defined aggregation functions with pandas udf > -- > > Key: SPARK-22274 > URL: https://issues.apache.org/jira/browse/SPARK-22274 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Li Jin >Assignee: Li Jin >Priority: Major > Fix For: 2.4.0 > > > This function doesn't implement partial aggregation and shuffles all data. A > UDAF that supports partial aggregation is not covered by this Jira. > Example: > {code:java} > @pandas_udf(DoubleType()) > def mean(v): >     return v.mean() > df.groupby('id').apply(mean(df.v1), mean(df.v2)) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23293) data source v2 self join fails
[ https://issues.apache.org/jira/browse/SPARK-23293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348020#comment-16348020 ] Apache Spark commented on SPARK-23293: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/20466 > data source v2 self join fails > -- > > Key: SPARK-23293 > URL: https://issues.apache.org/jira/browse/SPARK-23293 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23293) data source v2 self join fails
[ https://issues.apache.org/jira/browse/SPARK-23293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23293: Assignee: Wenchen Fan (was: Apache Spark) > data source v2 self join fails > -- > > Key: SPARK-23293 > URL: https://issues.apache.org/jira/browse/SPARK-23293 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23293) data source v2 self join fails
[ https://issues.apache.org/jira/browse/SPARK-23293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23293: Assignee: Apache Spark (was: Wenchen Fan) > data source v2 self join fails > -- > > Key: SPARK-23293 > URL: https://issues.apache.org/jira/browse/SPARK-23293 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23293) data source v2 self join fails
Wenchen Fan created SPARK-23293: --- Summary: data source v2 self join fails Key: SPARK-23293 URL: https://issues.apache.org/jira/browse/SPARK-23293 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
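The ticket carries no description, so for context, a minimal repro sketch of the failing shape, assuming a SparkSession spark, a placeholder v2 source class, and a column named i; one plausible cause is that both sides of the self join exposed the same output attribute IDs:
{code:scala}
// Placeholder DataSourceV2 reader; any v2 source exhibits the shape.
val df = spark.read.format("com.example.SimpleDataSourceV2").load()

// Self join over the same v2 relation, the case this ticket addresses.
df.join(df, "i").show()
{code}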
[jira] [Resolved] (SPARK-23188) Make vectorized columnar reader batch size configurable
[ https://issues.apache.org/jira/browse/SPARK-23188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23188. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20361 [https://github.com/apache/spark/pull/20361] > Make vectorized columnar reader batch size configurable > -- > > Key: SPARK-23188 > URL: https://issues.apache.org/jira/browse/SPARK-23188 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
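For anyone looking for the resulting knobs, a usage sketch assuming a SparkSession spark; the config key names below are the ones the linked PR appears to introduce for the Parquet and ORC readers, so treat them as an assumption and verify against your Spark build:
{code:scala}
// Rows per vectorized batch: smaller batches lower peak memory for wide
// schemas at some throughput cost.
spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", 2048)
spark.conf.set("spark.sql.orc.columnarReaderBatchSize", 2048)
{code}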
[jira] [Assigned] (SPARK-23188) Make vectorized columnar reader batch size configurable
[ https://issues.apache.org/jira/browse/SPARK-23188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23188: --- Assignee: Jiang Xingbo > Make vectorized columnar reader batch size configurable > -- > > Key: SPARK-23188 > URL: https://issues.apache.org/jira/browse/SPARK-23188 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23292) python tests related to pandas are skipped
[ https://issues.apache.org/jira/browse/SPARK-23292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348008#comment-16348008 ] Apache Spark commented on SPARK-23292: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/20465 > python tests related to pandas are skipped > -- > > Key: SPARK-23292 > URL: https://issues.apache.org/jira/browse/SPARK-23292 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 >Reporter: Yin Huai >Priority: Blocker > > I was running python tests and found that > [pyspark.sql.tests.GroupbyAggPandasUDFTests.test_unsupported_types|https://github.com/apache/spark/blob/52e00f70663a87b5837235bdf72a3e6f84e11411/python/pyspark/sql/tests.py#L4528-L4548] > does not run with Python 2 because the test uses "assertRaisesRegex" > (supported by Python 3) instead of "assertRaisesRegexp" (supported by Python > 2). However, Spark's Jenkins does not fail because of this issue (see the run > history > [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/]). > After looking into this issue, [it seems the test script skips pandas-related > tests if pandas is not > installed|https://github.com/apache/spark/blob/2ac895be909de7e58e1051dc2a1bba98a25bf4be/python/pyspark/sql/tests.py#L51-L63], > which means that Jenkins does not have pandas installed. > > Since pyarrow-related tests have the same skipping logic, we will need to > check whether Jenkins has pyarrow installed correctly as well. > > Since features using pandas and pyarrow are in 2.3, we should fix the test > issue and make sure all tests pass before we make the release. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23292) python tests related to pandas are skipped
[ https://issues.apache.org/jira/browse/SPARK-23292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23292: Assignee: (was: Apache Spark) > python tests related to pandas are skipped > -- > > Key: SPARK-23292 > URL: https://issues.apache.org/jira/browse/SPARK-23292 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 >Reporter: Yin Huai >Priority: Blocker > > I was running python tests and found that > [pyspark.sql.tests.GroupbyAggPandasUDFTests.test_unsupported_types|https://github.com/apache/spark/blob/52e00f70663a87b5837235bdf72a3e6f84e11411/python/pyspark/sql/tests.py#L4528-L4548] > does not run with Python 2 because the test uses "assertRaisesRegex" > (supported by Python 3) instead of "assertRaisesRegexp" (supported by Python > 2). However, Spark's Jenkins does not fail because of this issue (see the run > history > [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/]). > After looking into this issue, [it seems the test script skips pandas-related > tests if pandas is not > installed|https://github.com/apache/spark/blob/2ac895be909de7e58e1051dc2a1bba98a25bf4be/python/pyspark/sql/tests.py#L51-L63], > which means that Jenkins does not have pandas installed. > > Since pyarrow-related tests have the same skipping logic, we will need to > check whether Jenkins has pyarrow installed correctly as well. > > Since features using pandas and pyarrow are in 2.3, we should fix the test > issue and make sure all tests pass before we make the release. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23292) python tests related to pandas are skipped
[ https://issues.apache.org/jira/browse/SPARK-23292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23292: Assignee: Apache Spark > python tests related to pandas are skipped > -- > > Key: SPARK-23292 > URL: https://issues.apache.org/jira/browse/SPARK-23292 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Blocker > > I was running python tests and found that > [pyspark.sql.tests.GroupbyAggPandasUDFTests.test_unsupported_types|https://github.com/apache/spark/blob/52e00f70663a87b5837235bdf72a3e6f84e11411/python/pyspark/sql/tests.py#L4528-L4548] > does not run with Python 2 because the test uses "assertRaisesRegex" > (supported by Python 3) instead of "assertRaisesRegexp" (supported by Python > 2). However, Spark's Jenkins does not fail because of this issue (see the run > history > [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/]). > After looking into this issue, [it seems the test script skips pandas-related > tests if pandas is not > installed|https://github.com/apache/spark/blob/2ac895be909de7e58e1051dc2a1bba98a25bf4be/python/pyspark/sql/tests.py#L51-L63], > which means that Jenkins does not have pandas installed. > > Since pyarrow-related tests have the same skipping logic, we will need to > check whether Jenkins has pyarrow installed correctly as well. > > Since features using pandas and pyarrow are in 2.3, we should fix the test > issue and make sure all tests pass before we make the release. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10884) Support prediction on single instance for regression and classification related models
[ https://issues.apache.org/jira/browse/SPARK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348009#comment-16348009 ] Weichen Xu commented on SPARK-10884: [~mingma] Sure, but it needs to wait until Spark 2.3 is released. [~yanboliang] > Support prediction on single instance for regression and classification > related models > -- > > Key: SPARK-10884 > URL: https://issues.apache.org/jira/browse/SPARK-10884 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Major > Labels: 2.2.0 > > Support prediction on single instance for regression and classification > related models (i.e., PredictionModel, ClassificationModel and their > subclasses). > Add corresponding test cases. > See parent issue for more details. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23281) Query produces results in incorrect order when a composite order by clause refers to both original columns and aliases
[ https://issues.apache.org/jira/browse/SPARK-23281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-23281: --- Assignee: Dilip Biswal > Query produces results in incorrect order when a composite order by clause > refers to both original columns and aliases > -- > > Key: SPARK-23281 > URL: https://issues.apache.org/jira/browse/SPARK-23281 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dilip Biswal >Assignee: Dilip Biswal >Priority: Major > Fix For: 2.2.2, 2.3.0 > > > Here is the test snippet. > {code} > scala> Seq[(Integer, Integer)]( > | (1, 1), > | (1, 3), > | (2, 3), > | (3, 3), > | (4, null), > | (5, null) > | ).toDF("key", "value").createOrReplaceTempView("src") > scala> sql( > | """ > | |SELECT MAX(value) as value, key as col2 > | |FROM src > | |GROUP BY key > | |ORDER BY value desc, key > | """.stripMargin).show > +-----+----+ > |value|col2| > +-----+----+ > |    3|   3| > |    3|   2| > |    3|   1| > | null|   5| > | null|   4| > +-----+----+ > {code} > Here is the explain output: > {code} > == Parsed Logical Plan == > 'Sort ['value DESC NULLS LAST, 'key ASC NULLS FIRST], true > +- 'Aggregate ['key], ['MAX('value) AS value#9, 'key AS col2#10] >+- 'UnresolvedRelation `src` > == Analyzed Logical Plan == > value: int, col2: int > Project [value#9, col2#10] > +- Sort [value#9 DESC NULLS LAST, col2#10 DESC NULLS LAST], true >+- Aggregate [key#5], [max(value#6) AS value#9, key#5 AS col2#10] > +- SubqueryAlias src > +- Project [_1#2 AS key#5, _2#3 AS value#6] > +- LocalRelation [_1#2, _2#3] > {code} > The sort direction should be ascending for the 2nd column. Instead, it's being > changed > to descending in Analyzer.resolveAggregateFunctions. > The above test case models TPCDS-Q71, and thus we have the same issue in Q71 as > well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23281) Query produces results in incorrect order when a composite order by clause refers to both original columns and aliases
[ https://issues.apache.org/jira/browse/SPARK-23281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23281. - Resolution: Fixed Fix Version/s: 2.3.0 2.2.2 > Query produces results in incorrect order when a composite order by clause > refers to both original columns and aliases > -- > > Key: SPARK-23281 > URL: https://issues.apache.org/jira/browse/SPARK-23281 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dilip Biswal >Priority: Major > Fix For: 2.2.2, 2.3.0 > > > Here is the test snippet. > {code} > scala> Seq[(Integer, Integer)]( > | (1, 1), > | (1, 3), > | (2, 3), > | (3, 3), > | (4, null), > | (5, null) > | ).toDF("key", "value").createOrReplaceTempView("src") > scala> sql( > | """ > | |SELECT MAX(value) as value, key as col2 > | |FROM src > | |GROUP BY key > | |ORDER BY value desc, key > | """.stripMargin).show > +-----+----+ > |value|col2| > +-----+----+ > |    3|   3| > |    3|   2| > |    3|   1| > | null|   5| > | null|   4| > +-----+----+ > {code} > Here is the explain output: > {code} > == Parsed Logical Plan == > 'Sort ['value DESC NULLS LAST, 'key ASC NULLS FIRST], true > +- 'Aggregate ['key], ['MAX('value) AS value#9, 'key AS col2#10] >+- 'UnresolvedRelation `src` > == Analyzed Logical Plan == > value: int, col2: int > Project [value#9, col2#10] > +- Sort [value#9 DESC NULLS LAST, col2#10 DESC NULLS LAST], true >+- Aggregate [key#5], [max(value#6) AS value#9, key#5 AS col2#10] > +- SubqueryAlias src > +- Project [_1#2 AS key#5, _2#3 AS value#6] > +- LocalRelation [_1#2, _2#3] > {code} > The sort direction should be ascending for the 2nd column. Instead, it's being > changed > to descending in Analyzer.resolveAggregateFunctions. > The above test case models TPCDS-Q71, and thus we have the same issue in Q71 as > well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21396) Spark Hive Thriftserver doesn't return UDT field
[ https://issues.apache.org/jira/browse/SPARK-21396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-21396. - Resolution: Fixed Fix Version/s: 2.3.0 > Spark Hive Thriftserver doesn't return UDT field > > > Key: SPARK-21396 > URL: https://issues.apache.org/jira/browse/SPARK-21396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Haopu Wang >Assignee: Ken Tore Tallakstad >Priority: Major > Labels: Hive, ThriftServer2, user-defined-type > Fix For: 2.3.0 > > > I want to query a table with a MLLib Vector field and get below exception. > Can Spark Hive Thriftserver be enhanced to return UDT field? > == > 2017-07-13 13:14:25,435 WARN > [org.apache.hive.service.cli.thrift.ThriftCLIService] > (HiveServer2-Handler-Pool: Thread-18537;) Error fetching results: > java.lang.RuntimeException: scala.MatchError: > org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 (of class > org.apache.spark.ml.linalg.VectorUDT) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83) > at > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36) > at > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59) > at com.sun.proxy.$Proxy29.fetchResults(Unknown Source) > at > org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: scala.MatchError: org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 > (of class org.apache.spark.ml.linalg.VectorUDT) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.addNonNullColumnValue(SparkExecuteStatementOperation.scala:80) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(SparkExecuteStatementOperation.scala:144) > at > org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:220) > at > org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:685) > at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78) > ... 
18 more -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
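The scala.MatchError above originates in SparkExecuteStatementOperation.addNonNullColumnValue, where a match over Catalyst data types has no case for user-defined types such as VectorUDT. A minimal sketch of the kind of fallback the fix needs, assuming UDT values may simply be rendered as strings (an illustration, not the merged patch):
{code}
import org.apache.spark.sql.types._

// Sketch: when converting a column value for the Thrift protocol, fall back to
// a string rendering for UDT columns instead of letting the type match fail.
// Note: UserDefinedType is not public API in Spark 2.x, so a match like this
// has to live inside Spark's own code.
def toThriftValue(value: Any, dataType: DataType): Any = dataType match {
  case IntegerType => value.asInstanceOf[Int]
  case DoubleType  => value.asInstanceOf[Double]
  case StringType  => value.asInstanceOf[String]
  case _: UserDefinedType[_] => value.toString // e.g. an MLlib Vector column
  case _ => value.toString
}
{code}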
[jira] [Assigned] (SPARK-21396) Spark Hive Thriftserver doesn't return UDT field
[ https://issues.apache.org/jira/browse/SPARK-21396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-21396: --- Assignee: Ken Tore Tallakstad > Spark Hive Thriftserver doesn't return UDT field > > > Key: SPARK-21396 > URL: https://issues.apache.org/jira/browse/SPARK-21396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Haopu Wang >Assignee: Ken Tore Tallakstad >Priority: Major > Labels: Hive, ThriftServer2, user-defined-type > Fix For: 2.3.0 > > > I want to query a table with a MLLib Vector field and get below exception. > Can Spark Hive Thriftserver be enhanced to return UDT field? > == > 2017-07-13 13:14:25,435 WARN > [org.apache.hive.service.cli.thrift.ThriftCLIService] > (HiveServer2-Handler-Pool: Thread-18537;) Error fetching results: > java.lang.RuntimeException: scala.MatchError: > org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 (of class > org.apache.spark.ml.linalg.VectorUDT) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83) > at > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36) > at > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59) > at com.sun.proxy.$Proxy29.fetchResults(Unknown Source) > at > org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: scala.MatchError: org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 > (of class org.apache.spark.ml.linalg.VectorUDT) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.addNonNullColumnValue(SparkExecuteStatementOperation.scala:80) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(SparkExecuteStatementOperation.scala:144) > at > org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:220) > at > org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:685) > at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78) > ... 
18 more -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23268) Reorganize packages in data source V2
[ https://issues.apache.org/jira/browse/SPARK-23268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23268. - Resolution: Fixed Assignee: Gengliang Wang Fix Version/s: 2.3.0 > Reorganize packages in data source V2 > - > > Key: SPARK-23268 > URL: https://issues.apache.org/jira/browse/SPARK-23268 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.3.0 > > > 1. create a new package for partitioning/distribution related classes > 2. move streaming related class to package > org.apache.spark.sql.sources.v2.reader/writer.streaming, instead of > org.apache.spark.sql.sources.v2.streaming.reader/writer -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
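To make the second point concrete, the move changes import paths along these lines (ContinuousReader is used here only as an example of an affected streaming class; the exact set of moved classes depends on the final PR):
{code}
// before the reorganization
import org.apache.spark.sql.sources.v2.streaming.reader.ContinuousReader
// after the reorganization
import org.apache.spark.sql.sources.v2.reader.streaming.ContinuousReader
{code}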
[jira] [Created] (SPARK-23292) python tests related to pandas are skipped
Yin Huai created SPARK-23292: Summary: python tests related to pandas are skipped Key: SPARK-23292 URL: https://issues.apache.org/jira/browse/SPARK-23292 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.3.0 Reporter: Yin Huai I was running the Python tests and found that [pyspark.sql.tests.GroupbyAggPandasUDFTests.test_unsupported_types|https://github.com/apache/spark/blob/52e00f70663a87b5837235bdf72a3e6f84e11411/python/pyspark/sql/tests.py#L4528-L4548] does not run with Python 2 because the test uses "assertRaisesRegex" (supported by Python 3) instead of "assertRaisesRegexp" (supported by Python 2). However, the Spark Jenkins build does not fail because of this issue (see the run history [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/]). After looking into this, [it appears the test script skips pandas-related tests when pandas is not installed|https://github.com/apache/spark/blob/2ac895be909de7e58e1051dc2a1bba98a25bf4be/python/pyspark/sql/tests.py#L51-L63], which means that Jenkins does not have pandas installed. Since the pyarrow-related tests have the same skipping logic, we will also need to check whether Jenkins has pyarrow installed correctly. Since the features using pandas and pyarrow are in 2.3, we should fix the test issue and make sure all tests pass before we make the release. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23247) combines Unsafe operations and statistics operations in Scan Data Source
[ https://issues.apache.org/jira/browse/SPARK-23247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23247. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20415 [https://github.com/apache/spark/pull/20415]
> combines Unsafe operations and statistics operations in Scan Data Source
> Key: SPARK-23247
> URL: https://issues.apache.org/jira/browse/SPARK-23247
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: caoxuewen
> Assignee: caoxuewen
> Priority: Major
> Fix For: 2.4.0
>
> Currently, when scanning a data source, the execution plan first applies the unsafe projection to each row and then traverses the data a second time to count the rows. In terms of performance, the second pass is unnecessary; this PR combines the two operations and counts the rows while performing the unsafe projection.
> *Before the change:*
> {code}
> val unsafeRow = rdd.mapPartitionsWithIndexInternal { (index, iter) =>
>   val proj = UnsafeProjection.create(schema)
>   proj.initialize(index)
>   iter.map(proj)
> }
> val numOutputRows = longMetric("numOutputRows")
> unsafeRow.map { r =>
>   numOutputRows += 1
>   r
> }
> {code}
> *After the change:*
> {code}
> val numOutputRows = longMetric("numOutputRows")
> rdd.mapPartitionsWithIndexInternal { (index, iter) =>
>   val proj = UnsafeProjection.create(schema)
>   proj.initialize(index)
>   iter.map { r =>
>     numOutputRows += 1
>     proj(r)
>   }
> }
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23247) combines Unsafe operations and statistics operations in Scan Data Source
[ https://issues.apache.org/jira/browse/SPARK-23247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23247: --- Assignee: caoxuewen
> combines Unsafe operations and statistics operations in Scan Data Source
> Key: SPARK-23247
> URL: https://issues.apache.org/jira/browse/SPARK-23247
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: caoxuewen
> Assignee: caoxuewen
> Priority: Major
>
> Currently, when scanning a data source, the execution plan first applies the unsafe projection to each row and then traverses the data a second time to count the rows. In terms of performance, the second pass is unnecessary; this PR combines the two operations and counts the rows while performing the unsafe projection.
> *Before the change:*
> {code}
> val unsafeRow = rdd.mapPartitionsWithIndexInternal { (index, iter) =>
>   val proj = UnsafeProjection.create(schema)
>   proj.initialize(index)
>   iter.map(proj)
> }
> val numOutputRows = longMetric("numOutputRows")
> unsafeRow.map { r =>
>   numOutputRows += 1
>   r
> }
> {code}
> *After the change:*
> {code}
> val numOutputRows = longMetric("numOutputRows")
> rdd.mapPartitionsWithIndexInternal { (index, iter) =>
>   val proj = UnsafeProjection.create(schema)
>   proj.initialize(index)
>   iter.map { r =>
>     numOutputRows += 1
>     proj(r)
>   }
> }
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347982#comment-16347982 ] Gaurav Garg commented on SPARK-18016: - [~kiszk], I have Spark 2.2.0 environment, will have to co-ordinate admin team for update, but I have tried with updated jar dependencies in my code. Is that not fine ? > Code Generation: Constant Pool Past Limit for Wide/Nested Dataset > - > > Key: SPARK-18016 > URL: https://issues.apache.org/jira/browse/SPARK-18016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Aleksander Eskilson >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.3.0 > > Attachments: 910825_9.zip > > > When attempting to encode collections of large Java objects to Datasets > having very wide or deeply nested schemas, code generation can fail, yielding: > {code} > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for > class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection > has grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) > at > org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) > at > org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) > at > org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309) > at
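For anyone trying to reproduce this class of failure, a very wide schema is the easiest trigger. A rough sketch (the column count is illustrative; the exact threshold depends on the Spark version and the shape of the generated code):
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().master("local[*]").appName("wide-schema").getOrCreate()

// Build a DataFrame with thousands of columns; on affected versions the
// generated SpecificUnsafeProjection can grow past the 64K constant pool limit.
val base = spark.range(10).toDF("c0")
val wide = (1 to 4000).foldLeft(base)((df, i) => df.withColumn(s"c$i", lit(i)))
wide.collect()
{code}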
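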
[jira] [Resolved] (SPARK-23280) add map type support to ColumnVector
[ https://issues.apache.org/jira/browse/SPARK-23280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23280. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20450 [https://github.com/apache/spark/pull/20450] > add map type support to ColumnVector > > > Key: SPARK-23280 > URL: https://issues.apache.org/jira/browse/SPARK-23280 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1
[ https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347967#comment-16347967 ] Apache Spark commented on SPARK-23291: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/20464
> SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1
> Key: SPARK-23291
> URL: https://issues.apache.org/jira/browse/SPARK-23291
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 2.2.1
> Reporter: Narendra
> Priority: Major
>
> Defect Description:
> For example, an input string "2017-12-01" is read into a SparkR dataframe "df" with column name "col1".
> The target is to create a new column named "col2" with the value "12", which is inside the string. "12" can be extracted with "starting position" "6" and "ending position" "7"
> (the starting position of the first character is considered to be "1").
> But the code that currently has to be written is:
> df <- withColumn(df, "col2", substr(df$col1, 7, 8))
> Observe that the first argument of the "substr" API, which indicates the 'starting position', is given as "7".
> Also observe that the second argument, which indicates the 'ending position', is given as "8".
> I.e., the number passed to indicate a position has to be the "actual position + 1".
> Expected behavior:
> The code that should be needed is:
> df <- withColumn(df, "col2", substr(df$col1, 6, 7))
> Note:
> This defect is observed only when the starting position is greater than 1.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1
[ https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23291: Assignee: (was: Apache Spark)
> SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1
> Key: SPARK-23291
> URL: https://issues.apache.org/jira/browse/SPARK-23291
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 2.2.1
> Reporter: Narendra
> Priority: Major
>
> Defect Description:
> For example, an input string "2017-12-01" is read into a SparkR dataframe "df" with column name "col1".
> The target is to create a new column named "col2" with the value "12", which is inside the string. "12" can be extracted with "starting position" "6" and "ending position" "7"
> (the starting position of the first character is considered to be "1").
> But the code that currently has to be written is:
> df <- withColumn(df, "col2", substr(df$col1, 7, 8))
> Observe that the first argument of the "substr" API, which indicates the 'starting position', is given as "7".
> Also observe that the second argument, which indicates the 'ending position', is given as "8".
> I.e., the number passed to indicate a position has to be the "actual position + 1".
> Expected behavior:
> The code that should be needed is:
> df <- withColumn(df, "col2", substr(df$col1, 6, 7))
> Note:
> This defect is observed only when the starting position is greater than 1.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1
[ https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23291: Assignee: Apache Spark
> SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1
> Key: SPARK-23291
> URL: https://issues.apache.org/jira/browse/SPARK-23291
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 2.2.1
> Reporter: Narendra
> Assignee: Apache Spark
> Priority: Major
>
> Defect Description:
> For example, an input string "2017-12-01" is read into a SparkR dataframe "df" with column name "col1".
> The target is to create a new column named "col2" with the value "12", which is inside the string. "12" can be extracted with "starting position" "6" and "ending position" "7"
> (the starting position of the first character is considered to be "1").
> But the code that currently has to be written is:
> df <- withColumn(df, "col2", substr(df$col1, 7, 8))
> Observe that the first argument of the "substr" API, which indicates the 'starting position', is given as "7".
> Also observe that the second argument, which indicates the 'ending position', is given as "8".
> I.e., the number passed to indicate a position has to be the "actual position + 1".
> Expected behavior:
> The code that should be needed is:
> df <- withColumn(df, "col2", substr(df$col1, 6, 7))
> Note:
> This defect is observed only when the starting position is greater than 1.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version
[ https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347922#comment-16347922 ] Dongjoon Hyun commented on SPARK-14492: --- Hi, [~smilegator]. According to the doc and code, do we officially support old HMS like 0.14.0? Today, I tried `--conf spark.hadoop.hive.metastore.uris=thrift://xxx:9083 --conf spark.sql.hive.metastore.version=0.14.0 --conf spark.sql.hive.metastore.jars=/xxx/yyy/*`, but I met the following errors. {code} 18/02/01 02:07:00 WARN hive.metastore: set_ugi() not successful, Likely cause: new client talking to old server. Continuing without it. ... org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129) {code} > Spark SQL 1.6.0 does not work with external Hive metastore version lower than > 1.2.0; its not backwards compatible with earlier version > -- > > Key: SPARK-14492 > URL: https://issues.apache.org/jira/browse/SPARK-14492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunil Rangwani >Priority: Critical > > Spark SQL when configured with a Hive version lower than 1.2.0 throws a > java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME > because this field was introduced in Hive 1.2.0 so its not possible to use > Hive metastore version lower than 1.2.0 with Spark. The details of the Hive > changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 > {code:java} > Exception in thread "main" java.lang.NoSuchFieldError: > METASTORE_CLIENT_SOCKET_LIFETIME > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:271) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
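For reference, the intended way to talk to an older metastore is not to compile against older Hive classes but to pin the metastore client version and jars, as in the sketch below (the jar path is a placeholder, and whether versions as old as 0.14.0 actually work is exactly what the comment above is questioning):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("old-metastore")
  // version must be in the range Spark documents as supported
  .config("spark.sql.hive.metastore.version", "1.2.0")
  // placeholder path to the matching Hive client jars
  .config("spark.sql.hive.metastore.jars", "/path/to/hive/lib/*")
  .enableHiveSupport()
  .getOrCreate()
{code}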
[jira] [Reopened] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-23157: -- > withColumn fails for a column that is a result of mapped DataSet > > > Key: SPARK-23157 > URL: https://issues.apache.org/jira/browse/SPARK-23157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Tomasz Bartczak >Priority: Minor > Fix For: 2.3.0 > > > Having > {code:java} > case class R(id: String) > val ds = spark.createDataset(Seq(R("1"))) > {code} > This works: > {code} > scala> ds.withColumn("n", ds.col("id")) > res16: org.apache.spark.sql.DataFrame = [id: string, n: string] > {code} > but when we map over ds it fails: > {code} > scala> ds.withColumn("n", ds.map(a => a).col("id")) > org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing > from id#4 in operator !Project [id#4, id#55 AS n#57];; > !Project [id#4, id#55 AS n#57] > +- LocalRelation [id#4] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1150) > at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-23157: Assignee: Henry Robinson > withColumn fails for a column that is a result of mapped DataSet > > > Key: SPARK-23157 > URL: https://issues.apache.org/jira/browse/SPARK-23157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Tomasz Bartczak >Assignee: Henry Robinson >Priority: Minor > Fix For: 2.3.0 > > > Having > {code:java} > case class R(id: String) > val ds = spark.createDataset(Seq(R("1"))) > {code} > This works: > {code} > scala> ds.withColumn("n", ds.col("id")) > res16: org.apache.spark.sql.DataFrame = [id: string, n: string] > {code} > but when we map over ds it fails: > {code} > scala> ds.withColumn("n", ds.map(a => a).col("id")) > org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing > from id#4 in operator !Project [id#4, id#55 AS n#57];; > !Project [id#4, id#55 AS n#57] > +- LocalRelation [id#4] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1150) > at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23157. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20443 [https://github.com/apache/spark/pull/20443] > withColumn fails for a column that is a result of mapped DataSet > > > Key: SPARK-23157 > URL: https://issues.apache.org/jira/browse/SPARK-23157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Tomasz Bartczak >Assignee: Henry Robinson >Priority: Minor > Fix For: 2.3.0 > > > Having > {code:java} > case class R(id: String) > val ds = spark.createDataset(Seq(R("1"))) > {code} > This works: > {code} > scala> ds.withColumn("n", ds.col("id")) > res16: org.apache.spark.sql.DataFrame = [id: string, n: string] > {code} > but when we map over ds it fails: > {code} > scala> ds.withColumn("n", ds.map(a => a).col("id")) > org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing > from id#4 in operator !Project [id#4, id#55 AS n#57];; > !Project [id#4, id#55 AS n#57] > +- LocalRelation [id#4] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1150) > at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
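For users hitting this before picking up the fix, a workaround consistent with the analysis error above is to resolve the column against the same Dataset that the map produced, rather than against the original one:
{code}
import spark.implicits._ // spark: an existing SparkSession, as in spark-shell

case class R(id: String)
val ds = spark.createDataset(Seq(R("1")))

// Resolving "id" on the mapped Dataset itself keeps the attribute and the
// plan consistent, so the analyzer can resolve it:
val mapped = ds.map(a => a)
mapped.withColumn("n", mapped.col("id"))
{code}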
[jira] [Commented] (SPARK-10884) Support prediction on single instance for regression and classification related models
[ https://issues.apache.org/jira/browse/SPARK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347881#comment-16347881 ] Ming Ma commented on SPARK-10884: - Thanks [~WeichenXu123] and [~yanboliang]. While the long-term goal is to support "Pipeline for single instance" functionality, this specific patch is still quite useful. Any chance we can get it into the master branch soon? > Support prediction on single instance for regression and classification > related models > -- > > Key: SPARK-10884 > URL: https://issues.apache.org/jira/browse/SPARK-10884 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Major > Labels: 2.2.0 > > Support prediction on single instance for regression and classification > related models (i.e., PredictionModel, ClassificationModel and their sub > classes). > Add corresponding test cases. > See parent issue for more details. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
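For context, the patch under discussion lets a fitted model score a single feature vector directly, without wrapping it in a one-row DataFrame. Roughly the shape of the API being discussed (the exact method visibility and signature depend on the final patch):
{code}
import org.apache.spark.ml.linalg.Vectors

// model: a fitted PredictionModel, e.g. a LogisticRegressionModel
// Hypothetical single-instance call, bypassing transform() on a DataFrame:
val prediction: Double = model.predict(Vectors.dense(0.5, 1.2, -0.3))
{code}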
[jira] [Created] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1
Narendra created SPARK-23291: Summary: SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1 Key: SPARK-23291 URL: https://issues.apache.org/jira/browse/SPARK-23291 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.2.1 Reporter: Narendra Defect Description: For example, an input string "2017-12-01" is read into a SparkR dataframe "df" with column name "col1". The target is to create a new column named "col2" with the value "12", which is inside the string. "12" can be extracted with "starting position" "6" and "ending position" "7" (the starting position of the first character is considered to be "1"). But the code that currently has to be written is: df <- withColumn(df, "col2", substr(df$col1, 7, 8)) Observe that the first argument of the "substr" API, which indicates the 'starting position', is given as "7". Also observe that the second argument, which indicates the 'ending position', is given as "8". I.e., the number passed to indicate a position has to be the "actual position + 1". Expected behavior: The code that should be needed is: df <- withColumn(df, "col2", substr(df$col1, 6, 7)) Note: This defect is observed only when the starting position is greater than 1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23251) ClassNotFoundException: scala.Any when there's a missing implicit Map encoder
[ https://issues.apache.org/jira/browse/SPARK-23251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347808#comment-16347808 ] Michal Šenkýř commented on SPARK-23251: --- Yes, this does seem like the Map encoder is not checking whether appropriate encoders exist for the key and value types. I think I couldn't get the compiler to resolve it if I added the appropriate typeclass checks and wanted to support subclasses of the collection type at the same time. I will check in the following few days whether that was the case and try to figure out some alternative. > ClassNotFoundException: scala.Any when there's a missing implicit Map encoder > - > > Key: SPARK-23251 > URL: https://issues.apache.org/jira/browse/SPARK-23251 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: mac os high sierra, centos 7 >Reporter: Bruce Robbins >Priority: Minor > > In branch-2.2, when you attempt to use row.getValuesMap[Any] without an > implicit Map encoder, you get a nice descriptive compile-time error: > {noformat} > scala> df.map(row => row.getValuesMap[Any](List("stationName", > "year"))).collect > :26: error: Unable to find encoder for type stored in a Dataset. > Primitive types (Int, String, etc) and Product types (case classes) are > supported by importing spark.implicits._ Support for serializing other types > will be added in future releases. > df.map(row => row.getValuesMap[Any](List("stationName", > "year"))).collect > ^ > scala> implicit val mapEncoder = > org.apache.spark.sql.Encoders.kryo[Map[String, Any]] > mapEncoder: org.apache.spark.sql.Encoder[Map[String,Any]] = class[value[0]: > binary] > scala> df.map(row => row.getValuesMap[Any](List("stationName", > "year"))).collect > res1: Array[Map[String,Any]] = Array(Map(stationName -> 007026 9, year -> > 2014), Map(stationName -> 007026 9, year -> 2014), Map(stationName -> > 007026 9, year -> 2014), > etc... 
> {noformat} > > On the latest master and also on branch-2.3, the transformation compiles (at > least on spark-shell), but throws a ClassNotFoundException: > > {noformat} > scala> df.map(row => row.getValuesMap[Any](List("stationName", > "year"))).collect > java.lang.ClassNotFoundException: scala.Any > at > scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > scala.reflect.runtime.JavaMirrors$JavaMirror.javaClass(JavaMirrors.scala:555) > at > scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1211) > at > scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1203) > at > scala.reflect.runtime.TwoWayCaches$TwoWayCache$$anonfun$toJava$1.apply(TwoWayCaches.scala:49) > at scala.reflect.runtime.Gil$class.gilSynchronized(Gil.scala:19) > at scala.reflect.runtime.JavaUniverse.gilSynchronized(JavaUniverse.scala:16) > at > scala.reflect.runtime.TwoWayCaches$TwoWayCache.toJava(TwoWayCaches.scala:44) > at > scala.reflect.runtime.JavaMirrors$JavaMirror.classToJava(JavaMirrors.scala:1203) > at > scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:194) > at > scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:54) > at > org.apache.spark.sql.catalyst.ScalaReflection$.getClassFromType(ScalaReflection.scala:700) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor$1.apply(ScalaReflection.scala:84) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor$1.apply(ScalaReflection.scala:65) > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) > at > org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor(ScalaReflection.scala:64) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:512) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445) > at >
[jira] [Created] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe
Andre Menck created SPARK-23290: --- Summary: inadvertent change in handling of DateType when converting to pandas dataframe Key: SPARK-23290 URL: https://issues.apache.org/jira/browse/SPARK-23290 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.0 Reporter: Andre Menck In [this PR|https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968] there was a change in how `DateType` is being returned to users (line 1968 in dataframe.py). This can cause client code to fail, as in the following example from a python terminal:
{code:python}
>>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
>>> pdf.dtypes
date    object
num      int64
dtype: object
>>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date())
0    2015-01-01
Name: date, dtype: object
>>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
>>> pdf.dtypes
date    object
num      int64
dtype: object
>>> pdf['date'] = pd.to_datetime(pdf['date'])
>>> pdf.dtypes
date    datetime64[ns]
num              int64
dtype: object
>>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/amenck/anaconda2/lib/python2.7/site-packages/pandas/core/series.py", line 2355, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/src/inference.pyx", line 1574, in pandas._libs.lib.map_infer
  File "<stdin>", line 1, in <lambda>
TypeError: strptime() argument 1 must be string, not Timestamp
{code}
Above we show both the old behavior (returning an "object" col) and the new behavior (returning a datetime column). Since there may be user code relying on the old behavior, I'd suggest reverting this specific part of this change. Also note that the NOTE on the docstring for the "_to_corrected_pandas_type" seems to be off, referring to the old behavior and not the current one. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19209) "No suitable driver" on first try
[ https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347700#comment-16347700 ] Tony Xu edited comment on SPARK-19209 at 1/31/18 10:26 PM: --- This seems to be a forgotten issue but I'm still experiencing it in Spark 2.2.1 Could this issue be related to the Driver itself? For example, I tried using the MySQL JDBC driver and that seems to work fine on the first try. However, when I try using Snowflake's JDBC driver, I run into this exact issue. I'm not sure what the difference between these two drivers are but it might be worth digging into was (Author: txu0393): This seems like a forgotten issue but I'm still experiencing it in Spark 2.2.1 Could this issue be related to the Driver itself? For example, I tried using the MySQL JDBC driver and that seems to work fine on the first try. However, when I try using Snowflake's JDBC driver, I run into this exact issue. I'm not sure what the difference between these two drivers are but it might be worth digging into > "No suitable driver" on first try > - > > Key: SPARK-19209 > URL: https://issues.apache.org/jira/browse/SPARK-19209 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Daniel Darabos >Priority: Critical > > This is a regression from Spark 2.0.2. Observe! > {code} > $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar > --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar > [...] > scala> spark.read.format("jdbc").option("url", > "jdbc:sqlite:").option("dbtable", "x").load > java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such > table: x) > {code} > This is the "good" exception. Now with Spark 2.1.0: > {code} > $ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar > --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar > [...] > scala> spark.read.format("jdbc").option("url", > "jdbc:sqlite:").option("dbtable", "x").load > java.sql.SQLException: No suitable driver > at java.sql.DriverManager.getDriver(DriverManager.java:315) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125) > ... 48 elided > scala> spark.read.format("jdbc").option("url", > "jdbc:sqlite:").option("dbtable", "x").load > java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such > table: x) > {code} > Simply re-executing the same command a second time "fixes" the {{No suitable > driver}} error. > My guess is this is fallout from https://github.com/apache/spark/pull/15292 > which changed the JDBC driver management code. But this code is so hard to > understand for me, I could be totally wrong. > This is nothing more than a nuisance for {{spark-shell}} usage, but it is > more painful to work around for applications. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19209) "No suitable driver" on first try
[ https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347700#comment-16347700 ] Tony Xu commented on SPARK-19209: - This seems like a forgotten issue but I'm still experiencing it in Spark 2.2.1 Could this issue be related to the Driver itself? For example, I tried using the MySQL JDBC driver and that seems to work fine on the first try. However, when I try using Snowflake's JDBC driver, I run into this exact issue. I'm not sure what the difference between these two drivers are but it might be worth digging into > "No suitable driver" on first try > - > > Key: SPARK-19209 > URL: https://issues.apache.org/jira/browse/SPARK-19209 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Daniel Darabos >Priority: Critical > > This is a regression from Spark 2.0.2. Observe! > {code} > $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar > --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar > [...] > scala> spark.read.format("jdbc").option("url", > "jdbc:sqlite:").option("dbtable", "x").load > java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such > table: x) > {code} > This is the "good" exception. Now with Spark 2.1.0: > {code} > $ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar > --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar > [...] > scala> spark.read.format("jdbc").option("url", > "jdbc:sqlite:").option("dbtable", "x").load > java.sql.SQLException: No suitable driver > at java.sql.DriverManager.getDriver(DriverManager.java:315) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:83) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125) > ... 48 elided > scala> spark.read.format("jdbc").option("url", > "jdbc:sqlite:").option("dbtable", "x").load > java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such > table: x) > {code} > Simply re-executing the same command a second time "fixes" the {{No suitable > driver}} error. > My guess is this is fallout from https://github.com/apache/spark/pull/15292 > which changed the JDBC driver management code. But this code is so hard to > understand for me, I could be totally wrong. > This is nothing more than a nuisance for {{spark-shell}} usage, but it is > more painful to work around for applications. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
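A workaround that can sidestep the first-try DriverManager lookup is to name the driver class explicitly via the JDBC source's driver option (shown here for the SQLite example above):
{code}
spark.read.format("jdbc")
  .option("url", "jdbc:sqlite:")
  .option("driver", "org.sqlite.JDBC") // avoids relying on DriverManager.getDriver
  .option("dbtable", "x")
  .load()
{code}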
[jira] [Commented] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
[ https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347684#comment-16347684 ] Apache Spark commented on SPARK-23020: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/20462 > Re-enable Flaky Test: > org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher > > > Key: SPARK-23020 > URL: https://issues.apache.org/jira/browse/SPARK-23020 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.0 >Reporter: Sameer Agarwal >Assignee: Marcelo Vanzin >Priority: Blocker > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23289) OneForOneBlockFetcher.DownloadCallback.onData may write just a part of data
[ https://issues.apache.org/jira/browse/SPARK-23289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23289: Assignee: Apache Spark (was: Shixiong Zhu) > OneForOneBlockFetcher.DownloadCallback.onData may write just a part of data > --- > > Key: SPARK-23289 > URL: https://issues.apache.org/jira/browse/SPARK-23289 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23289) OneForOneBlockFetcher.DownloadCallback.onData may write just a part of data
[ https://issues.apache.org/jira/browse/SPARK-23289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347679#comment-16347679 ] Apache Spark commented on SPARK-23289: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/20461 > OneForOneBlockFetcher.DownloadCallback.onData may write just a part of data > --- > > Key: SPARK-23289 > URL: https://issues.apache.org/jira/browse/SPARK-23289 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23289) OneForOneBlockFetcher.DownloadCallback.onData may write just a part of data
[ https://issues.apache.org/jira/browse/SPARK-23289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23289: Assignee: Shixiong Zhu (was: Apache Spark) > OneForOneBlockFetcher.DownloadCallback.onData may write just a part of data > --- > > Key: SPARK-23289 > URL: https://issues.apache.org/jira/browse/SPARK-23289 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
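The ticket carries no description, but the title points at a standard NIO pitfall: a single WritableByteChannel.write call may consume only part of a ByteBuffer, so a download callback must loop until the buffer is drained. A generic sketch of the correct pattern (not the actual Spark patch):
{code}
import java.nio.ByteBuffer
import java.nio.channels.WritableByteChannel

// write() may return having written fewer bytes than buf.remaining(),
// so keep writing until nothing remains.
def writeFully(channel: WritableByteChannel, buf: ByteBuffer): Unit = {
  while (buf.hasRemaining) {
    channel.write(buf)
  }
}
{code}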
[jira] [Updated] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Edwina Lu updated SPARK-23206: -- Attachment: StageTab.png
> Additional Memory Tuning Metrics
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.2.1
> Reporter: Edwina Lu
> Priority: Major
> Attachments: ExecutorsTab.png, MemoryTuningMetricsDesignDoc.pdf, StageTab.png
>
> At LinkedIn, we have multiple clusters, running thousands of Spark applications, and these numbers are growing rapidly. We need to ensure that these Spark applications are well tuned – cluster resources, including memory, should be used efficiently so that the cluster can support running more applications concurrently, and applications should run quickly and reliably.
> Currently there is limited visibility into how much memory executors are using, and users are guessing numbers for executor and driver memory sizing. These estimates are often much larger than needed, leading to memory wastage. Examining the metrics for one cluster for a month, the average percentage of used executor memory (max JVM used memory across executors / spark.executor.memory) is 35%, leading to an average of 591GB unused memory per application (number of executors * (spark.executor.memory - max JVM used memory)). Spark has multiple memory regions (user memory, execution memory, storage memory, and overhead memory), and to understand how memory is being used and fine-tune allocation between regions, it would be useful to have information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and different memory regions, the following additional memory metrics can be tracked for each executor and driver:
> * JVM used memory: the JVM heap size for the executor/driver.
> * Execution memory: memory used for computation in shuffles, joins, sorts and aggregations.
> * Storage memory: memory used for caching and propagating internal data across the cluster.
> * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and also per stage. This information can be shown in the Spark UI and the REST APIs. Information for peak JVM used memory can help with determining appropriate values for spark.executor.memory and spark.driver.memory, and information about the unified memory region can help with determining appropriate values for spark.memory.fraction and spark.memory.storageFraction. Stage memory information can help identify which stages are most memory-intensive, and users can look into the relevant code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, execution memory and storage memory to the heartbeat. SparkListeners are modified to collect the new metrics for the executors, stages and Spark history log. Only interesting values (peak values per stage per executor) are recorded in the Spark history log, to minimize the amount of additional logging.
> We have attached our design documentation with this ticket and would like to receive feedback from the community on this proposal.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
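A minimal sketch of the collection approach the proposal outlines, assuming a hypothetical heartbeat payload (the MemoryHeartbeat and PeakTracker names are illustrative, not part of the proposal's code): only peak values are kept per executor, and unified memory is derived as execution plus storage.
{code:scala}
import scala.collection.mutable

// Hypothetical heartbeat payload carrying the metrics named in the proposal.
case class MemoryHeartbeat(execId: String, jvmUsed: Long, execMem: Long, storageMem: Long)

// Keep only peak values per executor, mirroring "only interesting values
// (peak values per stage per executor) are recorded".
class PeakTracker {
  private val peaks = mutable.Map.empty[String, MemoryHeartbeat]

  def update(hb: MemoryHeartbeat): Unit = {
    val prev = peaks.getOrElse(hb.execId, MemoryHeartbeat(hb.execId, 0L, 0L, 0L))
    peaks(hb.execId) = MemoryHeartbeat(
      hb.execId,
      math.max(prev.jvmUsed, hb.jvmUsed),
      math.max(prev.execMem, hb.execMem),
      math.max(prev.storageMem, hb.storageMem))
  }

  // Unified memory = execution + storage, as defined in the description.
  def unifiedPeak(execId: String): Option[Long] =
    peaks.get(execId).map(p => p.execMem + p.storageMem)
}
{code}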
[jira] [Updated] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Edwina Lu updated SPARK-23206: -- Attachment: (was: StageTab.png) > Additional Memory Tuning Metrics > > > Key: SPARK-23206 > URL: https://issues.apache.org/jira/browse/SPARK-23206 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Edwina Lu >Priority: Major > Attachments: ExecutorsTab.png, MemoryTuningMetricsDesignDoc.pdf > > > At LinkedIn, we have multiple clusters, running thousands of Spark > applications, and these numbers are growing rapidly. We need to ensure that > these Spark applications are well tuned – cluster resources, including > memory, should be used efficiently so that the cluster can support running > more applications concurrently, and applications should run quickly and > reliably. > Currently there is limited visibility into how much memory executors are > using, and users are guessing numbers for executor and driver memory sizing. > These estimates are often much larger than needed, leading to memory wastage. > Examining the metrics for one cluster for a month, the average percentage of > used executor memory (max JVM used memory across executors / > spark.executor.memory) is 35%, leading to an average of 591GB unused memory > per application (number of executors * (spark.executor.memory - max JVM used > memory)). Spark has multiple memory regions (user memory, execution memory, > storage memory, and overhead memory), and to understand how memory is being > used and fine-tune allocation between regions, it would be useful to have > information about how much memory is being used for the different regions. > To improve visibility into memory usage for the driver and executors and > different memory regions, the following additional memory metrics can be > tracked for each executor and driver: > * JVM used memory: the JVM heap size for the executor/driver. > * Execution memory: memory used for computation in shuffles, joins, sorts > and aggregations. > * Storage memory: memory used for caching and propagating internal data across > the cluster. > * Unified memory: sum of execution and storage memory. > The peak values for each memory metric can be tracked for each executor, and > also per stage. This information can be shown in the Spark UI and the REST > APIs. Information for peak JVM used memory can help with determining > appropriate values for spark.executor.memory and spark.driver.memory, and > information about the unified memory region can help with determining > appropriate values for spark.memory.fraction and > spark.memory.storageFraction. Stage memory information can help identify > which stages are most memory intensive, and users can look into the relevant > code to determine if it can be optimized. > The memory metrics can be gathered by adding the current JVM used memory, > execution memory and storage memory to the heartbeat. SparkListeners are > modified to collect the new metrics for the executors, stages and Spark > history log. Only interesting values (peak values per stage per executor) are > recorded in the Spark history log, to minimize the amount of additional > logging. > We have attached our design documentation with this ticket and would like to > receive feedback from the community for this proposal.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23289) OneForOneBlockFetcher.DownloadCallback.onData may write just a part of data
Shixiong Zhu created SPARK-23289: Summary: OneForOneBlockFetcher.DownloadCallback.onData may write just a part of data Key: SPARK-23289 URL: https://issues.apache.org/jira/browse/SPARK-23289 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.1, 2.2.0, 2.3.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
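For context on the bug class named in the summary: a single WritableByteChannel.write() call is allowed to consume only part of its buffer, so callers must loop until the buffer is drained. A sketch of that general pattern (illustrative only, not the actual patch):
{code:scala}
import java.nio.ByteBuffer
import java.nio.channels.WritableByteChannel

// write() may return after writing only some of the remaining bytes;
// looping until hasRemaining is false guarantees the full buffer is written.
def writeFully(channel: WritableByteChannel, buf: ByteBuffer): Unit = {
  while (buf.hasRemaining) {
    channel.write(buf)
  }
}
{code}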
[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347658#comment-16347658 ] Yanbo Liang commented on SPARK-23107: - [~sameerag] Yes, the fix should be an API scope change, but I think we can get them merged in one or two days. When do you plan to cut the next RC? Thanks. > ML, Graph 2.3 QA: API: New Scala APIs, docs > --- > > Key: SPARK-23107 > URL: https://issues.apache.org/jira/browse/SPARK-23107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Blocker > > Audit new public Scala APIs added to MLlib & GraphX. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please create JIRAs and link them to this issue. > For *user guide issues* link the new JIRAs to the relevant user guide QA > issue (SPARK-23111 for {{2.3}}) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23285) Allow spark.executor.cores to be fractional
[ https://issues.apache.org/jira/browse/SPARK-23285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23285: Assignee: (was: Apache Spark) > Allow spark.executor.cores to be fractional > --- > > Key: SPARK-23285 > URL: https://issues.apache.org/jira/browse/SPARK-23285 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Scheduler, Spark Submit >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Minor > > There is a strong check for an integral number of cores per executor in > [SparkSubmitArguments.scala#L270-L272|https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L270-L272]. > Given we're reusing that property in K8s, does it make sense to relax it? > > K8s treats CPU as a "compressible resource" and can actually assign millicpus > to individual containers. Also to be noted - spark.driver.cores has no such > check in place. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23285) Allow spark.executor.cores to be fractional
[ https://issues.apache.org/jira/browse/SPARK-23285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23285: Assignee: Apache Spark > Allow spark.executor.cores to be fractional > --- > > Key: SPARK-23285 > URL: https://issues.apache.org/jira/browse/SPARK-23285 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Scheduler, Spark Submit >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Assignee: Apache Spark >Priority: Minor > > There is a strong check for an integral number of cores per executor in > [SparkSubmitArguments.scala#L270-L272|https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L270-L272]. > Given we're reusing that property in K8s, does it make sense to relax it? > > K8s treats CPU as a "compressible resource" and can actually assign millicpus > to individual containers. Also to be noted - spark.driver.cores has no such > check in place. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23285) Allow spark.executor.cores to be fractional
[ https://issues.apache.org/jira/browse/SPARK-23285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347651#comment-16347651 ] Apache Spark commented on SPARK-23285: -- User 'liyinan926' has created a pull request for this issue: https://github.com/apache/spark/pull/20460 > Allow spark.executor.cores to be fractional > --- > > Key: SPARK-23285 > URL: https://issues.apache.org/jira/browse/SPARK-23285 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Scheduler, Spark Submit >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Minor > > There is a strong check for an integral number of cores per executor in > [SparkSubmitArguments.scala#L270-L272|https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L270-L272]. > Given we're reusing that property in K8s, does it make sense to relax it? > > K8s treats CPU as a "compressible resource" and can actually assign millicpus > to individual containers. Also to be noted - spark.driver.cores has no such > check in place. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23107: Assignee: Apache Spark (was: Yanbo Liang) > ML, Graph 2.3 QA: API: New Scala APIs, docs > --- > > Key: SPARK-23107 > URL: https://issues.apache.org/jira/browse/SPARK-23107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Blocker > > Audit new public Scala APIs added to MLlib & GraphX. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please create JIRAs and link them to this issue. > For *user guide issues* link the new JIRAs to the relevant user guide QA > issue (SPARK-23111 for {{2.3}}) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23107: Assignee: Yanbo Liang (was: Apache Spark) > ML, Graph 2.3 QA: API: New Scala APIs, docs > --- > > Key: SPARK-23107 > URL: https://issues.apache.org/jira/browse/SPARK-23107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Blocker > > Audit new public Scala APIs added to MLlib & GraphX. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please create JIRAs and link them to this issue. > For *user guide issues* link the new JIRAs to the relevant user guide QA > issue (SPARK-23111 for {{2.3}}) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347592#comment-16347592 ] Apache Spark commented on SPARK-23107: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/20459 > ML, Graph 2.3 QA: API: New Scala APIs, docs > --- > > Key: SPARK-23107 > URL: https://issues.apache.org/jira/browse/SPARK-23107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Blocker > > Audit new public Scala APIs added to MLlib & GraphX. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please create JIRAs and link them to this issue. > For *user guide issues* link the new JIRAs to the relevant user guide QA > issue (SPARK-23111 for {{2.3}}) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347576#comment-16347576 ] Sameer Agarwal commented on SPARK-23107: [~yanboliang] other than adding docs, are you considering any pending API changes that should block the next RC? > ML, Graph 2.3 QA: API: New Scala APIs, docs > --- > > Key: SPARK-23107 > URL: https://issues.apache.org/jira/browse/SPARK-23107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Blocker > > Audit new public Scala APIs added to MLlib & GraphX. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please create JIRAs and link them to this issue. > For *user guide issues* link the new JIRAs to the relevant user guide QA > issue (SPARK-23111 for {{2.3}}) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21525) ReceiverSupervisorImpl seems to ignore the error code when writing to the WAL
[ https://issues.apache.org/jira/browse/SPARK-21525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-21525. Resolution: Fixed Fix Version/s: 2.4.0 > ReceiverSupervisorImpl seems to ignore the error code when writing to the WAL > - > > Key: SPARK-21525 > URL: https://issues.apache.org/jira/browse/SPARK-21525 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Mark Grover >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 2.4.0 > > > {{AddBlock}} returns an error code related to whether writing the block to > the WAL was successful or not. In cases where a WAL may be unavailable > temporarily, the write would fail but it seems like we are not using the > return code (see > [here|https://github.com/apache/spark/blob/ba8912e5f3d5c5a366cb3d1f6be91f2471d048d2/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisorImpl.scala#L162]). > For example, when using the Flume Receiver, we should be sending a nack back > to Flume if the block wasn't written to the WAL. I haven't gone through the > full code path yet but at least from looking at the ReceiverSupervisorImpl, > it doesn't seem like that return code is being used. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21525) ReceiverSupervisorImpl seems to ignore the error code when writing to the WAL
[ https://issues.apache.org/jira/browse/SPARK-21525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-21525: -- Assignee: Marcelo Vanzin > ReceiverSupervisorImpl seems to ignore the error code when writing to the WAL > - > > Key: SPARK-21525 > URL: https://issues.apache.org/jira/browse/SPARK-21525 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Mark Grover >Assignee: Marcelo Vanzin >Priority: Major > > {{AddBlock}} returns an error code related to whether writing the block to > the WAL was successful or not. In cases where a WAL may be unavailable > temporarily, the write would fail but it seems like we are not using the > return code (see > [here|https://github.com/apache/spark/blob/ba8912e5f3d5c5a366cb3d1f6be91f2471d048d2/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisorImpl.scala#L162]). > For example, when using the Flume Receiver, we should be sending a nack back > to Flume if the block wasn't written to the WAL. I haven't gone through the > full code path yet but at least from looking at the ReceiverSupervisorImpl, > it doesn't seem like that return code is being used. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
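A sketch of the shape of the fix discussed above (names simplified; this is not the committed patch): the Boolean result of the WAL write is inspected and surfaced instead of being discarded, so a receiver such as the Flume one can nack the batch.
{code:scala}
import org.apache.spark.SparkException

// writeToWal stands in for the AddBlock round trip; the point is only that
// its Boolean result must be checked rather than dropped.
def storeBlockChecked(writeToWal: () => Boolean): Unit = {
  if (!writeToWal()) {
    // Propagate the failure so the receiver can report it upstream.
    throw new SparkException("Failed to write the block to the WAL")
  }
}
{code}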
[jira] [Commented] (SPARK-23285) Allow spark.executor.cores to be fractional
[ https://issues.apache.org/jira/browse/SPARK-23285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347484#comment-16347484 ] Yinan Li commented on SPARK-23285: -- Another option is to bypass that check for Kubernetes mode. This minimizes the code changes. Thoughts? > Allow spark.executor.cores to be fractional > --- > > Key: SPARK-23285 > URL: https://issues.apache.org/jira/browse/SPARK-23285 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Scheduler, Spark Submit >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Priority: Minor > > There is a strong check for an integral number of cores per executor in > [SparkSubmitArguments.scala#L270-L272|https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L270-L272]. > Given we're reusing that property in K8s, does it make sense to relax it? > > K8s treats CPU as a "compressible resource" and can actually assign millicpus > to individual containers. Also to be noted - spark.driver.cores has no such > check in place. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
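A sketch of the relaxation being discussed (illustrative only; the isKubernetes helper is a hypothetical stand-in, not existing code): accept fractional values when the master is Kubernetes, and keep the integral check otherwise.
{code:scala}
import scala.util.Try

// Hypothetical relaxed validation: K8s can honor fractional CPU requests
// (millicpus), so only non-K8s masters keep the integer-only restriction.
def validateExecutorCores(executorCores: String, master: String): Unit = {
  def isKubernetes(m: String): Boolean = m.startsWith("k8s://")
  val parsed: Option[Double] =
    if (isKubernetes(master)) Try(executorCores.toDouble).toOption
    else Try(executorCores.toInt.toDouble).toOption
  require(parsed.exists(_ > 0),
    s"spark.executor.cores must be a positive " +
      s"${if (isKubernetes(master)) "number" else "integer"}, got: $executorCores")
}
{code}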
[jira] [Commented] (SPARK-23274) ReplaceExceptWithFilter fails on dataframes filtered on same column
[ https://issues.apache.org/jira/browse/SPARK-23274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347480#comment-16347480 ] Andrew Ash commented on SPARK-23274: Many thanks for the fast fix [~smilegator]! > ReplaceExceptWithFilter fails on dataframes filtered on same column > --- > > Key: SPARK-23274 > URL: https://issues.apache.org/jira/browse/SPARK-23274 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Onur Satici >Assignee: Xiao Li >Priority: Blocker > Fix For: 2.3.0 > > > Currently affects: > {code:java} > $ git tag --contains 01f6ba0e7a > v2.3.0-rc1 > v2.3.0-rc2 > {code} > Steps to reproduce: > {code:java} > $ cat test.csv > a,b > 1,2 > 1,3 > 2,2 > 2,4 > {code} > {code:java} > val df = spark.read.format("csv").option("header", "true").load("test.csv") > val df1 = df.filter($"a" === 1) > val df2 = df.filter($"a" === 2) > df1.select("b").except(df2.select("b")).show > {code} > results in: > {code:java} > java.util.NoSuchElementException: key not found: a > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at scala.collection.AbstractMap.apply(Map.scala:59) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$.org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition(ReplaceExceptWithFilter.scala:60) > at > 
org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:50) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:48) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at >
[jira] [Resolved] (SPARK-20826) Support compression/decompression of ColumnVector in generated code
[ https://issues.apache.org/jira/browse/SPARK-20826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-20826. -- Resolution: Duplicate > Support compression/decompression of ColumnVector in generated code > --- > > Key: SPARK-20826 > URL: https://issues.apache.org/jira/browse/SPARK-20826 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347405#comment-16347405 ] Weichen Xu commented on SPARK-23110: [~yanboliang] OK. > ML 2.3 QA: API: Java compatibility, docs > > > Key: SPARK-23110 > URL: https://issues.apache.org/jira/browse/SPARK-23110 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Blocker > Attachments: 1_process_script.sh, added_ml_class, > different_methods_in_ML.diff > > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). > * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are not great tools. In the past, this task has been done by: > ** Generating API docs > ** Building JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so we can make this > task easier in the future! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20823) Generate code to build table cache using ColumnarBatch and to get value from ColumnVector for other types
[ https://issues.apache.org/jira/browse/SPARK-20823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-20823. -- Resolution: Won't Fix > Generate code to build table cache using ColumnarBatch and to get value from > ColumnVector for other types > - > > Key: SPARK-20823 > URL: https://issues.apache.org/jira/browse/SPARK-20823 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20821) Add compression/decompression of data to ColumnVector for other data types
[ https://issues.apache.org/jira/browse/SPARK-20821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-20821. -- Resolution: Won't Fix > Add compression/decompression of data to ColumnVector for other data types > -- > > Key: SPARK-20821 > URL: https://issues.apache.org/jira/browse/SPARK-20821 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20820) Add compression/decompression of data to ColumnVector for other compression schemes
[ https://issues.apache.org/jira/browse/SPARK-20820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-20820. -- Resolution: Duplicate > Add compression/decompression of data to ColumnVector for other compression > schemes > --- > > Key: SPARK-20820 > URL: https://issues.apache.org/jira/browse/SPARK-20820 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20824) Generate code to get value from table cache with wider column in ColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-20824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-20824. -- Resolution: Won't Fix > Generate code to get value from table cache with wider column in ColumnarBatch > -- > > Key: SPARK-20824 > URL: https://issues.apache.org/jira/browse/SPARK-20824 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20825) Generate code to get value from table cache with wider column in ColumnarBatch for other data types
[ https://issues.apache.org/jira/browse/SPARK-20825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-20825. -- Resolution: Not A Problem > Generate code to get value from table cache with wider column in > ColumnarBatch for other data types > --- > > Key: SPARK-20825 > URL: https://issues.apache.org/jira/browse/SPARK-20825 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22747) Shorten lifetime of global variables used in HashAggregateExec
[ https://issues.apache.org/jira/browse/SPARK-22747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-22747. -- Resolution: Won't Fix > Shorten lifetime of global variables used in HashAggregateExec > -- > > Key: SPARK-22747 > URL: https://issues.apache.org/jira/browse/SPARK-22747 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > Generated code in {{HashAggregateExec}} uses global mutable variables that > are passed to successor operations through the {{consume()}} method. It may cause > the issue described in SPARK-22668. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)
[ https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21657: Component/s: (was: Spark Core) > Spark has exponential time complexity to explode(array of structs) > -- > > Key: SPARK-21657 > URL: https://issues.apache.org/jira/browse/SPARK-21657 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0 >Reporter: Ruslan Dautkhanov >Assignee: Ohad Raviv >Priority: Major > Labels: cache, caching, collections, nested_types, performance, > pyspark, sparksql, sql > Fix For: 2.3.0 > > Attachments: ExponentialTimeGrowth.PNG, > nested-data-generator-and-test.py > > > It can take up to half a day to explode a modest-sized nested collection > (0.5m) on recent Xeon processors. > See the attached pyspark script that reproduces this problem. > {code} > cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + > table_name).cache() > print cached_df.count() > {code} > This script generates a number of tables with the same total number of > records across all nested collections (see the `scaling` variable in the loops). > The `scaling` variable scales up how many nested elements are in each record, but > scales down the number of records in the table by the same factor, so the total number > of records stays the same. > Time grows exponentially (notice the log-10 vertical axis scale): > !ExponentialTimeGrowth.PNG! > At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to > explode the nested collections (!) of 8k records. > After 1000 elements in a nested collection, time grows exponentially. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347385#comment-16347385 ] Apache Spark commented on SPARK-23110: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/20457 > ML 2.3 QA: API: Java compatibility, docs > > > Key: SPARK-23110 > URL: https://issues.apache.org/jira/browse/SPARK-23110 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Blocker > Attachments: 1_process_script.sh, added_ml_class, > different_methods_in_ML.diff > > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). > * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are not great tools. In the past, this task has been done by: > ** Generating API docs > ** Building JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so we can make this > task easier in the future! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22036) BigDecimal multiplication sometimes returns null
[ https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22036: Component/s: (was: Spark Core) SQL > BigDecimal multiplication sometimes returns null > > > Key: SPARK-22036 > URL: https://issues.apache.org/jira/browse/SPARK-22036 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Olivier Blanvillain >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.0 > > > The multiplication of two BigDecimal numbers sometimes returns null. Here is > a minimal reproduction: > {code:java} > object Main extends App { > import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.sql.SparkSession > val conf = new > SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", > "false") > val spark = > SparkSession.builder().config(conf).appName("REPL").getOrCreate() > import spark.implicits._ > implicit val sqlContext = spark.sqlContext > case class X2(a: BigDecimal, b: BigDecimal) > val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), > BigDecimal(-1000.1)))) > val result = ds.select(ds("a") * ds("b")).collect.head > println(result) // [null] > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
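For readers hitting this: the null comes from decimal precision promotion rather than from the multiplication itself. Scala BigDecimal maps to DecimalType(38, 18) by default, and multiplying two such columns requests precision p1 + p2 + 1 and scale s1 + s2 (77 and 36 here), which must be capped at the 38-digit maximum; on 2.2.x a product that no longer fits the capped type is returned as null. A quick way to see the promoted type (a sketch, assuming a running SparkSession named spark):
{code:scala}
import spark.implicits._

val df = Seq((BigDecimal(-0.1267333984375), BigDecimal(-1000.1))).toDF("a", "b")
df.select(df("a") * df("b")).printSchema()
// On 2.2.x this prints a capped type along the lines of:
// root
//  |-- (a * b): decimal(38,36) (nullable = true)
// 126.75... needs 3 integer digits, but 38 - 36 leaves room for only 2,
// so the exact value cannot be represented and the result becomes null.
{code}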
[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347386#comment-16347386 ] Yanbo Liang commented on SPARK-23107: - [~mlnick] Sorry for the late response, I have been really busy recently. This task is almost finished; I will submit the PR today or tomorrow. Thanks. > ML, Graph 2.3 QA: API: New Scala APIs, docs > --- > > Key: SPARK-23107 > URL: https://issues.apache.org/jira/browse/SPARK-23107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Blocker > > Audit new public Scala APIs added to MLlib & GraphX. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please create JIRAs and link them to this issue. > For *user guide issues* link the new JIRAs to the relevant user guide QA > issue (SPARK-23111 for {{2.3}}) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23110: Assignee: Apache Spark (was: Weichen Xu) > ML 2.3 QA: API: Java compatibility, docs > > > Key: SPARK-23110 > URL: https://issues.apache.org/jira/browse/SPARK-23110 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Blocker > Attachments: 1_process_script.sh, added_ml_class, > different_methods_in_ML.diff > > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). > * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are not great tools. In the past, this task has been done by: > ** Generating API docs > ** Building JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so we can make this > task easier in the future! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23110: Assignee: Weichen Xu (was: Apache Spark) > ML 2.3 QA: API: Java compatibility, docs > > > Key: SPARK-23110 > URL: https://issues.apache.org/jira/browse/SPARK-23110 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Blocker > Attachments: 1_process_script.sh, added_ml_class, > different_methods_in_ML.diff > > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). > * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are not great tools. In the past, this task has been done by: > ** Generating API docs > ** Building JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so we can make this > task easier in the future! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347381#comment-16347381 ] Yanbo Liang commented on SPARK-23110: - [~mlnick] Yes, we should make it {{private[ml]}}. I can fix it in SPARK-23107, thanks for picking it up. > ML 2.3 QA: API: Java compatibility, docs > > > Key: SPARK-23110 > URL: https://issues.apache.org/jira/browse/SPARK-23110 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Blocker > Attachments: 1_process_script.sh, added_ml_class, > different_methods_in_ML.diff > > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). > * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are not great tools. In the past, this task has been done by: > ** Generating API docs > ** Building JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so we can make this > task easier in the future! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22130) UTF8String.trim() inefficiently scans all white-space string twice.
[ https://issues.apache.org/jira/browse/SPARK-22130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22130: Component/s: (was: Spark Core) SQL > UTF8String.trim() inefficiently scans all white-space string twice. > --- > > Key: SPARK-22130 > URL: https://issues.apache.org/jira/browse/SPARK-22130 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Fix For: 2.3.0 > > > {{UTF8String.trim()}} scans a string consisting only of white space (e.g. {{" > "}}) twice, which is inefficient. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
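A sketch of the single-pass idea behind the fix (operating on plain Strings here for brevity; UTF8String applies the same logic over its bytes): advance a start pointer and retreat an end pointer until they meet, so an all-white-space string is scanned once rather than twice.
{code:scala}
def trimSpaces(s: String): String = {
  var start = 0
  var end = s.length - 1
  // Find the first non-space character.
  while (start <= end && s.charAt(start) == ' ') start += 1
  // Walk back from the end, but never past `start`, so an all-space
  // string is not scanned a second time.
  while (end > start && s.charAt(end) == ' ') end -= 1
  if (start > end) "" else s.substring(start, end + 1)
}
{code}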
[jira] [Updated] (SPARK-22247) Hive partition filter very slow
[ https://issues.apache.org/jira/browse/SPARK-22247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22247: Component/s: (was: Spark Core) > Hive partition filter very slow > --- > > Key: SPARK-22247 > URL: https://issues.apache.org/jira/browse/SPARK-22247 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.2, 2.1.1 >Reporter: Patrick Duin >Priority: Minor > Fix For: 2.3.0 > > > I found an issue where filtering partitions using a dataframe results in very > bad performance. > To reproduce: > Create a Hive table with a lot of partitions and write a Spark query on that > table that filters based on the partition column. > In my use case I've got a table with about 30k partitions. > I filter the partitions using some scala via spark-shell: > {{table.filter("partition=x or partition=y")}} > This results in a Hive Thrift API call: {{#get_partitions('db', 'table', > -1)}}, which is very slow (minutes) and loads all metastore partitions into memory. > Doing a simpler filter: > {{table.filter("partition=x")}} > results in a Hive Thrift API call: {{#get_partitions_by_filter('db', 'table', > 'partition = "x"', -1)}}, which is very fast (seconds) and only fetches > partition X into memory. > If possible, Spark should translate the filter into the more performant Thrift > call, or fall back to a more scalable solution where it filters the partitions > without having to load them all into memory first (for instance fetching > the partitions in batches). > I've posted my original question on > [SO|https://stackoverflow.com/questions/46152526/how-should-i-configure-spark-to-correctly-prune-hive-metastore-partitions] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
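A related knob worth checking on the affected versions (behavior differs across 2.0.x and 2.1.x, so treat this as a hint rather than a fix): when metastore partition pruning is enabled, Spark tries to push partition predicates down as a metastore filter string, and predicates it cannot convert, such as some disjunctions on these versions, fall back to the slow full listing described above.
{code:scala}
// Assumes `spark` and `table` as in the report above.
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")

val fast = table.filter("partition = 'x'") // convertible: get_partitions_by_filter
val slow = table.filter("partition = 'x' or partition = 'y'") // may fall back to get_partitions
{code}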
[jira] [Updated] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error
[ https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22291: Component/s: (was: Spark Core) > Postgresql UUID[] to Cassandra: Conversion Error > > > Key: SPARK-22291 > URL: https://issues.apache.org/jira/browse/SPARK-22291 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Debian Linux, Scala 2.11, Spark 2.2.0, PostgreSQL 9.6, > Cassandra 3 >Reporter: Fabio J. Walter >Assignee: Jen-Ming Chung >Priority: Major > Labels: patch, postgresql, sql > Fix For: 2.2.1, 2.3.0 > > Attachments: > org_apache_spark_sql_execution_datasources_jdbc_JdbcUtil.png > > > My job reads data from a PostgreSQL table that contains a user_ids column of > uuid[] type, and I get the error below when trying to save the data > to Cassandra. > However, creating this same table on Cassandra, with user_ids as a > list, works fine. > I can't change the type on the source table, because I'm reading data from a > legacy system. > I've been looking at the point printed in the log, in class > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.scala > Stacktrace on Spark: > {noformat} > Caused by: java.lang.ClassCastException: [Ljava.util.UUID; cannot be cast to > [Ljava.lang.String; > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:443) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:442) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$nullSafeConvert(JdbcUtils.scala:482) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:470) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:469) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:133) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038) > at > 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at
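A common workaround for this class of mismatch (a sketch, not the committed fix; the connection details and the legacy_table name are placeholders): let PostgreSQL cast the column so the JDBC driver hands Spark a text[] instead of a uuid[].
{code:scala}
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("user", "spark")
  .option("password", "***")
  // Push the uuid[] -> text[] conversion into PostgreSQL itself.
  .option("dbtable", "(SELECT user_ids::text[] AS user_ids FROM legacy_table) t")
  .load()
{code}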
[jira] [Resolved] (SPARK-22935) Dataset with Java Beans for java.sql.Date throws CompileException
[ https://issues.apache.org/jira/browse/SPARK-22935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-22935. -- Resolution: Invalid > Dataset with Java Beans for java.sql.Date throws CompileException > - > > Key: SPARK-22935 > URL: https://issues.apache.org/jira/browse/SPARK-22935 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > The following code can throw an exception with or without whole-stage codegen. > {code} > public void SPARK22935() { > Dataset<CDR> cdr = spark > .read() > .format("csv") > .option("header", "true") > .option("inferSchema", "true") > .option("delimiter", ";") > .csv("CDR_SAMPLE.csv") > .as(Encoders.bean(CDR.class)); > Dataset<CDR> ds = cdr.filter((FilterFunction<CDR>) x -> (x.timestamp != > null)); > long c = ds.count(); > cdr.show(2); > ds.show(2); > System.out.println("cnt=" + c); > } > // CDR.java > public class CDR implements java.io.Serializable { > public java.sql.Date timestamp; > public java.sql.Date getTimestamp() { return this.timestamp; } > public void setTimestamp(java.sql.Date timestamp) { this.timestamp = > timestamp; } > } > // CDR_SAMPLE.csv > timestamp > 2017-10-29T02:37:07.815Z > 2017-10-29T02:38:07.815Z > {code} > result > {code} > 12:17:10.352 ERROR > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 61, Column 70: No applicable constructor/method found > for actual parameters "long"; candidates are: "public static java.sql.Date > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)" > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 61, Column 70: No applicable constructor/method found for actual parameters > "long"; candidates are: "public static java.sql.Date > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)" > at > org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821) > ... > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
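A note on why this was likely closed as Invalid (the diagnosis here is inferred from the error message, so treat it as a sketch): with inferSchema the ISO-8601 values are read as TimestampType, whose internal form is a long of microseconds, while the bean declares java.sql.Date, whose converter toJavaDate expects an int of days, hence the "No applicable constructor/method found for actual parameters long" failure. Declaring the bean field as java.sql.Timestamp lines the types up; the inference itself can be checked like this (assuming a running SparkSession named spark and the CSV from the report):
{code:scala}
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ";")
  .csv("CDR_SAMPLE.csv")
df.printSchema()
// root
//  |-- timestamp: timestamp (nullable = true)   <- TimestampType, not DateType
{code}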
[jira] [Commented] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347369#comment-16347369 ] Weichen Xu commented on SPARK-23110: +1. We should make it `private[ml]`. > ML 2.3 QA: API: Java compatibility, docs > > > Key: SPARK-23110 > URL: https://issues.apache.org/jira/browse/SPARK-23110 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Blocker > Attachments: 1_process_script.sh, added_ml_class, > different_methods_in_ML.diff > > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). > * If needed for complex issues, create small Java unit tests which execute > each method; see the sketch after this message. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are no great tools. In the past, this task has been done by: > ** Generating API docs > ** Building the JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so, so that we can make this task easier in the future! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
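As the checklist recommends, one lightweight check is a Java source file that calls spark.ml APIs directly: compiling it with javac is itself the test, since a signature that only resolves through Scala-specific types fails at compile time. A minimal sketch, assuming LogisticRegression as a representative estimator; the training Dataset would come from a test fixture.
{code}
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class JavaCompatCheck {
  // If a spark.ml method here leaked a Scala-only type (e.g. a Scala
  // collection or an inner object type), this file would not compile.
  public static LogisticRegressionModel check(Dataset<Row> training) {
    LogisticRegression lr = new LogisticRegression()
        .setMaxIter(10)
        .setRegParam(0.01);
    return lr.fit(training);
  }
}
{code}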
[jira] [Updated] (SPARK-19236) Add createOrReplaceGlobalTempView
[ https://issues.apache.org/jira/browse/SPARK-19236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-19236: Component/s: (was: Spark Core) SQL > Add createOrReplaceGlobalTempView > - > > Key: SPARK-19236 > URL: https://issues.apache.org/jira/browse/SPARK-19236 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Arman Yazdani >Assignee: Arman Yazdani >Priority: Minor > Fix For: 2.2.0, 2.3.0 > > > There are 3 methods for saving temp views: > createTempView > createOrReplaceTempView > createGlobalTempView > but there isn't a: > createOrReplaceGlobalTempView -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
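With the fix in 2.2.0/2.3.0, usage mirrors the existing temp-view counterpart; global temp views are resolved under the global_temp database. A brief sketch, where the view name "people" is hypothetical.
{code}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GlobalViewExample {
  public static void register(SparkSession spark, Dataset<Row> df) {
    // Replaces an existing global view of the same name instead of
    // failing the way createGlobalTempView does on a duplicate.
    df.createOrReplaceGlobalTempView("people");
    // Global temp views live in the cross-session global_temp database.
    spark.sql("SELECT * FROM global_temp.people").show();
  }
}
{code}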