[jira] [Created] (SPARK-26251) isnan function not picking non-numeric values
Kunal Rao created SPARK-26251:
---------------------------------

             Summary: isnan function not picking non-numeric values
                 Key: SPARK-26251
                 URL: https://issues.apache.org/jira/browse/SPARK-26251
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.0
            Reporter: Kunal Rao

import org.apache.spark.sql.functions._
List("po box 7896", "8907", "435435").toDF("rgid").filter(isnan(col("rgid"))).show

should pick "po box 7896"

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
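For context on the report above: `isnan` tests for the floating-point NaN value, which is distinct from a string that merely fails to parse as a number; casting a non-numeric string to a double in Spark SQL typically yields null, not NaN, so `isnan` would not match "po box 7896". A plain-Python sketch of the distinction (illustrative only, not Spark's implementation; `parses_as_number` is a made-up helper):

```python
import math

# NaN is a specific IEEE-754 floating-point value...
nan_value = float("nan")
assert math.isnan(nan_value)

# ...whereas a non-numeric string is not NaN; it simply fails to parse.
def parses_as_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

rgids = ["po box 7896", "8907", "435435"]
non_numeric = [s for s in rgids if not parses_as_number(s)]
# non_numeric == ["po box 7896"]
```

Under these semantics, a NaN check and a "does not parse as a number" check are different predicates, which would explain the behavior reported here.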
[jira] [Updated] (SPARK-26249) Extension Points Enhancements to inject a rule in order and to add a batch
[ https://issues.apache.org/jira/browse/SPARK-26249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sunitha Kambhampati updated SPARK-26249:
----------------------------------------
    Affects Version/s:     (was: 2.4.0)
                       3.0.0

> Extension Points Enhancements to inject a rule in order and to add a batch
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-26249
>                 URL: https://issues.apache.org/jira/browse/SPARK-26249
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Sunitha Kambhampati
>            Priority: Major
>
> +Motivation:+
> Spark has an extension points API that allows third parties to extend Spark with custom optimization rules. The current API does not allow fine-grained control over when an optimization rule is exercised. In the current API, there is no way to add a batch to the optimizer using the SparkSessionExtensions API, similar to the postHocOptimizationBatches in SparkOptimizer.
> In our use cases, we have optimization rules that we want to add as extensions to a batch in a specific order.
> +Proposal:+
> Add two new APIs to the existing extension points to allow more flexibility for third-party users of Spark:
> # Inject an optimizer rule into a batch, in order
> # Inject an optimizer batch, in order
> The design spec is here: [https://drive.google.com/file/d/1m7rQZ9OZFl0MH5KS12CiIg3upLJSYfsA/view?usp=sharing]
[jira] [Updated] (SPARK-26250) Fail to run dataframe.R examples
[ https://issues.apache.org/jira/browse/SPARK-26250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean Pierre PIN updated SPARK-26250:
------------------------------------
    Description:
I get an error=2 when running spark-submit examples/src/main/r/dataframe.R.
The script works in RStudio, but I changed the library(SparkR) line to:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
I am at the top-level directory of the Spark installation, and the PATH variable includes /bin, so spark-submit is found. On Windows 7 Ultimate 64-bit the error reads:
Exception in thread "main" java.io.IOException: Cannot run program "Rscript": CreateProcess error=2, The system cannot find the file specified
I think this issue has been known for a long time, but I cannot find any post about it. Thanks for any answer.

  was: (same text, but reading "Windows 7 Pro 64-bit")
> Fail to run dataframe.R examples
> --------------------------------
>
>                 Key: SPARK-26250
>                 URL: https://issues.apache.org/jira/browse/SPARK-26250
>             Project: Spark
>          Issue Type: Test
>          Components: Examples
>    Affects Versions: 2.4.0
>            Reporter: Jean Pierre PIN
>            Priority: Major
[jira] [Created] (SPARK-26250) Fail to run dataframe.R examples
Jean Pierre PIN created SPARK-26250:
------------------------------------

             Summary: Fail to run dataframe.R examples
                 Key: SPARK-26250
                 URL: https://issues.apache.org/jira/browse/SPARK-26250
             Project: Spark
          Issue Type: Test
          Components: Examples
    Affects Versions: 2.4.0
            Reporter: Jean Pierre PIN

I get an error=2 when running spark-submit examples/src/main/r/dataframe.R.
The script works in RStudio, but I changed the library(SparkR) line to:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
I am at the top-level directory of the Spark installation, and the PATH variable includes /bin, so spark-submit is found. On Windows 7 Pro 64-bit the error reads:
Exception in thread "main" java.io.IOException: Cannot run program "Rscript": CreateProcess error=2, The system cannot find the file specified
I think this issue has been known for a long time, but I cannot find any post about it. Thanks for any answer.
[jira] [Commented] (SPARK-26228) OOM issue encountered when computing Gramian matrix
[ https://issues.apache.org/jira/browse/SPARK-26228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706751#comment-16706751 ]

Chen Lin commented on SPARK-26228:
----------------------------------
I have tried increasing spark.driver.memory from 8g to 16g. It doesn't help.
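Taking the reporter's figures at face value, the arithmetic suggests why more driver memory does not help and why the failure surfaces during serialization rather than at allocation (a sketch; the exact buffer layout inside treeAggregate may differ from a dense n×n array):

```python
n = 16_000
bytes_per_double = 8

# Dense n x n buffer of doubles, as cited in the report.
buffer_bytes = n * n * bytes_per_double      # 2_048_000_000 bytes

# Maximum JVM array length (elements), i.e. Integer.MAX_VALUE.
jvm_array_limit = 2**31 - 1                  # 2_147_483_647

# The raw buffer fits under the limit, but only barely (~100 MB headroom)...
assert buffer_bytes < jvm_array_limit

# ...so any growth step that roughly doubles the backing array (as
# ByteArrayOutputStream.grow does while the buffer is being serialized)
# requests a size past the limit and triggers
# "Requested array size exceeds VM limit".
assert 2 * buffer_bytes > jvm_array_limit
```

This is an array-length limit, not a heap-size limit, which is consistent with raising spark.driver.memory having no effect.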
[jira] [Commented] (SPARK-26228) OOM issue encountered when computing Gramian matrix
[ https://issues.apache.org/jira/browse/SPARK-26228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706721#comment-16706721 ]

shahid commented on SPARK-26228:
--------------------------------
Could you please increase the driver memory and check?
[jira] [Updated] (SPARK-26249) Extension Points Enhancements to inject a rule in order and to add a batch
[ https://issues.apache.org/jira/browse/SPARK-26249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sunitha Kambhampati updated SPARK-26249:
----------------------------------------
    Description:
+Motivation:+
Spark has an extension points API that allows third parties to extend Spark with custom optimization rules. The current API does not allow fine-grained control over when an optimization rule is exercised. In the current API, there is no way to add a batch to the optimizer using the SparkSessionExtensions API, similar to the postHocOptimizationBatches in SparkOptimizer.
In our use cases, we have optimization rules that we want to add as extensions to a batch in a specific order.
+Proposal:+
Add two new APIs to the existing extension points to allow more flexibility for third-party users of Spark:
# Inject an optimizer rule into a batch, in order
# Inject an optimizer batch, in order
The design spec is here: [https://drive.google.com/file/d/1m7rQZ9OZFl0MH5KS12CiIg3upLJSYfsA/view?usp=sharing]
[jira] [Commented] (SPARK-26249) Extension Points Enhancements to inject a rule in order and to add a batch
[ https://issues.apache.org/jira/browse/SPARK-26249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706720#comment-16706720 ]

Sunitha Kambhampati commented on SPARK-26249:
---------------------------------------------
I will post a PR soon.
[jira] [Updated] (SPARK-26249) Extension Points Enhancements to inject a rule in order and to add a batch
[ https://issues.apache.org/jira/browse/SPARK-26249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sunitha Kambhampati updated SPARK-26249:
----------------------------------------
    Description:
+Motivation:+
Spark has an extension points API that allows third parties to extend Spark with custom optimization rules. The current API does not allow fine-grained control over when an optimization rule is exercised.
In our use cases, we have optimization rules that we want to add as extensions to a batch in a specific order.
In the current API, there is no way to add a batch to the optimizer using the SparkSessionExtensions API, similar to the postHocOptimizationBatches in SparkOptimizer.
+Proposal:+
Add two new APIs to the existing extension points to allow more flexibility for third-party users of Spark:
# Inject an optimizer rule into a batch, in order
# Inject an optimizer batch, in order
The design spec is here: [https://drive.google.com/file/d/1m7rQZ9OZFl0MH5KS12CiIg3upLJSYfsA/view?usp=sharing]
[jira] [Updated] (SPARK-26249) Extension Points Enhancements to inject a rule in order and to add a batch
[ https://issues.apache.org/jira/browse/SPARK-26249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sunitha Kambhampati updated SPARK-26249:
----------------------------------------
    Description: (formatting of the design-spec link updated; the Motivation/Proposal text is otherwise unchanged)
[jira] [Updated] (SPARK-26249) Extension Points Enhancements to inject a rule in order and to add a batch
[ https://issues.apache.org/jira/browse/SPARK-26249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sunitha Kambhampati updated SPARK-26249:
----------------------------------------
    Description: (formatting of the design-spec link updated; the Motivation/Proposal text is otherwise unchanged)
[jira] [Created] (SPARK-26249) Extension Points Enhancements to inject a rule in order and to add a batch
Sunitha Kambhampati created SPARK-26249:
----------------------------------------

             Summary: Extension Points Enhancements to inject a rule in order and to add a batch
                 Key: SPARK-26249
                 URL: https://issues.apache.org/jira/browse/SPARK-26249
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Sunitha Kambhampati

+Motivation:+
Spark has an extension points API that allows third parties to extend Spark with custom optimization rules. The current API does not allow fine-grained control over when an optimization rule is exercised.
In our use cases, we have optimization rules that we want to add as extensions to a batch in a specific order.
In the current API, there is no way to add a batch to the optimizer using the SparkSessionExtensions API, similar to the postHocOptimizationBatches in SparkOptimizer.
+Proposal:+
Add two new APIs to the existing extension points to allow more flexibility for third-party users of Spark:
# Inject an optimizer rule into a batch, in order
# Inject an optimizer batch, in order
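As a conceptual illustration of the proposed "inject in order" semantics, the sketch below models an optimizer as an ordered list of named batches, each holding an ordered list of rule names. This is plain Python for illustration only; `inject_rule_after` and `inject_batch_after` are hypothetical names, not the actual SparkSessionExtensions API:

```python
# An optimizer modeled as ordered (batch_name, rules) pairs; the batch and
# rule names below are placeholders, not Catalyst's real batch layout.
optimizer = [
    ("operator_optimizations", ["ConstantFolding", "PushDownPredicate"]),
    ("post_hoc", ["UserRuleA"]),
]

def inject_rule_after(optimizer, batch_name, anchor_rule, new_rule):
    """Insert new_rule immediately after anchor_rule inside batch_name."""
    for name, rules in optimizer:
        if name == batch_name:
            rules.insert(rules.index(anchor_rule) + 1, new_rule)
            return
    raise KeyError(batch_name)

def inject_batch_after(optimizer, anchor_batch, new_batch):
    """Insert a whole new batch immediately after anchor_batch."""
    idx = [name for name, _ in optimizer].index(anchor_batch)
    optimizer.insert(idx + 1, new_batch)

# Proposal item 1: place a rule at a chosen position within a batch.
inject_rule_after(optimizer, "operator_optimizations",
                  "ConstantFolding", "MyCustomRule")
# Proposal item 2: place a whole batch at a chosen position.
inject_batch_after(optimizer, "operator_optimizations",
                   ("my_batch", ["MyBatchRule"]))
```

The point of the proposal is exactly this kind of positional control, which the existing extension points (which only append rules) do not offer.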
[jira] [Updated] (SPARK-26228) OOM issue encountered when computing Gramian matrix
[ https://issues.apache.org/jira/browse/SPARK-26228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chen Lin updated SPARK-26228:
-----------------------------
    Attachment: 1.jpeg
[jira] [Commented] (SPARK-26228) OOM issue encountered when computing Gramian matrix
[ https://issues.apache.org/jira/browse/SPARK-26228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706706#comment-16706706 ]

Chen Lin commented on SPARK-26228:
----------------------------------
Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
	at java.util.Arrays.copyOf(Arrays.java:3236)
	at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
	at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
	at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
	at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
	at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
	at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
	at org.apache.spark.SparkContext.clean(SparkContext.scala:2292)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2124)
	at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1092)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.fold(RDD.scala:1086)
	at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1131)
	at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeGramianMatrix(RowMatrix.scala:123)
	at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:345)
	at org.apache.spark.mllib.stat.correlation.PearsonCorrelation$.computeCorrelationMatrix(PearsonCorrelation.scala:49)
	at org.apache.spark.mllib.stat.correlation.Correlations$.corrMatrix(Correlation.scala:66)
	at org.apache.spark.mllib.stat.Statistics$.corr(Statistics.scala:57)
[jira] [Commented] (SPARK-26228) OOM issue encountered when computing Gramian matrix
[ https://issues.apache.org/jira/browse/SPARK-26228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706701#comment-16706701 ]

Chen Lin commented on SPARK-26228:
----------------------------------
[~shahid] I have uploaded a screenshot of the log. I suspect there are extra costs when writing a 16000*16000*8-byte array.
[jira] [Updated] (SPARK-26228) OOM issue encountered when computing Gramian matrix
[ https://issues.apache.org/jira/browse/SPARK-26228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Lin updated SPARK-26228: - Description: {quote}/** * Computes the Gramian matrix `A^T A`. * * @note This cannot be computed on matrices with more than 65535 columns. */ {quote} As the above annotation of computeGramianMatrix in RowMatrix.scala said, it supports computing on matrices with no more than 65535 columns. However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) when computing on matrices with 16000 columns. The root casue seems that the TreeAggregate writes a very long buffer array (16000*16000*8) which exceeds jvm limit(2^31 - 1). Does RowMatrix really supports computing on matrices with no more than 65535 columns? I doubt that computeGramianMatrix has a very serious performance issue. Do anyone has done some performance expriments before? was: {quote}/** * Computes the Gramian matrix `A^T A`. * * @note This cannot be computed on matrices with more than 65535 columns. */ {quote} As the above annotation of computeGramianMatrix in RowMatrix.scala said, it supports computing on matrices with no more than 65535 columns. However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) when computing on matrices with 16000 columns. The root casue seems that the TreeAggregate writes a very long buffer array (16000*16000*8) which exceeds jvm limit(2^31 - 1). Does RowMatrix really supports computing on matrices with no more than 65535 columns? I doubt that computeGramianMatrix has a very serious performance issue. Do anyone has done some performance expriments before? > OOM issue encountered when computing Gramian matrix > > > Key: SPARK-26228 > URL: https://issues.apache.org/jira/browse/SPARK-26228 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.3.0 >Reporter: Chen Lin >Priority: Major > Attachments: 1.jpeg > > > {quote}/** > * Computes the Gramian matrix `A^T A`. 
> * > * @note This cannot be computed on matrices with more than 65535 columns. > */ > {quote} > As the above annotation of computeGramianMatrix in RowMatrix.scala said, it > supports computing on matrices with no more than 65535 columns. > However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) > when computing on matrices with 16000 columns. > The root casue seems that the TreeAggregate writes a very long buffer array > (16000*16000*8) which exceeds jvm limit(2^31 - 1). > Does RowMatrix really supports computing on matrices with no more than 65535 > columns? > I doubt that computeGramianMatrix has a very serious performance issue. > Do anyone has done some performance expriments before? > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26117) use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception
[ https://issues.apache.org/jira/browse/SPARK-26117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706691#comment-16706691 ] Apache Spark commented on SPARK-26117: -- User 'heary-cao' has created a pull request for this issue: https://github.com/apache/spark/pull/23190 > use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception > -- > > Key: SPARK-26117 > URL: https://issues.apache.org/jira/browse/SPARK-26117 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.5.0 >Reporter: caoxuewen >Assignee: caoxuewen >Priority: Major > Fix For: 3.0.0 > > > PR #20014 introduced SparkOutOfMemoryError to avoid killing the > entire executor when an OutOfMemoryError is thrown. > So, when memory acquired through MemoryConsumer.allocatePage cannot be > allocated, throw SparkOutOfMemoryError instead of OutOfMemoryError. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
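The rationale behind SPARK-26117 can be sketched in plain Scala (the names below are hypothetical stand-ins, not Spark's actual classes): raising a dedicated, task-scoped allocation error lets task code handle a memory shortfall gracefully, whereas a raw java.lang.OutOfMemoryError is typically treated by uncaught-error handlers as fatal to the whole executor process.

```scala
// Hypothetical stand-in for SparkOutOfMemoryError: an allocation failure
// that task code can catch and handle, instead of a raw
// java.lang.OutOfMemoryError that process-wide handlers treat as fatal.
class TaskOutOfMemoryError(msg: String) extends RuntimeException(msg)

// Sketch of a MemoryConsumer-style allocation path: on shortfall, raise
// the task-scoped error rather than java.lang.OutOfMemoryError.
def allocatePage(requested: Long, granted: Long): Array[Byte] = {
  if (granted < requested)
    throw new TaskOutOfMemoryError(
      s"Unable to acquire $requested bytes, got $granted")
  new Array[Byte](requested.toInt)
}
```

In Spark itself the corresponding type is SparkOutOfMemoryError, thrown from the MemoryConsumer.allocatePage path described above.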
[jira] [Updated] (SPARK-26198) Metadata serialize null values throw NPE
[ https://issues.apache.org/jira/browse/SPARK-26198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-26198: -- Fix Version/s: 2.4.1 2.3.3 > Metadata serialize null values throw NPE > > > Key: SPARK-26198 > URL: https://issues.apache.org/jira/browse/SPARK-26198 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Minor > Fix For: 2.3.3, 2.4.1, 3.0.0 > > > How to reproduce this issue: > {code} > scala> val meta = new > org.apache.spark.sql.types.MetadataBuilder().putNull("key").build().json > java.lang.NullPointerException > at > org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$toJsonValue(Metadata.scala:196) > at org.apache.spark.sql.types.Metadata$$anonfun$1.apply(Metadata.scala:180) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
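The stack trace above points at Metadata's JSON conversion hitting a null value. A minimal plain-Scala sketch of the null-safe shape such a converter needs (illustrative only, not the actual Metadata.toJsonValue implementation): the serializer must handle null before matching on value types, since calling methods on a null match scrutinee is what produces the NPE.

```scala
// Illustrative null-safe JSON value rendering (not Spark's Metadata code):
// the essential fix is the explicit null case emitting a JSON null.
def toJsonValue(v: Any): String = v match {
  case null                => "null"             // the missing case
  case s: String           => "\"" + s + "\""
  case n: java.lang.Number => n.toString
  case b: Boolean          => b.toString
  case other               => "\"" + other.toString + "\""
}
```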
[jira] [Comment Edited] (SPARK-26228) OOM issue encountered when computing Gramian matrix
[ https://issues.apache.org/jira/browse/SPARK-26228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706667#comment-16706667 ] shahid edited comment on SPARK-26228 at 12/3/18 5:25 AM: - Hi [~hibayesian], could you please share the full log of the error, if you have. Thanks (btw 16000*16000*8 < 2^31 -1 ) was (Author: shahid): Hi [~hibayesian], could you please share the full log of the error, if you have. Thanks > OOM issue encountered when computing Gramian matrix > > > Key: SPARK-26228 > URL: https://issues.apache.org/jira/browse/SPARK-26228 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.3.0 >Reporter: Chen Lin >Priority: Major > > {quote}/** > * Computes the Gramian matrix `A^T A`. > * > * @note This cannot be computed on matrices with more than 65535 columns. > */ > {quote} > As the above annotation of computeGramianMatrix in RowMatrix.scala said, it > supports computing on matrices with no more than 65535 columns. > However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) > when computing on matrices with 16000 columns. > The root casue seems that the TreeAggregate writes a very long buffer array > (16000*16000*8) which exceeds jvm limit(2^31 - 1). > Does RowMatrix really supports computing on matrices with no more than 65535 > columns? > I doubt that computeGramianMatrix has a very serious performance issue. > Do anyone has done some performance expriments before? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
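shahid's parenthetical checks out: a fully dense 16000x16000 Gramian is 256,000,000 doubles (2,048,000,000 bytes), below the 2^31 - 1 ceiling whether counted in elements or bytes, and a packed upper-triangular buffer is roughly half that. A quick plain-Scala check (helper names are illustrative, not Spark code):

```scala
// Illustrative sizing helpers for an n x n Gramian buffer.
def denseGramianDoubles(n: Long): Long = n * n                 // dense entries
def denseGramianBytes(n: Long): Long   = n * n * 8L            // 8 bytes/double
def packedGramianBytes(n: Long): Long  = n * (n + 1) / 2 * 8L  // packed upper triangle

val n = 16000L
println(denseGramianDoubles(n))  // 256000000 elements: under Int.MaxValue
println(denseGramianBytes(n))    // 2048000000 bytes: still under 2^31 - 1
println(packedGramianBytes(n))   // 1024064000 bytes for the packed form
```

So the reported "Requested array size exceeds VM limit" likely originates somewhere other than a single 16000x16000 double array, e.g. a serialization buffer; the attached log would be needed to confirm.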
[jira] [Commented] (SPARK-26228) OOM issue encountered when computing Gramian matrix
[ https://issues.apache.org/jira/browse/SPARK-26228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706667#comment-16706667 ] shahid commented on SPARK-26228: Hi [~hibayesian], could you please share the full log of the error, if you have. Thanks > OOM issue encountered when computing Gramian matrix > > > Key: SPARK-26228 > URL: https://issues.apache.org/jira/browse/SPARK-26228 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.3.0 >Reporter: Chen Lin >Priority: Major > > {quote}/** > * Computes the Gramian matrix `A^T A`. > * > * @note This cannot be computed on matrices with more than 65535 columns. > */ > {quote} > As the above annotation of computeGramianMatrix in RowMatrix.scala said, it > supports computing on matrices with no more than 65535 columns. > However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) > when computing on matrices with 16000 columns. > The root casue seems that the TreeAggregate writes a very long buffer array > (16000*16000*8) which exceeds jvm limit(2^31 - 1). > Does RowMatrix really supports computing on matrices with no more than 65535 > columns? > I doubt that computeGramianMatrix has a very serious performance issue. > Do anyone has done some performance expriments before? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving
[ https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-26247: - Description: This ticket tracks an SPIP to improve model load time and model serving interfaces for online serving of Spark MLlib models. The SPIP is here [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub] The improvement opportunity exists in all versions of spark. We developed our set of changes wrt version 2.1.0 and can port them forward to other versions (e.g., we have ported them forward to 2.3.2). was: This ticket tracks an SPIP to improve model load time and model serving interfaces for online serving of Spark MLlib models. The SPIP is here [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub] The improvement opportunity exists in all versions of spark. We developed our set of changes wrt version 2.1.0 and can port them forward to other versions (e.g., wehave ported them forward to 2.3.2). > SPIP - ML Model Extension for no-Spark MLLib Online Serving > --- > > Key: SPARK-26247 > URL: https://issues.apache.org/jira/browse/SPARK-26247 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Anne Holler >Priority: Major > Labels: SPIP > > This ticket tracks an SPIP to improve model load time and model serving > interfaces for online serving of Spark MLlib models. The SPIP is here > [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub] > > The improvement opportunity exists in all versions of spark. We developed > our set of changes wrt version 2.1.0 and can port them forward to other > versions (e.g., we have ported them forward to 2.3.2). 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving
[ https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-26247: - Target Version/s: 3.0.0 (was: 2.1.0) > SPIP - ML Model Extension for no-Spark MLLib Online Serving > --- > > Key: SPARK-26247 > URL: https://issues.apache.org/jira/browse/SPARK-26247 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Anne Holler >Priority: Major > Labels: SPIP > > This ticket tracks an SPIP to improve model load time and model serving > interfaces for online serving of Spark MLlib models. The SPIP is here > [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub] > > The improvement opportunity exists in all versions of Spark. We developed > our set of changes wrt version 2.1.0 and can port them forward to other > versions (e.g., we have ported them forward to 2.3.2). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving
[ https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-26247: - Fix Version/s: (was: 2.1.0) > SPIP - ML Model Extension for no-Spark MLLib Online Serving > --- > > Key: SPARK-26247 > URL: https://issues.apache.org/jira/browse/SPARK-26247 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Anne Holler >Priority: Major > Labels: SPIP > > This ticket tracks an SPIP to improve model load time and model serving > interfaces for online serving of Spark MLlib models. The SPIP is here > [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub] > > The improvement opportunity exists in all versions of Spark. We developed > our set of changes wrt version 2.1.0 and can port them forward to other > versions (e.g., we have ported them forward to 2.3.2). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26248) Infer date type from CSV
[ https://issues.apache.org/jira/browse/SPARK-26248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706494#comment-16706494 ] Apache Spark commented on SPARK-26248: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/23202 > Infer date type from CSV > > > Key: SPARK-26248 > URL: https://issues.apache.org/jira/browse/SPARK-26248 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > Currently, DateType cannot be inferred from CSV. To parse CSV string, you > have to specify schema explicitly if CSV input contains dates. This ticket > aims to extend CSVInferSchema to support such inferring. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26248) Infer date type from CSV
[ https://issues.apache.org/jira/browse/SPARK-26248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-26248: --- Summary: Infer date type from CSV (was: Infer date type from JSON) > Infer date type from CSV > > > Key: SPARK-26248 > URL: https://issues.apache.org/jira/browse/SPARK-26248 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > Currently, DateType cannot be inferred from CSV. To parse CSV string, you > have to specify schema explicitly if CSV input contains dates. This ticket > aims to extend CSVInferSchema to support such inferring. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26248) Infer date type from CSV
[ https://issues.apache.org/jira/browse/SPARK-26248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26248: Assignee: (was: Apache Spark) > Infer date type from CSV > > > Key: SPARK-26248 > URL: https://issues.apache.org/jira/browse/SPARK-26248 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > Currently, DateType cannot be inferred from CSV. To parse CSV string, you > have to specify schema explicitly if CSV input contains dates. This ticket > aims to extend CSVInferSchema to support such inferring. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26248) Infer date type from CSV
[ https://issues.apache.org/jira/browse/SPARK-26248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26248: Assignee: Apache Spark > Infer date type from CSV > > > Key: SPARK-26248 > URL: https://issues.apache.org/jira/browse/SPARK-26248 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > Currently, DateType cannot be inferred from CSV. To parse CSV string, you > have to specify schema explicitly if CSV input contains dates. This ticket > aims to extend CSVInferSchema to support such inferring. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26248) Infer date type from JSON
Maxim Gekk created SPARK-26248: -- Summary: Infer date type from JSON Key: SPARK-26248 URL: https://issues.apache.org/jira/browse/SPARK-26248 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk Currently, DateType cannot be inferred from CSV. To parse CSV string, you have to specify schema explicitly if CSV input contains dates. This ticket aims to extend CSVInferSchema to support such inferring. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
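Until such inference exists, dates in CSV require an explicit schema; the inference step itself amounts to trying a date parse on sampled fields. A plain-Scala sketch of that probe (the helper name is hypothetical, not CSVInferSchema code):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical inference probe: does a CSV field parse as a date under a
// given pattern? An inference pass could promote a column to DateType
// only if every sampled value passes this check.
def looksLikeDate(field: String, pattern: String = "yyyy-MM-dd"): Boolean =
  Try(LocalDate.parse(field, DateTimeFormatter.ofPattern(pattern))).isSuccess
```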
[jira] [Created] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving
Anne Holler created SPARK-26247: --- Summary: SPIP - ML Model Extension for no-Spark MLLib Online Serving Key: SPARK-26247 URL: https://issues.apache.org/jira/browse/SPARK-26247 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 2.1.0 Reporter: Anne Holler Fix For: 2.1.0 This ticket tracks an SPIP to improve model load time and model serving interfaces for online serving of Spark MLlib models. The SPIP is here [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub] The improvement opportunity exists in all versions of Spark. We developed our set of changes wrt version 2.1.0 and can port them forward to other versions (e.g., we have ported them forward to 2.3.2). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26246) Infer date and timestamp types from JSON
[ https://issues.apache.org/jira/browse/SPARK-26246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706466#comment-16706466 ] Apache Spark commented on SPARK-26246: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/23201 > Infer date and timestamp types from JSON > > > Key: SPARK-26246 > URL: https://issues.apache.org/jira/browse/SPARK-26246 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > Currently, DateType and TimestampType cannot be inferred from JSON. To parse > JSON string, you have to specify schema explicitly if JSON input contains > dates or timestamps. This ticket aims to extend JsonInferSchema to support > such inferring. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26246) Infer date and timestamp types from JSON
[ https://issues.apache.org/jira/browse/SPARK-26246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26246: Assignee: Apache Spark > Infer date and timestamp types from JSON > > > Key: SPARK-26246 > URL: https://issues.apache.org/jira/browse/SPARK-26246 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > Currently, DateType and TimestampType cannot be inferred from JSON. To parse > JSON string, you have to specify schema explicitly if JSON input contains > dates or timestamps. This ticket aims to extend JsonInferSchema to support > such inferring. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26246) Infer date and timestamp types from JSON
[ https://issues.apache.org/jira/browse/SPARK-26246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26246: Assignee: (was: Apache Spark) > Infer date and timestamp types from JSON > > > Key: SPARK-26246 > URL: https://issues.apache.org/jira/browse/SPARK-26246 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > Currently, DateType and TimestampType cannot be inferred from JSON. To parse > JSON string, you have to specify schema explicitly if JSON input contains > dates or timestamps. This ticket aims to extend JsonInferSchema to support > such inferring. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26246) Infer date and timestamp types from JSON
Maxim Gekk created SPARK-26246: -- Summary: Infer date and timestamp types from JSON Key: SPARK-26246 URL: https://issues.apache.org/jira/browse/SPARK-26246 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk Currently, DateType and TimestampType cannot be inferred from JSON. To parse JSON string, you have to specify schema explicitly if JSON input contains dates or timestamps. This ticket aims to extend JsonInferSchema to support such inferring. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
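As with dates, inferring TimestampType comes down to testing whether sampled string fields parse under the session's timestamp pattern. A plain-Scala sketch of that check (hypothetical helper, not JsonInferSchema code):

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical probe: a JSON string field would be promoted to
// TimestampType only if it parses under the timestamp pattern.
val tsFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

def looksLikeTimestamp(field: String): Boolean =
  Try(LocalDateTime.parse(field, tsFormat)).isSuccess
```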
[jira] [Issue Comment Deleted] (SPARK-26139) Support passing shuffle metrics to exchange operator
[ https://issues.apache.org/jira/browse/SPARK-26139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-26139: Comment: was deleted (was: User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/23128) > Support passing shuffle metrics to exchange operator > > > Key: SPARK-26139 > URL: https://issues.apache.org/jira/browse/SPARK-26139 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > > Due to the way Spark's architected (SQL is defined on top of the RDD API), > there are two separate metrics system used in core vs SQL. Ideally, we'd want > to be able to get the shuffle metrics for each of the exchange operator > independently, e.g. blocks read, number of records. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26165) Date and Timestamp column expression is getting converted to string in less than/greater than filter query even though valid date/timestamp string literal is used in th
[ https://issues.apache.org/jira/browse/SPARK-26165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26165. --- Resolution: Won't Fix > Date and Timestamp column expression is getting converted to string in less > than/greater than filter query even though valid date/timestamp string > literal is used in the right side filter expression > -- > > Key: SPARK-26165 > URL: https://issues.apache.org/jira/browse/SPARK-26165 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.3.2, 2.4.0 >Reporter: Sujith >Priority: Major > Attachments: image-2018-11-26-13-00-36-896.png, > image-2018-11-26-13-01-28-299.png, timestamp_filter_perf.PNG > > > Date and Timestamp column is getting converted to string in less than/greater > than filter query even though date strings that contains a time, like > '2018-03-18" 12:39:40' to date. Besides it's not possible to cast a string > like '2018-03-18 12:39:40' to a timestamp. > > scala> spark.sql("""explain extended SELECT username FROM orders WHERE > order_creation_date > '2017-02-26 13:45:12'""").show(false); > +--- > |== Parsed Logical Plan == > 'Project ['username] > +- 'Filter ('order_creation_date > 2017-02-26 13:45:12) > +- 'UnresolvedRelation `orders` > == Analyzed Logical Plan == > username: string > Project [username#59] > +- Filter (cast(order_creation_date#60 as string) > 2017-02-26 13:45:12) > +- SubqueryAlias orders > +- HiveTableRelation `default`.`orders`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [username#59, > order_creation_date#60, amount#61] > == Optimized Logical Plan == > Project [username#59] > +- Filter (isnotnull(order_creation_date#60) && (cast(order_creation_date#60 > as string) > 2017-02-26 13:45:12)) > +- HiveTableRelation `default`.`orders`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [username#59, > order_creation_date#60, amount#61] > == Physical Plan == > *(1) Project [username#59] > +- *(1) Filter 
(isnotnull(order_creation_date#60) && > (cast(order_creation_date#60 as string) > 2017-02-26 13:45:12)) > +- HiveTableScan [order_creation_date#60, username#59], HiveTableRelation > `default`.`orders`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, > [username#59, order_creation > + > - -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
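A side note on why the cast-to-string plan above can still return correct rows: fixed-width "yyyy-MM-dd HH:mm:ss" strings order lexicographically the same way as the instants they denote, so the comparison itself is not wrong; the concern is that casting the column (rather than the literal) blocks typed-column optimizations such as filter pushdown. The ordering claim is easy to check in plain Scala:

```scala
// Lexicographic order on fixed-width "yyyy-MM-dd HH:mm:ss" strings matches
// chronological order, so `cast(col as string) > '2017-02-26 13:45:12'`
// filters correctly even though it defeats typed-column optimizations.
val cutoff  = "2017-02-26 13:45:12"
val earlier = "2016-11-01 09:00:00"
val later   = "2018-03-18 12:39:40"
println(earlier > cutoff) // false
println(later > cutoff)   // true
```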
[jira] [Commented] (SPARK-26193) Implement shuffle write metrics in SQL
[ https://issues.apache.org/jira/browse/SPARK-26193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706368#comment-16706368 ] Reynold Xin commented on SPARK-26193: - Can we simplify it and add those metrics only to the same exchange operator as the read side? > Implement shuffle write metrics in SQL > -- > > Key: SPARK-26193 > URL: https://issues.apache.org/jira/browse/SPARK-26193 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Yuanjian Li >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26193) Implement shuffle write metrics in SQL
[ https://issues.apache.org/jira/browse/SPARK-26193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706359#comment-16706359 ] Yuanjian Li commented on SPARK-26193: - cc [~smilegator] [~cloud_fan] and [~rxin], cause the writer side of shuffle metrics need more changes than reader side, add a sketch design and demo doc in this jira, I'll give a PR soon after you think the implement describe in doc is ok. Thanks :) > Implement shuffle write metrics in SQL > -- > > Key: SPARK-26193 > URL: https://issues.apache.org/jira/browse/SPARK-26193 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Yuanjian Li >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26198) Metadata serialize null values throw NPE
[ https://issues.apache.org/jira/browse/SPARK-26198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26198. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23164 [https://github.com/apache/spark/pull/23164] > Metadata serialize null values throw NPE > > > Key: SPARK-26198 > URL: https://issues.apache.org/jira/browse/SPARK-26198 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Minor > Fix For: 3.0.0 > > > How to reproduce this issue: > {code} > scala> val meta = new > org.apache.spark.sql.types.MetadataBuilder().putNull("key").build().json > java.lang.NullPointerException > at > org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$toJsonValue(Metadata.scala:196) > at org.apache.spark.sql.types.Metadata$$anonfun$1.apply(Metadata.scala:180) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26198) Metadata serialize null values throw NPE
[ https://issues.apache.org/jira/browse/SPARK-26198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26198: - Assignee: Yuming Wang > Metadata serialize null values throw NPE > > > Key: SPARK-26198 > URL: https://issues.apache.org/jira/browse/SPARK-26198 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Minor > Fix For: 3.0.0 > > > How to reproduce this issue: > {code} > scala> val meta = new > org.apache.spark.sql.types.MetadataBuilder().putNull("key").build().json > java.lang.NullPointerException > at > org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$toJsonValue(Metadata.scala:196) > at org.apache.spark.sql.types.Metadata$$anonfun$1.apply(Metadata.scala:180) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26034) Break large mllib/tests.py files into smaller files
[ https://issues.apache.org/jira/browse/SPARK-26034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706176#comment-16706176 ] Apache Spark commented on SPARK-26034: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/23200 > Break large mllib/tests.py files into smaller files > --- > > Key: SPARK-26034 > URL: https://issues.apache.org/jira/browse/SPARK-26034 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Bryan Cutler >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26033) Break large ml/tests.py files into smaller files
[ https://issues.apache.org/jira/browse/SPARK-26033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706174#comment-16706174 ]

Apache Spark commented on SPARK-26033:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/23200

> Break large ml/tests.py files into smaller files
> ------------------------------------------------
>
>                 Key: SPARK-26033
>                 URL: https://issues.apache.org/jira/browse/SPARK-26033
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, PySpark
>    Affects Versions: 2.4.0
>            Reporter: Hyukjin Kwon
>            Assignee: Bryan Cutler
>            Priority: Major
>             Fix For: 3.0.0
>
[jira] [Resolved] (SPARK-26242) Leading slash breaks proxying
[ https://issues.apache.org/jira/browse/SPARK-26242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marco Gaido resolved SPARK-26242.
---------------------------------
    Resolution: Not A Problem

> Leading slash breaks proxying
> -----------------------------
>
>                 Key: SPARK-26242
>                 URL: https://issues.apache.org/jira/browse/SPARK-26242
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 2.4.0
>            Reporter: Ryan Lovett
>            Priority: Minor
>
> The WebUI prefixes "/" at the beginning of each link path (e.g. /jobs), which breaks navigation when attempting to proxy the app at another URL. In my case, a pyspark user creates a SparkContext within a JupyterHub-hosted notebook and attempts to proxy it with nbserverproxy off of address.of.jupyter.hub/user/proxy/4040/. Since the WebUI sets the URLs of its pages to begin with "/", the browser sends the user to address.of.jupyter.hub/jobs, address.of.jupyter.hub/stages, etc.
>
> Similar:
> [https://github.com/mesosphere/spark/commit/ada99f1b3801e81db2e367f219377e93f5d32655|https://github.com/apache/spark/pull/11369]
[jira] [Commented] (SPARK-26242) Leading slash breaks proxying
[ https://issues.apache.org/jira/browse/SPARK-26242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706157#comment-16706157 ]

Marco Gaido commented on SPARK-26242:
-------------------------------------

Let me close this. Please reopen only if you find issues. In the future, if you have questions, please send them to the mailing lists and open a JIRA only if you find incorrect behavior. Thanks.

> Leading slash breaks proxying
> -----------------------------
>
>                 Key: SPARK-26242
>                 URL: https://issues.apache.org/jira/browse/SPARK-26242
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 2.4.0
>            Reporter: Ryan Lovett
>            Priority: Minor
>
> The WebUI prefixes "/" at the beginning of each link path (e.g. /jobs), which breaks navigation when attempting to proxy the app at another URL. In my case, a pyspark user creates a SparkContext within a JupyterHub-hosted notebook and attempts to proxy it with nbserverproxy off of address.of.jupyter.hub/user/proxy/4040/. Since the WebUI sets the URLs of its pages to begin with "/", the browser sends the user to address.of.jupyter.hub/jobs, address.of.jupyter.hub/stages, etc.
>
> Similar:
> [https://github.com/mesosphere/spark/commit/ada99f1b3801e81db2e367f219377e93f5d32655|https://github.com/apache/spark/pull/11369]
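The reported behavior can be reproduced without Spark: when a browser resolves an absolute path against a proxied base URL, the path resets to the host root and escapes the proxy prefix, while a relative path stays underneath it. A small sketch using Python's urllib.parse (illustrative only; Spark's actual handling involves the Web UI and settings such as spark.ui.proxyBase):

```python
from urllib.parse import urljoin

# Base URL at which the proxied Spark UI is served behind JupyterHub.
base = "https://address.of.jupyter.hub/user/proxy/4040/"

# An absolute path ("/jobs") discards the proxy prefix entirely:
print(urljoin(base, "/jobs"))  # https://address.of.jupyter.hub/jobs

# A relative path ("jobs") stays under the proxy prefix:
print(urljoin(base, "jobs"))   # https://address.of.jupyter.hub/user/proxy/4040/jobs
```

This is why the issue was closed as "Not A Problem": it is standard URL-resolution behavior, worked around on the proxy side rather than in Spark.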
[jira] [Commented] (SPARK-23899) Built-in SQL Function Improvement
[ https://issues.apache.org/jira/browse/SPARK-23899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706156#comment-16706156 ]

Arseniy Tashoyan commented on SPARK-23899:
------------------------------------------

What do you think about this one: SPARK-23693?

> Built-in SQL Function Improvement
> ---------------------------------
>
>                 Key: SPARK-23899
>                 URL: https://issues.apache.org/jira/browse/SPARK-23899
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Xiao Li
>            Priority: Major
>             Fix For: 2.4.0
>
> This umbrella JIRA is to improve compatibility with other data processing systems, including Hive, Teradata, Presto, Postgres, MySQL, DB2, Oracle, and MS SQL Server.
[jira] [Assigned] (SPARK-26080) Unable to run worker.py on Windows
[ https://issues.apache.org/jira/browse/SPARK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-26080:
------------------------------------
    Assignee: Hyukjin Kwon

> Unable to run worker.py on Windows
> ----------------------------------
>
>                 Key: SPARK-26080
>                 URL: https://issues.apache.org/jira/browse/SPARK-26080
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.0
>         Environment: Windows 10 Education 64 bit
>            Reporter: Hayden Jeune
>            Assignee: Hyukjin Kwon
>            Priority: Blocker
>
> Use of the resource module in Python means worker.py cannot run on a Windows system. This package is only available in Unix-based environments.
> [https://github.com/apache/spark/blob/9a5fda60e532dc7203d21d5fbe385cd561906ccb/python/pyspark/worker.py#L25]
> {code:python}
> textFile = sc.textFile("README.md")
> textFile.first()
> {code}
> When the above commands are run, I receive the error 'worker failed to connect back', and I can see an exception in the console coming from worker.py saying 'ModuleNotFoundError: No module named resource'.
> I do not really know enough about what I'm doing to fix this myself. Apologies if there's something simple I'm missing here.
[jira] [Resolved] (SPARK-26080) Unable to run worker.py on Windows
[ https://issues.apache.org/jira/browse/SPARK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-26080.
----------------------------------
    Resolution: Fixed
 Fix Version/s: 2.4.1
                3.0.0

Issue resolved by pull request 23055
[https://github.com/apache/spark/pull/23055]

> Unable to run worker.py on Windows
> ----------------------------------
>
>                 Key: SPARK-26080
>                 URL: https://issues.apache.org/jira/browse/SPARK-26080
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.0
>         Environment: Windows 10 Education 64 bit
>            Reporter: Hayden Jeune
>            Assignee: Hyukjin Kwon
>            Priority: Blocker
>             Fix For: 3.0.0, 2.4.1
>
> Use of the resource module in Python means worker.py cannot run on a Windows system. This package is only available in Unix-based environments.
> [https://github.com/apache/spark/blob/9a5fda60e532dc7203d21d5fbe385cd561906ccb/python/pyspark/worker.py#L25]
> {code:python}
> textFile = sc.textFile("README.md")
> textFile.first()
> {code}
> When the above commands are run, I receive the error 'worker failed to connect back', and I can see an exception in the console coming from worker.py saying 'ModuleNotFoundError: No module named resource'.
> I do not really know enough about what I'm doing to fix this myself. Apologies if there's something simple I'm missing here.
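The usual remedy for a Unix-only dependency like `resource` is a guarded import: attempt the import and degrade gracefully when it is missing. A sketch in that spirit (the flag and helper names are illustrative, not Spark's exact code):

```python
# Guarded import of the Unix-only `resource` module: on Windows the
# import raises ImportError, so memory-limit handling is simply skipped.
try:
    import resource
    has_resource = True
except ImportError:
    resource = None
    has_resource = False

def get_memory_limit():
    """Return the soft address-space limit, or None where `resource`
    is unavailable (e.g. on Windows)."""
    if not has_resource:
        return None
    soft, _hard = resource.getrlimit(resource.RLIMIT_AS)
    return soft
```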
[jira] [Resolved] (SPARK-26208) Empty dataframe does not roundtrip for csv with header
[ https://issues.apache.org/jira/browse/SPARK-26208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-26208.
----------------------------------
    Resolution: Fixed
      Assignee: Koert Kuipers
 Fix Version/s: 3.0.0

Fixed in https://github.com/apache/spark/pull/23173

> Empty dataframe does not roundtrip for csv with header
> ------------------------------------------------------
>
>                 Key: SPARK-26208
>                 URL: https://issues.apache.org/jira/browse/SPARK-26208
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>         Environment: master branch, commit 034ae305c33b1990b3c1a284044002874c343b4d, date: Sun Nov 18 16:02:15 2018 +0800
>            Reporter: koert kuipers
>            Assignee: Koert Kuipers
>            Priority: Minor
>             Fix For: 3.0.0
>
> When we write an empty part file for CSV with header=true, we fail to write the header, so the result cannot be read back in. With header=true, a part file with zero rows should still have a header.
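The expected behavior can be illustrated with Python's standard csv module: the header line is written unconditionally, so even a zero-row table round-trips. This is an analogy to the reported fix, not Spark's implementation:

```python
import csv
import io

def write_csv(rows, fieldnames, header=True):
    """Write rows to CSV text. With header=True the header line is
    emitted even when `rows` is empty, so an empty table round-trips."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    if header:
        writer.writeheader()   # written unconditionally, even for 0 rows
    writer.writerows(rows)
    return buf.getvalue()

def read_csv(text):
    """Read CSV text back into (fieldnames, rows)."""
    reader = csv.DictReader(io.StringIO(text))
    return reader.fieldnames, list(reader)

# An empty dataframe-equivalent survives the round trip: the schema
# (column names) is recovered from the header alone.
text = write_csv([], ["id", "name"])
fields, rows = read_csv(text)
print(fields, rows)  # ['id', 'name'] []
```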
[jira] [Commented] (SPARK-26245) Add Float literal
[ https://issues.apache.org/jira/browse/SPARK-26245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706146#comment-16706146 ]

Apache Spark commented on SPARK-26245:
--------------------------------------

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/23199

> Add Float literal
> -----------------
>
>                 Key: SPARK-26245
>                 URL: https://issues.apache.org/jira/browse/SPARK-26245
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>
[jira] [Assigned] (SPARK-26245) Add Float literal
[ https://issues.apache.org/jira/browse/SPARK-26245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26245:
------------------------------------
    Assignee: Apache Spark

> Add Float literal
> -----------------
>
>                 Key: SPARK-26245
>                 URL: https://issues.apache.org/jira/browse/SPARK-26245
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Assignee: Apache Spark
>            Priority: Major
>
[jira] [Assigned] (SPARK-26245) Add Float literal
[ https://issues.apache.org/jira/browse/SPARK-26245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26245:
------------------------------------
    Assignee: (was: Apache Spark)

> Add Float literal
> -----------------
>
>                 Key: SPARK-26245
>                 URL: https://issues.apache.org/jira/browse/SPARK-26245
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>
[jira] [Created] (SPARK-26245) Add Float literal
Yuming Wang created SPARK-26245:
-----------------------------------

             Summary: Add Float literal
                 Key: SPARK-26245
                 URL: https://issues.apache.org/jira/browse/SPARK-26245
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Yuming Wang