[jira] [Updated] (SPARK-25450) PushProjectThroughUnion rule uses the same exprId for project expressions in each Union child, causing mistakes in constant propagation
[ https://issues.apache.org/jira/browse/SPARK-25450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25450: Description: The problem was caused by the PushProjectThroughUnion rule, which, when creating a new Project for each child of Union, uses the same exprId for expressions in the same position. This is wrong because, for each child of Union, the expressions are all independent, and it can lead to a wrong result if other rules like FoldablePropagation kick in, treating two different expressions as the same. > PushProjectThroughUnion rule uses the same exprId for project expressions in > each Union child, causing mistakes in constant propagation > --- > > Key: SPARK-25450 > URL: https://issues.apache.org/jira/browse/SPARK-25450 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maryann Xue >Priority: Major > > The problem was caused by the PushProjectThroughUnion rule, which, when > creating a new Project for each child of Union, uses the same exprId for > expressions in the same position. This is wrong because, for each child of > Union, the expressions are all independent, and it can lead to a wrong result > if other rules like FoldablePropagation kick in, treating two different > expressions as the same. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
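To see why sharing exprIds is wrong, here is a toy model in Python (not Spark's Scala implementation; the function names and the miniature constant-propagation pass are illustrative only): a pass that keys folded constants by exprId will substitute one Union child's constant into the other whenever the two children share an ID.

```python
# Toy model of the bug: expressions are identified by an exprId, and a
# constant-propagation pass assumes one exprId == one expression.
# (Hypothetical names; this is not Spark's actual implementation.)

def push_project_through_union(project_names, union_children, fresh_ids):
    """Create one projection per Union child. The fix described above:
    give each child's expressions fresh exprIds instead of reusing the
    same id for the same position in every child."""
    rewritten = []
    for child in union_children:
        exprs = [(next(fresh_ids), name, child[name]) for name in project_names]
        rewritten.append(exprs)
    return rewritten

def constant_propagation(rewritten):
    """Folds expressions to constants, keyed by exprId. If two children
    shared an exprId, the first child's constant silently replaces the
    second child's value -- the wrong result described in the report."""
    folded = {}
    for exprs in rewritten:
        for expr_id, _, value in exprs:
            folded.setdefault(expr_id, value)
    return [[(i, n, folded[i]) for i, n, _ in exprs] for exprs in rewritten]
```

With fresh ids each child keeps its own value; feeding `constant_propagation` two children that share id 0 reproduces the collapse.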
[jira] [Commented] (SPARK-25450) PushProjectThroughUnion rule uses the same exprId for project expressions in each Union child, causing mistakes in constant propagation
[ https://issues.apache.org/jira/browse/SPARK-25450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618505#comment-16618505 ] Xiao Li commented on SPARK-25450: - https://github.com/apache/spark/pull/22447 > PushProjectThroughUnion rule uses the same exprId for project expressions in > each Union child, causing mistakes in constant propagation > --- > > Key: SPARK-25450 > URL: https://issues.apache.org/jira/browse/SPARK-25450 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maryann Xue >Priority: Major >
[jira] [Comment Edited] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618495#comment-16618495 ] Xiangrui Meng edited comment on SPARK-25321 at 9/18/18 5:21 AM: [~WeichenXu123] Could you check whether MLeap is compatible with the tree Node breaking changes? This line is relevant: https://github.com/combust/mleap/blob/master/mleap-runtime/src/main/scala/ml/combust/mleap/bundle/ops/classification/DecisionTreeClassifierOp.scala If it is hard for MLeap to upgrade, we should revert the change in 2.4. cc: [~hollinwilkins] was (Author: mengxr): [~WeichenXu123] Could you check whether MLeap is compatible with the tree Node breaking changes? This line is relevant: https://github.com/combust/mleap/blob/master/mleap-runtime/src/main/scala/ml/combust/mleap/bundle/ops/classification/DecisionTreeClassifierOp.scala If it is hard for MLeap to upgrade, we should revert the change. cc: [~hollinwilkins] > ML, Graph 2.4 QA: API: New Scala APIs, docs > --- > > Key: SPARK-25321 > URL: https://issues.apache.org/jira/browse/SPARK-25321 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 2.4.0 >Reporter: Weichen Xu >Assignee: Yanbo Liang >Priority: Blocker > > Audit new public Scala APIs added to MLlib & GraphX. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please create JIRAs and link them to this issue. > For *user guide issues*, link the new JIRAs to the relevant user guide QA issue.
[jira] [Commented] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618495#comment-16618495 ] Xiangrui Meng commented on SPARK-25321: --- [~WeichenXu123] Could you check whether MLeap is compatible with the tree Node breaking changes? This line is relevant: https://github.com/combust/mleap/blob/master/mleap-runtime/src/main/scala/ml/combust/mleap/bundle/ops/classification/DecisionTreeClassifierOp.scala If it is hard for MLeap to upgrade, we should revert the change. cc: [~hollinwilkins] > ML, Graph 2.4 QA: API: New Scala APIs, docs > --- > > Key: SPARK-25321 > URL: https://issues.apache.org/jira/browse/SPARK-25321 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 2.4.0 >Reporter: Weichen Xu >Assignee: Yanbo Liang >Priority: Blocker > > Audit new public Scala APIs added to MLlib & GraphX. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please create JIRAs and link them to this issue. > For *user guide issues*, link the new JIRAs to the relevant user guide QA issue.
[jira] [Commented] (SPARK-25230) Upper behavior incorrect for string contains "ß"
[ https://issues.apache.org/jira/browse/SPARK-25230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618473#comment-16618473 ] Yuming Wang commented on SPARK-25230: - This may be a JDK bug: [https://bugs.openjdk.java.net/browse/JDK-8186073] > Upper behavior incorrect for string contains "ß" > > > Key: SPARK-25230 > URL: https://issues.apache.org/jira/browse/SPARK-25230 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Yuming Wang >Priority: Major > Attachments: MySQL.png, Oracle.png, Teradata.jpeg > > > How to reproduce: > {code:sql} > spark-sql> SELECT upper('Haßler'); > HASSLER > {code} > Mainstream databases return {{HAßLER}}. > !MySQL.png! > > This behavior may lead to data inconsistency: > {code:sql} > create temporary view SPARK_25230 as select * from values > ("Hassler"), > ("Haßler") > as EMPLOYEE(name); > select UPPER(name) from SPARK_25230 group by 1; > -- result > HASSLER{code}
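The root cause is visible outside Spark as well: the full Unicode case mapping turns "ß" into "SS" (one character becomes two), which is what JDK-8186073 discusses. A quick plain-Python check of the collision described above (Python's str.upper applies the same one-to-many mapping):

```python
# Full Unicode uppercasing maps "ß" to "SS" (one char becomes two),
# so two distinct names collide after UPPER() -- the inconsistency
# described above when grouping by UPPER(name).
names = ["Hassler", "Haßler"]
uppered = [n.upper() for n in names]
print(uppered)            # both become "HASSLER"
groups = set(uppered)     # GROUP BY UPPER(name) collapses to one group
```

Databases that return "HAßLER" instead use a simple (character-count-preserving) case mapping, which is why their GROUP BY keeps the two names distinct.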
[jira] [Resolved] (SPARK-25444) Refactor GenArrayData.genCodeToCreateArrayData() method
[ https://issues.apache.org/jira/browse/SPARK-25444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-25444. --- Resolution: Fixed Assignee: Kazuaki Ishizaki Fix Version/s: 2.5.0 Issue resolved by pull request 22439 https://github.com/apache/spark/pull/22439 > Refactor GenArrayData.genCodeToCreateArrayData() method > --- > > Key: SPARK-25444 > URL: https://issues.apache.org/jira/browse/SPARK-25444 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.5.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.5.0 > > > {{GenArrayData.genCodeToCreateArrayData()}} generated Java code to create a > temporary Java array to create {{ArrayData}}. It can be eliminated by using > {{ArrayData.createArrayData}}.
[jira] [Updated] (SPARK-24360) Support Hive 3.1 metastore
[ https://issues.apache.org/jira/browse/SPARK-24360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24360: -- Description: Hive 3.1.0 is released. This issue aims to support Hive Metastore 3.1. (was: Hive 3.0.0 is released. This issue aims to support Hive Metastore 3.0.) > Support Hive 3.1 metastore > -- > > Key: SPARK-24360 > URL: https://issues.apache.org/jira/browse/SPARK-24360 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > Hive 3.1.0 is released. This issue aims to support Hive Metastore 3.1.
[jira] [Updated] (SPARK-25450) PushProjectThroughUnion rule uses the same exprId for project expressions in each Union child, causing mistakes in constant propagation
[ https://issues.apache.org/jira/browse/SPARK-25450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated SPARK-25450: Issue Type: Bug (was: Improvement) > PushProjectThroughUnion rule uses the same exprId for project expressions in > each Union child, causing mistakes in constant propagation > --- > > Key: SPARK-25450 > URL: https://issues.apache.org/jira/browse/SPARK-25450 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maryann Xue >Priority: Major >
[jira] [Created] (SPARK-25450) PushProjectThroughUnion rule uses the same exprId for project expressions in each Union child, causing mistakes in constant propagation
Maryann Xue created SPARK-25450: --- Summary: PushProjectThroughUnion rule uses the same exprId for project expressions in each Union child, causing mistakes in constant propagation Key: SPARK-25450 URL: https://issues.apache.org/jira/browse/SPARK-25450 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Maryann Xue
[jira] [Resolved] (SPARK-25443) fix issues when building docs with release scripts in docker
[ https://issues.apache.org/jira/browse/SPARK-25443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25443. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22438 [https://github.com/apache/spark/pull/22438] > fix issues when building docs with release scripts in docker > > > Key: SPARK-25443 > URL: https://issues.apache.org/jira/browse/SPARK-25443 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.0 >
[jira] [Comment Edited] (SPARK-25409) Speed up Spark History at start if there are tens of thousands of applications.
[ https://issues.apache.org/jira/browse/SPARK-25409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618213#comment-16618213 ] Rong Tang edited comment on SPARK-25409 at 9/17/18 10:12 PM: - Pull request created for it. [https://github.com/apache/spark/pull/22444] was (Author: trjianjianjiao): Create a pull request for it. [https://github.com/apache/spark/pull/22444] > Speed up Spark History at start if there are tens of thousands of > applications. > --- > > Key: SPARK-25409 > URL: https://issues.apache.org/jira/browse/SPARK-25409 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Rong Tang >Priority: Major > Attachments: SPARK-25409.0001.patch > > > We have a Spark history server storing 7 days' applications; it usually has > 10K to 20K attempts. > We found that it can take hours at startup, loading/replaying the logs in the > event-logs folder; thus, newly finished applications can wait several > hours to be seen. So I made two improvements: > # As we run Spark on YARN, on-going applications' information can also > be seen via the resource manager, so I introduced a flag > spark.history.fs.load.incomplete to control whether to load logs for incomplete attempts > or not. > # Incremental loading of applications. As I said, we have more than 10K > applications stored, and it can take hours to load all of them the first time, > so I introduced a config spark.history.fs.appRefreshNum to say how many > applications to load each time; the server then gets a chance to check the latest > updates. > Here are the benchmarks I did. Our system has 1K incomplete applications (they > were not cleaned up for some reason; that is another issue I need to > investigate), and applications' log sizes can be gigabytes. > > Not loading incomplete attempts:
> | |Load Count|Load incomplete APPs|All attempts number|Time Cost|Increase > with more attempts| > |1 (current implementation)|All|Yes|13K|2 hours 14 minutes|Yes| > |2|All|No|13K|31 minutes|Yes| > > > Limiting how much to load each time: > > | |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Increase > with more attempts| > |1 (current implementation)|All|Yes|13K|2 hours 14 minutes|Yes| > |2|3000|Yes|13K|42 minutes, except the last 1.6K > (the last 1.6K attempts took an extremely long 2.5 hours)|No| > > > Limiting how many to load each time, and not loading incomplete jobs: > > | |Load Count|Load incomplete APPs|All attempts number|Worst > Cost|Avg|Increase with more attempts| > |1 (current implementation)|All|Yes|13K|2 hours 14 minutes| |Yes| > |2|3000|No|12K|17 minutes > |10 minutes > (41 minutes in total)|No| > > > | |Load Count|Load incomplete APPs|All attempts number|Worst > Cost|Avg|Increase with more attempts| > |1 (current implementation)|All|Yes|20K|1 hour 52 minutes| |Yes| > |2|3000|No|18.5K|20 minutes|18 minutes > (2 hours 18 minutes in total) > |No| > >
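The two proposed behaviors could be sketched as a selection step run on each scan of the event-log directory (a Python sketch, not the actual Spark History Server code; the tuple layout and helper name are hypothetical, and only the two config names come from the description above):

```python
def select_attempts(all_attempts, load_incomplete, batch_size):
    """Pick the next batch of attempts to replay.

    all_attempts: list of (app_id, complete: bool, already_loaded: bool)
    load_incomplete: spark.history.fs.load.incomplete -- when False,
        skip in-progress attempts (visible via the YARN RM anyway)
    batch_size: spark.history.fs.appRefreshNum -- cap per scan, so the
        server can publish newly finished apps between batches instead
        of blocking for hours on the initial full load
    """
    pending = [a for a in all_attempts
               if not a[2] and (load_incomplete or a[1])]
    return pending[:batch_size]
```

Each scan would replay the returned batch, mark those attempts as loaded, and re-run the selection, which is what makes the "Worst Cost" stop growing with the total attempt count in the tables above.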
[jira] [Commented] (SPARK-25409) Speed up Spark History at start if there are tens of thousands of applications.
[ https://issues.apache.org/jira/browse/SPARK-25409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618213#comment-16618213 ] Rong Tang commented on SPARK-25409: --- Created a pull request for it. [https://github.com/apache/spark/pull/22444] > Speed up Spark History at start if there are tens of thousands of > applications. > --- > > Key: SPARK-25409 > URL: https://issues.apache.org/jira/browse/SPARK-25409 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Rong Tang >Priority: Major > Attachments: SPARK-25409.0001.patch > > > We have a Spark history server storing 7 days' applications; it usually has > 10K to 20K attempts. > We found that it can take hours at startup, loading/replaying the logs in the > event-logs folder; thus, newly finished applications can wait several > hours to be seen. So I made two improvements: > # As we run Spark on YARN, on-going applications' information can also > be seen via the resource manager, so I introduced a flag > spark.history.fs.load.incomplete to control whether to load logs for incomplete attempts > or not. > # Incremental loading of applications. As I said, we have more than 10K > applications stored, and it can take hours to load all of them the first time, > so I introduced a config spark.history.fs.appRefreshNum to say how many > applications to load each time; the server then gets a chance to check the latest > updates. > Here are the benchmarks I did. Our system has 1K incomplete applications (they > were not cleaned up for some reason; that is another issue I need to > investigate), and applications' log sizes can be gigabytes. > > Not loading incomplete attempts: > | |Load Count|Load incomplete APPs|All attempts number|Time Cost|Increase > with more attempts| > |1 (current implementation)|All|Yes|13K|2 hours 14 minutes|Yes| > |2|All|No|13K|31 minutes|Yes| > > > Limiting how much to load each time:
> > | |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Increase > with more attempts| > |1 (current implementation)|All|Yes|13K|2 hours 14 minutes|Yes| > |2|3000|Yes|13K|42 minutes, except the last 1.6K > (the last 1.6K attempts took an extremely long 2.5 hours)|No| > > > Limiting how many to load each time, and not loading incomplete jobs: > > | |Load Count|Load incomplete APPs|All attempts number|Worst > Cost|Avg|Increase with more attempts| > |1 (current implementation)|All|Yes|13K|2 hours 14 minutes| |Yes| > |2|3000|No|12K|17 minutes > |10 minutes > (41 minutes in total)|No| > > > | |Load Count|Load incomplete APPs|All attempts number|Worst > Cost|Avg|Increase with more attempts| > |1 (current implementation)|All|Yes|20K|1 hour 52 minutes| |Yes| > |2|3000|No|18.5K|20 minutes|18 minutes > (2 hours 18 minutes in total) > |No| > >
[jira] [Created] (SPARK-25449) Don't send zero accumulators in heartbeats
Mukul Murthy created SPARK-25449: Summary: Don't send zero accumulators in heartbeats Key: SPARK-25449 URL: https://issues.apache.org/jira/browse/SPARK-25449 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 2.4.0 Reporter: Mukul Murthy Heartbeats sent from executors to the driver every 10 seconds contain metrics and are generally on the order of a few KBs. However, for large jobs with lots of tasks, heartbeats can be on the order of tens of MBs, causing tasks to die with heartbeat failures. We can mitigate this by not sending zero metrics to the driver.
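A minimal sketch of the proposed mitigation (a Python toy model; the payload shape and function name are hypothetical, not Spark's actual heartbeat classes):

```python
def prune_heartbeat(accum_updates):
    """Drop accumulator updates whose value is still zero before the
    heartbeat is serialized. The driver can treat a missing update the
    same as a zero-valued one, so omitting zeros only shrinks the
    message -- the point of the proposal above."""
    return {task_id: {name: v for name, v in metrics.items() if v != 0}
            for task_id, metrics in accum_updates.items()}
```

For a large stage where most per-task metrics are still zero, this trims the bulk of the payload while leaving every non-zero update intact.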
[jira] [Commented] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618118#comment-16618118 ] Joseph K. Bradley commented on SPARK-25321: --- [~WeichenXu123] Have you been able to look into reverting those changes or discussed with [~mengxr] about reverting them? Thanks! > ML, Graph 2.4 QA: API: New Scala APIs, docs > --- > > Key: SPARK-25321 > URL: https://issues.apache.org/jira/browse/SPARK-25321 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 2.4.0 >Reporter: Weichen Xu >Assignee: Yanbo Liang >Priority: Blocker > > Audit new public Scala APIs added to MLlib & GraphX. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please create JIRAs and link them to this issue. > For *user guide issues* link the new JIRAs to the relevant user guide QA issue
[jira] [Commented] (SPARK-22036) BigDecimal multiplication sometimes returns null
[ https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618104#comment-16618104 ] Bruce Robbins commented on SPARK-22036: --- [~mgaido] In this change, you modified how precision and scale are determined when literals are promoted to decimal. For example, before the change, an integer literal's precision and scale would be hardcoded to DecimalType(10, 0). After the change, it's based on the number of digits in the literal. However, that new behavior for literals is not toggled by {{spark.sql.decimalOperations.allowPrecisionLoss}} like the other changes in behavior introduced by the PR. As a result, there are cases where we see truncation and rounding in 2.3/2.4 that we don't see in 2.2, and this change in behavior is not controllable via the configuration setting. E.g.:

In 2.2:
{noformat}
scala> sql("select 26393499451/(1e6 * 1000) as c1").printSchema
root
 |-- c1: decimal(27,13) (nullable = true)   <== 13 decimal digits

scala> sql("select 26393499451/(1e6 * 1000) as c1").show
+------------+
|          c1|
+------------+
|26.393499451|
+------------+
{noformat}
In 2.3 and up:
{noformat}
scala> sql("set spark.sql.decimalOperations.allowPrecisionLoss").show
+--------------------+-----+
|                 key|value|
+--------------------+-----+
|spark.sql.decimal...| true|
+--------------------+-----+

scala> sql("select 26393499451/(1e6 * 1000) as c1").printSchema
root
 |-- c1: decimal(12,7) (nullable = true)

scala> sql("select 26393499451/(1e6 * 1000) as c1").show
+----------+
|        c1|
+----------+
|26.3934995|   <== result is truncated and rounded up.
+----------+

scala> sql("set spark.sql.decimalOperations.allowPrecisionLoss=false").show
+--------------------+-----+
|                 key|value|
+--------------------+-----+
|spark.sql.decimal...|false|
+--------------------+-----+

scala> sql("select 26393499451/(1e6 * 1000) as c1").printSchema
root
 |-- c1: decimal(12,7) (nullable = true)

scala> sql("select 26393499451/(1e6 * 1000) as c1").show
+----------+
|        c1|
+----------+
|26.3934995|   <== result is still truncated and rounded up.
+----------+
{noformat}
I can force it to behave the old way, at least for this case, by explicitly casting the literal:
{noformat}
scala> sql("select 26393499451/(1e6 * cast(1000 as decimal(10, 0))) as c1").show
+------------+
|          c1|
+------------+
|26.393499451|
+------------+
{noformat}
Do you think it makes sense for {{spark.sql.decimalOperations.allowPrecisionLoss}} to also toggle how literal promotion happens (the old way vs. the new way)? > BigDecimal multiplication sometimes returns null > > > Key: SPARK-22036 > URL: https://issues.apache.org/jira/browse/SPARK-22036 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Olivier Blanvillain >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.0 > > > The multiplication of two BigDecimal numbers sometimes returns null. Here is > a minimal reproduction: > {code:java} > object Main extends App { > import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.sql.SparkSession > import spark.implicits._ > val conf = new > SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", > "false") > val spark = > SparkSession.builder().config(conf).appName("REPL").getOrCreate() > implicit val sqlContext = spark.sqlContext > case class X2(a: BigDecimal, b: BigDecimal) > val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), > BigDecimal(-1000.1 > val result = ds.select(ds("a") * ds("b")).collect.head > println(result) // [null] > } > {code}
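For context, the result type Spark derives for decimal division under these rules can be sketched as follows (a Python sketch written from memory of DecimalType's documented arithmetic; treat the exact constants, e.g. the minimum adjusted scale of 6 and the 38-digit cap, as assumptions rather than authoritative values):

```python
def divide_result_type(p1, s1, p2, s2, max_precision=38, min_scale=6):
    # Result type of decimal(p1,s1) / decimal(p2,s2), following the
    # SQL Server-style rule Spark adopted (sketch; constants assumed).
    scale = max(min_scale, s1 + p2 + 1)
    precision = p1 - s1 + s2 + scale
    if precision > max_precision:
        # With allowPrecisionLoss=true: keep all integral digits and
        # shrink the scale, but never below min_scale.
        int_digits = precision - scale
        scale = max(max_precision - int_digits, min(scale, min_scale))
        precision = max_precision
    return precision, scale
```

This is why the operand types matter so much in the example above: the literal's inferred precision and scale feed directly into the result type, so promoting 1000 to decimal(4,0) rather than decimal(10,0) yields a different (narrower) quotient type.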
[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks
[ https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618098#comment-16618098 ] Sandish Kumar HN commented on SPARK-20760: -- I do see the issue in Spark 2.1.1 & 2.2.0, and I was able to replicate it with the above code snippets. > Memory Leak of RDD blocks > -- > > Key: SPARK-20760 > URL: https://issues.apache.org/jira/browse/SPARK-20760 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 >Reporter: Binzi Cao >Priority: Major > Attachments: RDD Blocks .png, RDD blocks in spark 2.1.1.png, Storage > in spark 2.1.1.png > > > Memory leak of RDD blocks in a long-running RDD process. > We have a long-running application which is doing computations on RDDs, > and we found the RDD blocks keep increasing in the Spark UI page. > The RDD blocks and memory usage do not match the cached RDDs and memory. It > looks like Spark keeps old RDDs in memory and never releases them, or never gets a > chance to release them. The job will eventually die of out-of-memory. > In addition, I'm not seeing this issue in Spark 1.6. We are seeing the same > issue in YARN cluster mode both in Kafka streaming and batch applications. > The issue in streaming is similar; however, it seems the RDD blocks grow a > bit more slowly than in batch jobs. > Below is the sample code, and it is reproducible by just running it in > local mode.
> Scala file: > {code} > import scala.concurrent.duration.Duration > import scala.util.{Try, Failure, Success} > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.rdd.RDD > import scala.concurrent._ > import ExecutionContext.Implicits.global > case class Person(id: String, name: String) > object RDDApp { > def run(sc: SparkContext) = { > while (true) { > val r = scala.util.Random > val data = (1 to r.nextInt(100)).toList.map { a => > Person(a.toString, a.toString) > } > val rdd = sc.parallelize(data) > rdd.cache > println("running") > val a = (1 to 100).toList.map { x => > Future(rdd.filter(_.id == x.toString).collect) > } > a.foreach { f => > println(Await.ready(f, Duration.Inf).value.get) > } > rdd.unpersist() > } > } > def main(args: Array[String]): Unit = { > val conf = new SparkConf().setAppName("test") > val sc = new SparkContext(conf) > run(sc) > } > } > {code} > build sbt file: > {code} > name := "RDDTest" > version := "0.1.1" > scalaVersion := "2.11.5" > libraryDependencies ++= Seq ( > "org.scalaz" %% "scalaz-core" % "7.2.0", > "org.scalaz" %% "scalaz-concurrent" % "7.2.0", > "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided", > "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided" > ) > addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1") > mainClass in assembly := Some("RDDApp") > test in assembly := {} > {code} > To reproduce it: > Just > {code} > spark-2.1.0-bin-hadoop2.7/bin/spark-submit --driver-memory 4G \ > --executor-memory 4G \ > --executor-cores 1 \ > --num-executors 1 \ > --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar > {code}
[jira] [Commented] (SPARK-20074) Make buffer size in unsafe external sorter configurable
[ https://issues.apache.org/jira/browse/SPARK-20074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618061#comment-16618061 ] Kevin English commented on SPARK-20074: --- I found this issue from external content indicating that this limits IO write block sizes, for instance for Parquet files following an N:1 re-partitioning. Can someone confirm that being able to radically increase this value would reduce spilling when aggregating a large number of small files? > Make buffer size in unsafe external sorter configurable > --- > > Key: SPARK-20074 > URL: https://issues.apache.org/jira/browse/SPARK-20074 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Sital Kedia >Priority: Major > > Currently, it is hardcoded to 32kb, see - > https://github.com/sitalkedia/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L123
[jira] [Resolved] (SPARK-16323) Avoid unnecessary cast when doing integral divide
[ https://issues.apache.org/jira/browse/SPARK-16323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-16323. --- Resolution: Fixed Issue resolved by pull request 22395 [https://github.com/apache/spark/pull/22395] > Avoid unnecessary cast when doing integral divide > - > > Key: SPARK-16323 > URL: https://issues.apache.org/jira/browse/SPARK-16323 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.5.0 >Reporter: Sean Zhong >Assignee: Marco Gaido >Priority: Minor > Fix For: 2.5.0 > > > This is a follow up of issue SPARK-15776 > *Problem:* > For the integer divide operator div: > {code} > scala> spark.sql("select 6 div 3").explain(true) > ... > == Analyzed Logical Plan == > CAST((6 / 3) AS BIGINT): bigint > Project [cast((cast(6 as double) / cast(3 as double)) as bigint) AS CAST((6 / > 3) AS BIGINT)#5L] > +- OneRowRelation$ > ... > {code} > For performance reasons, we should not do the unnecessary cast {{cast(xx as > double)}}
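Beyond performance, the detour through double can change the answer for 64-bit inputs, since a double carries only a 53-bit mantissa. A quick illustration in plain Python (floats stand in for Spark's double type):

```python
# A 61-bit integer cannot be represented exactly in a 64-bit double
# (53-bit mantissa), so divide-via-double loses precision that a
# direct integral divide keeps.
n = (1 << 60) + 1
exact = n // 3              # integral divide, exact
via_double = int(n / 3)     # cast to double, divide, truncate
print(exact, via_double)    # the two results differ
```

So removing the cast is both a speedup and a correctness improvement for large bigint operands.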
[jira] [Assigned] (SPARK-16323) Avoid unnecessary cast when doing integral divide
[ https://issues.apache.org/jira/browse/SPARK-16323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-16323: - Assignee: Marco Gaido > Avoid unnecessary cast when doing integral divide > - > > Key: SPARK-16323 > URL: https://issues.apache.org/jira/browse/SPARK-16323 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.5.0 >Reporter: Sean Zhong >Assignee: Marco Gaido >Priority: Minor > Fix For: 2.5.0 > > > This is a follow up of issue SPARK-15776 > *Problem:* > For the integer divide operator div: > {code} > scala> spark.sql("select 6 div 3").explain(true) > ... > == Analyzed Logical Plan == > CAST((6 / 3) AS BIGINT): bigint > Project [cast((cast(6 as double) / cast(3 as double)) as bigint) AS CAST((6 / > 3) AS BIGINT)#5L] > +- OneRowRelation$ > ... > {code} > For performance reasons, we should not do the unnecessary cast {{cast(xx as > double)}}
[jira] [Updated] (SPARK-16323) Avoid unnecessary cast when doing integral divide
[ https://issues.apache.org/jira/browse/SPARK-16323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-16323: -- Affects Version/s: 2.5.0 > Avoid unnecessary cast when doing integral divide > - > > Key: SPARK-16323 > URL: https://issues.apache.org/jira/browse/SPARK-16323 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.5.0 >Reporter: Sean Zhong >Priority: Minor > Fix For: 2.5.0 > > > This is a follow up of issue SPARK-15776 > *Problem:* > For the integer divide operator div: > {code} > scala> spark.sql("select 6 div 3").explain(true) > ... > == Analyzed Logical Plan == > CAST((6 / 3) AS BIGINT): bigint > Project [cast((cast(6 as double) / cast(3 as double)) as bigint) AS CAST((6 / > 3) AS BIGINT)#5L] > +- OneRowRelation$ > ... > {code} > For performance reasons, we should not do the unnecessary cast {{cast(xx as > double)}}
[jira] [Updated] (SPARK-16323) Avoid unnecessary cast when doing integral divide
[ https://issues.apache.org/jira/browse/SPARK-16323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-16323: -- Fix Version/s: 2.5.0 > Avoid unnecessary cast when doing integral divide > - > > Key: SPARK-16323 > URL: https://issues.apache.org/jira/browse/SPARK-16323 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.5.0 >Reporter: Sean Zhong >Priority: Minor > Fix For: 2.5.0 > > > This is a follow-up of issue SPARK-15776 > *Problem:* > For the integer divide operator div: > {code} > scala> spark.sql("select 6 div 3").explain(true) > ... > == Analyzed Logical Plan == > CAST((6 / 3) AS BIGINT): bigint > Project [cast((cast(6 as double) / cast(3 as double)) as bigint) AS CAST((6 / > 3) AS BIGINT)#5L] > +- OneRowRelation$ > ... > {code} > For performance reasons, we should not do the unnecessary cast {{cast(xx as > double)}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
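Beyond the performance cost, routing an integral divide through doubles can also lose precision for large 64-bit values. A minimal, Spark-free Scala sketch of the difference; the object and method names here are illustrative, not Spark API:

```scala
object IntegralDivide {
  // Divide via double, as the analyzed plan above does: cast both sides
  // to double, divide, then cast the result back to a 64-bit integer.
  def divViaDouble(a: Long, b: Long): Long = (a.toDouble / b.toDouble).toLong

  // Divide directly on longs, avoiding both casts.
  def divDirect(a: Long, b: Long): Long = a / b

  def main(args: Array[String]): Unit = {
    println(divDirect(6L, 3L))    // 2
    println(divViaDouble(6L, 3L)) // 2 -- identical for small values
    val big = Long.MaxValue - 1   // 9223372036854775806
    println(divDirect(big, 1L))   // 9223372036854775806
    // A double cannot represent `big` exactly, so the round-trip
    // through toDouble comes back off by one:
    println(divViaDouble(big, 1L)) // 9223372036854775807
  }
}
```

Small operands behave the same either way; the divergence only shows up once values exceed the 53-bit mantissa of a double.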
[jira] [Resolved] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.metadata
[ https://issues.apache.org/jira/browse/SPARK-25423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25423. --- Resolution: Fixed Fix Version/s: 2.5.0 Issue resolved by pull request 22435 [https://github.com/apache/spark/pull/22435] > Output "dataFilters" in DataSourceScanExec.metadata > --- > > Key: SPARK-25423 > URL: https://issues.apache.org/jira/browse/SPARK-25423 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.5.0 >Reporter: Maryann Xue >Assignee: Yuming Wang >Priority: Trivial > Labels: starter > Fix For: 2.5.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25339) Refactor FilterPushdownBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25339: Assignee: (was: Apache Spark) > Refactor FilterPushdownBenchmark to use main method > --- > > Key: SPARK-25339 > URL: https://issues.apache.org/jira/browse/SPARK-25339 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > Wenchen commented on the PR: > https://github.com/apache/spark/pull/22336#issuecomment-418604019 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25339) Refactor FilterPushdownBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617853#comment-16617853 ] Apache Spark commented on SPARK-25339: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/22443 > Refactor FilterPushdownBenchmark to use main method > --- > > Key: SPARK-25339 > URL: https://issues.apache.org/jira/browse/SPARK-25339 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > Wenchen commented on the PR: > https://github.com/apache/spark/pull/22336#issuecomment-418604019 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25339) Refactor FilterPushdownBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25339: Assignee: Apache Spark > Refactor FilterPushdownBenchmark to use main method > --- > > Key: SPARK-25339 > URL: https://issues.apache.org/jira/browse/SPARK-25339 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > Wenchen commented on the PR: > https://github.com/apache/spark/pull/22336#issuecomment-418604019 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23906) Add UDF trunc(numeric)
[ https://issues.apache.org/jira/browse/SPARK-23906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617828#comment-16617828 ] Apache Spark commented on SPARK-23906: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/22419 > Add UDF trunc(numeric) > -- > > Key: SPARK-23906 > URL: https://issues.apache.org/jira/browse/SPARK-23906 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > > https://issues.apache.org/jira/browse/HIVE-14582 > We already have {{date_trunc}} and {{trunc}}. Need to discuss whether we > should introduce a new name or reuse {{trunc}} for truncating numbers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25442) Support STS to run in K8S deployment with spark deployment mode as cluster
[ https://issues.apache.org/jira/browse/SPARK-25442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617831#comment-16617831 ] Apache Spark commented on SPARK-25442: -- User 'suryag10' has created a pull request for this issue: https://github.com/apache/spark/pull/22433 > Support STS to run in K8S deployment with spark deployment mode as cluster > -- > > Key: SPARK-25442 > URL: https://issues.apache.org/jira/browse/SPARK-25442 > Project: Spark > Issue Type: Bug > Components: Kubernetes, SQL >Affects Versions: 2.4.0, 2.5.0 >Reporter: Suryanarayana Garlapati >Priority: Major > > STS fails to start in kubernetes deployments with spark deploy mode as > cluster. Support should be added to make it run in K8S deployments. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25430) Add map parameter for withColumnRenamed
[ https://issues.apache.org/jira/browse/SPARK-25430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25430: Assignee: (was: Apache Spark) > Add map parameter for withColumnRenamed > --- > > Key: SPARK-25430 > URL: https://issues.apache.org/jira/browse/SPARK-25430 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Goun Na >Priority: Major > > The withColumnRenamed method should work with a Map parameter. This removes code > redundancy. > {code:java} > // example > df.withColumnRenamed(Map( "c1" -> "first_column", "c2" -> "second_column" > )){code} > {code:java} > // from abbr columns to desc columns > val m = Map( "c1" -> "first_column", "c2" -> "second_column" ) > df1.withColumnRenamed(m) > df2.withColumnRenamed(m) > {code} > It is useful for CJK users when they are working on analysis in notebook > environments such as Zeppelin, Databricks, or Apache Toree. > {code:java} > // for CJK users: once the dictionary is defined as a map, reuse the column map to > translate columns whenever report visualization is required > val m = Map( "c1" -> "컬럼_1", "c2" -> "컬럼_2") > df1.withColumnRenamed(m) > df2.withColumnRenamed(m) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
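A Map-based rename can be layered on the existing single-pair API by folding over the entries. Spark is not on the classpath here, so this Spark-free Scala sketch models a frame as a column-name-to-values map; the foldLeft pattern is the same one an implementation could apply over the real Dataset.withColumnRenamed, and all names below are illustrative:

```scala
object RenameSketch {
  type Frame = Map[String, Seq[Int]] // stand-in for a DataFrame's columns

  // Rename one column, mirroring the existing withColumnRenamed(old, new).
  def renameOne(df: Frame, existing: String, newName: String): Frame =
    df.get(existing) match {
      case Some(values) => (df - existing) + (newName -> values)
      case None         => df // like Spark, a missing column is a no-op
    }

  // The proposed Map variant: fold the single-pair rename over all entries.
  def renameAll(df: Frame, renames: Map[String, String]): Frame =
    renames.foldLeft(df) { case (acc, (from, to)) => renameOne(acc, from, to) }

  def main(args: Array[String]): Unit = {
    val df = Map("c1" -> Seq(1, 2), "c2" -> Seq(3, 4))
    val renamed =
      renameAll(df, Map("c1" -> "first_column", "c2" -> "second_column"))
    println(renamed.keys.toList.sorted) // List(first_column, second_column)
  }
}
```

The fold keeps each rename independent, so one Map call replaces a chain of withColumnRenamed calls without changing the single-pair semantics.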
[jira] [Commented] (SPARK-25424) Window duration and slide duration with negative values should fail fast
[ https://issues.apache.org/jira/browse/SPARK-25424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617832#comment-16617832 ] Apache Spark commented on SPARK-25424: -- User 'raghavgautam' has created a pull request for this issue: https://github.com/apache/spark/pull/22414 > Window duration and slide duration with negative values should fail fast > > > Key: SPARK-25424 > URL: https://issues.apache.org/jira/browse/SPARK-25424 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Raghav Kumar Gautam >Priority: Major > Fix For: 2.4.0 > > > In the TimeWindow class, window duration and slide duration should not be allowed > to take negative values. > Currently this behaviour is enforced by Catalyst. It could instead be enforced by the > constructor of TimeWindow, allowing it to fail fast. > For example, the code below throws the following error. Note that the error is > produced at the time of the count() call instead of the window() call. > {code:java} > val df = spark.readStream > .format("rate") > .option("numPartitions", "2") > .option("rowsPerSecond", "10") > .load() > .filter("value % 20 == 0") > .withWatermark("timestamp", "10 seconds") > .groupBy(window($"timestamp", "-10 seconds", "5 seconds")) > .count() > {code} > Error: > {code:java} > cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data > type mismatch: The window duration (-1000) must be greater than 0.;; > 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], > [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, > count(1) AS count#57L] > +- AnalysisBarrier > +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds > +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint)) > +- StreamingRelationV2 > org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, > Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], > StreamingRelation > 
DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond > -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L] > org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, > -1000, 500, 0)' due to data type mismatch: The window duration > (-1000) must be greater than 0.;; > 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], > [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, > count(1) AS count#57L] > +- AnalysisBarrier > +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds > +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint)) > +- StreamingRelationV2 > org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, > Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], > StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond > -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L] > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118) > at >
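Failing fast here amounts to validating the durations in the constructor rather than waiting for analysis. A Spark-free Scala sketch of that pattern; the class below is an illustrative stand-in, not the real org.apache.spark.sql.catalyst.expressions.TimeWindow:

```scala
// Illustrative stand-in for TimeWindow: validate in the constructor so a
// bad duration throws at window() time, not at the later count() call.
final case class TimeWindowSketch(windowDuration: Long, slideDuration: Long) {
  require(windowDuration > 0,
    s"The window duration ($windowDuration) must be greater than 0.")
  require(slideDuration > 0,
    s"The slide duration ($slideDuration) must be greater than 0.")
}

object TimeWindowDemo {
  def main(args: Array[String]): Unit = {
    println(TimeWindowSketch(10000L, 5000L)) // constructs fine
    try {
      TimeWindowSketch(-10000L, 5000L) // require fails immediately
    } catch {
      // require throws IllegalArgumentException with the message above,
      // prefixed by Scala's standard "requirement failed: "
      case e: IllegalArgumentException => println(e.getMessage)
    }
  }
}
```

Moving the check into the constructor surfaces the error at the call site that built the bad window, instead of deep inside a Catalyst analysis stack trace like the one quoted above.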
[jira] [Assigned] (SPARK-25440) Dump query execution info to a file
[ https://issues.apache.org/jira/browse/SPARK-25440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25440: Assignee: Apache Spark > Dump query execution info to a file > --- > > Key: SPARK-25440 > URL: https://issues.apache.org/jira/browse/SPARK-25440 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > The output of explain() doesn't contain full information and in some cases > can be truncated. Besides that, it saves the info to a string in memory, which > can cause OOM. This ticket aims to solve the problem and dump info about query > execution to a file. We need to add a new method to queryExecution.debug which > accepts a path to a file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23367) Include python document style checking
[ https://issues.apache.org/jira/browse/SPARK-23367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617827#comment-16617827 ] Apache Spark commented on SPARK-23367: -- User 'rekhajoshm' has created a pull request for this issue: https://github.com/apache/spark/pull/22425 > Include python document style checking > -- > > Key: SPARK-23367 > URL: https://issues.apache.org/jira/browse/SPARK-23367 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.1 >Reporter: Rekha Joshi >Priority: Minor > > As per discussions [PR#20378 |https://github.com/apache/spark/pull/20378] > this jira is to include python doc style checking in spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25442) Support STS to run in K8S deployment with spark deployment mode as cluster
[ https://issues.apache.org/jira/browse/SPARK-25442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25442: Assignee: Apache Spark > Support STS to run in K8S deployment with spark deployment mode as cluster > -- > > Key: SPARK-25442 > URL: https://issues.apache.org/jira/browse/SPARK-25442 > Project: Spark > Issue Type: Bug > Components: Kubernetes, SQL >Affects Versions: 2.4.0, 2.5.0 >Reporter: Suryanarayana Garlapati >Assignee: Apache Spark >Priority: Major > > STS fails to start in kubernetes deployments with spark deploy mode as > cluster. Support should be added to make it run in K8S deployments. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25430) Add map parameter for withColumnRenamed
[ https://issues.apache.org/jira/browse/SPARK-25430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25430: Assignee: Apache Spark > Add map parameter for withColumnRenamed > --- > > Key: SPARK-25430 > URL: https://issues.apache.org/jira/browse/SPARK-25430 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Goun Na >Assignee: Apache Spark >Priority: Major > > The withColumnRenamed method should work with a Map parameter. This removes code > redundancy. > {code:java} > // example > df.withColumnRenamed(Map( "c1" -> "first_column", "c2" -> "second_column" > )){code} > {code:java} > // from abbr columns to desc columns > val m = Map( "c1" -> "first_column", "c2" -> "second_column" ) > df1.withColumnRenamed(m) > df2.withColumnRenamed(m) > {code} > It is useful for CJK users when they are working on analysis in notebook > environments such as Zeppelin, Databricks, or Apache Toree. > {code:java} > // for CJK users: once the dictionary is defined as a map, reuse the column map to > translate columns whenever report visualization is required > val m = Map( "c1" -> "컬럼_1", "c2" -> "컬럼_2") > df1.withColumnRenamed(m) > df2.withColumnRenamed(m) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25429) SparkListenerBus inefficient due to 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure
[ https://issues.apache.org/jira/browse/SPARK-25429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25429: Assignee: Apache Spark > SparkListenerBus inefficient due to > 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure > > > Key: SPARK-25429 > URL: https://issues.apache.org/jira/browse/SPARK-25429 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: DENG FEI >Assignee: Apache Spark >Priority: Major > > {code:java} > private def updateStageMetrics( > stageId: Int, > attemptId: Int, > taskId: Long, > accumUpdates: Seq[AccumulableInfo], > succeeded: Boolean): Unit = { > Option(stageMetrics.get(stageId)).foreach { metrics => > if (metrics.attemptId != attemptId || metrics.accumulatorIds.isEmpty) { > return > } > val oldTaskMetrics = metrics.taskMetrics.get(taskId) > if (oldTaskMetrics != null && oldTaskMetrics.succeeded) { > return > } > val updates = accumUpdates > .filter { acc => acc.update.isDefined && > metrics.accumulatorIds.contains(acc.id) } > .sortBy(_.id) > if (updates.isEmpty) { > return > } > val ids = new Array[Long](updates.size) > val values = new Array[Long](updates.size) > updates.zipWithIndex.foreach { case (acc, idx) => > ids(idx) = acc.id > // In a live application, accumulators have Long values, but when > reading from event > // logs, they have String values. For now, assume all accumulators > are Long and covert > // accordingly. > values(idx) = acc.update.get match { > case s: String => s.toLong > case l: Long => l > case o => throw new IllegalArgumentException(s"Unexpected: $o") > } > } > // TODO: storing metrics by task ID can cause metrics for the same task > index to be > // counted multiple times, for example due to speculation or > re-attempts. 
> metrics.taskMetrics.put(taskId, new LiveTaskMetrics(ids, values, > succeeded)) > } > } > {code} > In 'metrics.accumulatorIds.contains(acc.id)', if a large SQL application generates > many accumulators, using Array#contains is inefficient. > In practice, the application may time out while quitting and be killed by the RM in YARN > mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
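The fix the report points at is a data-structure swap: a membership test against an Array[Long] is O(n) per accumulator, while a hash set is O(1) amortized. A Spark-free Scala sketch of the filter in question under that swap; the names mirror the snippet above but are illustrative:

```scala
import scala.collection.immutable.HashSet

object AccumFilterSketch {
  // Minimal stand-in for Spark's AccumulableInfo.
  final case class AccumulableInfo(id: Long, update: Option[Long])

  // O(n) membership per element, as with accumulatorIds: Array[Long].
  def filterWithArray(ids: Array[Long],
                      updates: Seq[AccumulableInfo]): Seq[AccumulableInfo] =
    updates.filter(acc => acc.update.isDefined && ids.contains(acc.id))

  // O(1) amortized membership with a HashSet built once per stage.
  def filterWithSet(ids: HashSet[Long],
                    updates: Seq[AccumulableInfo]): Seq[AccumulableInfo] =
    updates.filter(acc => acc.update.isDefined && ids.contains(acc.id))

  def main(args: Array[String]): Unit = {
    val ids = (0L until 4L).toArray
    val updates = Seq(
      AccumulableInfo(1L, Some(10L)), // known id, has an update: kept
      AccumulableInfo(9L, Some(2L)),  // unknown id: dropped
      AccumulableInfo(2L, None))      // no update: dropped
    println(filterWithArray(ids, updates).map(_.id))            // only id 1
    println(filterWithSet(HashSet(ids: _*), updates).map(_.id)) // only id 1
  }
}
```

Both versions return the same result; the difference is only visible as listener-bus latency once a stage tracks thousands of accumulators and every task update rescans the array.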
[jira] [Commented] (SPARK-25440) Dump query execution info to a file
[ https://issues.apache.org/jira/browse/SPARK-25440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617830#comment-16617830 ] Apache Spark commented on SPARK-25440: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/22429 > Dump query execution info to a file > --- > > Key: SPARK-25440 > URL: https://issues.apache.org/jira/browse/SPARK-25440 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > The output of explain() doesn't contain full information and in some cases > can be truncated. Besides that, it saves the info to a string in memory, which > can cause OOM. This ticket aims to solve the problem and dump info about query > execution to a file. We need to add a new method to queryExecution.debug which > accepts a path to a file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25442) Support STS to run in K8S deployment with spark deployment mode as cluster
[ https://issues.apache.org/jira/browse/SPARK-25442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25442: Assignee: (was: Apache Spark) > Support STS to run in K8S deployment with spark deployment mode as cluster > -- > > Key: SPARK-25442 > URL: https://issues.apache.org/jira/browse/SPARK-25442 > Project: Spark > Issue Type: Bug > Components: Kubernetes, SQL >Affects Versions: 2.4.0, 2.5.0 >Reporter: Suryanarayana Garlapati >Priority: Major > > STS fails to start in kubernetes deployments with spark deploy mode as > cluster. Support should be added to make it run in K8S deployments. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25440) Dump query execution info to a file
[ https://issues.apache.org/jira/browse/SPARK-25440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25440: Assignee: (was: Apache Spark) > Dump query execution info to a file > --- > > Key: SPARK-25440 > URL: https://issues.apache.org/jira/browse/SPARK-25440 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > The output of explain() doesn't contain full information and in some cases > can be truncated. Besides that, it saves the info to a string in memory, which > can cause OOM. This ticket aims to solve the problem and dump info about query > execution to a file. We need to add a new method to queryExecution.debug which > accepts a path to a file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
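The proposed debug method boils down to streaming the plan text to disk instead of materializing it as one in-memory string. A Spark-free Scala sketch using only java.nio; the writeToFile name and the section strings are illustrative, not whatever API the eventual PR settles on:

```scala
import java.io.BufferedWriter
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object DumpSketch {
  // Stream plan sections to a file one at a time, so the full, untruncated
  // text never has to be assembled as a single string (avoiding the OOM
  // risk the ticket describes).
  def writeToFile(path: String, sections: Iterator[String]): Unit = {
    val writer: BufferedWriter =
      Files.newBufferedWriter(Paths.get(path), StandardCharsets.UTF_8)
    try sections.foreach { s => writer.write(s); writer.newLine() }
    finally writer.close()
  }

  def main(args: Array[String]): Unit = {
    val path = Files.createTempFile("plan", ".txt").toString
    writeToFile(path,
      Iterator("== Parsed Logical Plan ==", "== Physical Plan =="))
    println(Files.readAllLines(Paths.get(path)).size) // 2
  }
}
```

Because the sections arrive as an Iterator, each plan fragment can be rendered, written, and discarded before the next one is produced.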
[jira] [Assigned] (SPARK-25429) SparkListenerBus inefficient due to 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure
[ https://issues.apache.org/jira/browse/SPARK-25429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25429: Assignee: (was: Apache Spark) > SparkListenerBus inefficient due to > 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure > > > Key: SPARK-25429 > URL: https://issues.apache.org/jira/browse/SPARK-25429 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: DENG FEI >Priority: Major > > {code:java} > private def updateStageMetrics( > stageId: Int, > attemptId: Int, > taskId: Long, > accumUpdates: Seq[AccumulableInfo], > succeeded: Boolean): Unit = { > Option(stageMetrics.get(stageId)).foreach { metrics => > if (metrics.attemptId != attemptId || metrics.accumulatorIds.isEmpty) { > return > } > val oldTaskMetrics = metrics.taskMetrics.get(taskId) > if (oldTaskMetrics != null && oldTaskMetrics.succeeded) { > return > } > val updates = accumUpdates > .filter { acc => acc.update.isDefined && > metrics.accumulatorIds.contains(acc.id) } > .sortBy(_.id) > if (updates.isEmpty) { > return > } > val ids = new Array[Long](updates.size) > val values = new Array[Long](updates.size) > updates.zipWithIndex.foreach { case (acc, idx) => > ids(idx) = acc.id > // In a live application, accumulators have Long values, but when > reading from event > // logs, they have String values. For now, assume all accumulators > are Long and covert > // accordingly. > values(idx) = acc.update.get match { > case s: String => s.toLong > case l: Long => l > case o => throw new IllegalArgumentException(s"Unexpected: $o") > } > } > // TODO: storing metrics by task ID can cause metrics for the same task > index to be > // counted multiple times, for example due to speculation or > re-attempts. 
> metrics.taskMetrics.put(taskId, new LiveTaskMetrics(ids, values, > succeeded)) > } > } > {code} > In 'metrics.accumulatorIds.contains(acc.id)', if a large SQL application generates > many accumulators, using Array#contains is inefficient. > In practice, the application may time out while quitting and be killed by the RM in YARN > mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23906) Add UDF trunc(numeric)
[ https://issues.apache.org/jira/browse/SPARK-23906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23906: Assignee: Yuming Wang (was: Apache Spark) > Add UDF trunc(numeric) > -- > > Key: SPARK-23906 > URL: https://issues.apache.org/jira/browse/SPARK-23906 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > > https://issues.apache.org/jira/browse/HIVE-14582 > We already have {{date_trunc}} and {{trunc}}. Need to discuss whether we > should introduce a new name or reuse {{trunc}} for truncating numbers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25429) SparkListenerBus inefficient due to 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure
[ https://issues.apache.org/jira/browse/SPARK-25429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617825#comment-16617825 ] Apache Spark commented on SPARK-25429: -- User 'hellodengfei' has created a pull request for this issue: https://github.com/apache/spark/pull/22420 > SparkListenerBus inefficient due to > 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure > > > Key: SPARK-25429 > URL: https://issues.apache.org/jira/browse/SPARK-25429 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: DENG FEI >Priority: Major > > {code:java} > private def updateStageMetrics( > stageId: Int, > attemptId: Int, > taskId: Long, > accumUpdates: Seq[AccumulableInfo], > succeeded: Boolean): Unit = { > Option(stageMetrics.get(stageId)).foreach { metrics => > if (metrics.attemptId != attemptId || metrics.accumulatorIds.isEmpty) { > return > } > val oldTaskMetrics = metrics.taskMetrics.get(taskId) > if (oldTaskMetrics != null && oldTaskMetrics.succeeded) { > return > } > val updates = accumUpdates > .filter { acc => acc.update.isDefined && > metrics.accumulatorIds.contains(acc.id) } > .sortBy(_.id) > if (updates.isEmpty) { > return > } > val ids = new Array[Long](updates.size) > val values = new Array[Long](updates.size) > updates.zipWithIndex.foreach { case (acc, idx) => > ids(idx) = acc.id > // In a live application, accumulators have Long values, but when > reading from event > // logs, they have String values. For now, assume all accumulators > are Long and covert > // accordingly. > values(idx) = acc.update.get match { > case s: String => s.toLong > case l: Long => l > case o => throw new IllegalArgumentException(s"Unexpected: $o") > } > } > // TODO: storing metrics by task ID can cause metrics for the same task > index to be > // counted multiple times, for example due to speculation or > re-attempts. 
> metrics.taskMetrics.put(taskId, new LiveTaskMetrics(ids, values, > succeeded)) > } > } > {code} > In 'metrics.accumulatorIds.contains(acc.id)', if a large SQL application generates > many accumulators, using Array#contains is inefficient. > In practice, the application may time out while quitting and be killed by the RM in YARN > mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25303) A DStream that is checkpointed should allow its parent(s) to be removed and not persisted
[ https://issues.apache.org/jira/browse/SPARK-25303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25303: Assignee: (was: Apache Spark) > A DStream that is checkpointed should allow its parent(s) to be removed and > not persisted > - > > Key: SPARK-25303 > URL: https://issues.apache.org/jira/browse/SPARK-25303 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, > 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Nikunj Bansal >Priority: Major > Labels: Streaming, streaming > > A checkpointed DStream is supposed to cut the lineage to its parent(s) such > that any persisted RDDs for the parent(s) are removed. However, combined with > the issue in SPARK-25302, this results in the Input Stream RDDs being > persisted much longer than actually required. > See also related bug SPARK-25302. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25424) Window duration and slide duration with negative values should fail fast
[ https://issues.apache.org/jira/browse/SPARK-25424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25424: Assignee: (was: Apache Spark) > Window duration and slide duration with negative values should fail fast > > > Key: SPARK-25424 > URL: https://issues.apache.org/jira/browse/SPARK-25424 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Raghav Kumar Gautam >Priority: Major > Fix For: 2.4.0 > > > In the TimeWindow class, window duration and slide duration should not be allowed > to take negative values. > Currently this behaviour is enforced by Catalyst. It could instead be enforced by the > constructor of TimeWindow, allowing it to fail fast. > For example, the code below throws the following error. Note that the error is > produced at the time of the count() call instead of the window() call. > {code:java} > val df = spark.readStream > .format("rate") > .option("numPartitions", "2") > .option("rowsPerSecond", "10") > .load() > .filter("value % 20 == 0") > .withWatermark("timestamp", "10 seconds") > .groupBy(window($"timestamp", "-10 seconds", "5 seconds")) > .count() > {code} > Error: > {code:java} > cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data > type mismatch: The window duration (-1000) must be greater than 0.;; > 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], > [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, > count(1) AS count#57L] > +- AnalysisBarrier > +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds > +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint)) > +- StreamingRelationV2 > org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, > Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], > StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond > -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L] > 
org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, > -1000, 500, 0)' due to data type mismatch: The window duration > (-1000) must be greater than 0.;; > 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], > [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, > count(1) AS count#57L] > +- AnalysisBarrier > +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds > +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint)) > +- StreamingRelationV2 > org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, > Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], > StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond > -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L] > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:122) > at >
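The fail-fast change the ticket proposes amounts to moving the duration checks from Catalyst's analysis phase into the constructor, so a bad window fails at the window() call rather than at count(). A minimal pure-Python sketch of that validation logic (the class and field names here are illustrative, not Spark's actual TimeWindow API):

```python
class TimeWindow:
    """Illustrative sketch: validate durations eagerly, in the constructor,
    so an invalid window fails at construction time rather than later,
    when the query plan is analyzed."""

    def __init__(self, window_duration_ms: int, slide_duration_ms: int):
        # Fail fast: reject non-positive durations before any plan is built.
        if window_duration_ms <= 0:
            raise ValueError(
                f"The window duration ({window_duration_ms}) must be greater than 0.")
        if slide_duration_ms <= 0:
            raise ValueError(
                f"The slide duration ({slide_duration_ms}) must be greater than 0.")
        if slide_duration_ms > window_duration_ms:
            raise ValueError(
                f"The slide duration ({slide_duration_ms}) must be less than or "
                f"equal to the window duration ({window_duration_ms}).")
        self.window_duration_ms = window_duration_ms
        self.slide_duration_ms = slide_duration_ms
```

With this shape of check in the constructor, the negative-duration error in the example above would surface immediately at the window() call.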
[jira] [Assigned] (SPARK-23906) Add UDF trunc(numeric)
[ https://issues.apache.org/jira/browse/SPARK-23906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23906: Assignee: Apache Spark (was: Yuming Wang) > Add UDF trunc(numeric) > -- > > Key: SPARK-23906 > URL: https://issues.apache.org/jira/browse/SPARK-23906 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > https://issues.apache.org/jira/browse/HIVE-14582 > We already have {{date_trunc}} and {{trunc}}. Need to discuss whether we > should introduce a new name or reuse {{trunc}} for truncating numbers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
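For reference, the Hive function this sub-task mirrors (HIVE-14582) truncates a number toward zero at a given number of decimal places. A small pure-Python sketch of those semantics (an illustration of the proposed behaviour, not Spark's or Hive's implementation):

```python
from decimal import Decimal, ROUND_DOWN

def trunc(number, scale=0):
    """Truncate `number` toward zero at `scale` decimal places, in the style
    of Hive's trunc(N, scale); a negative scale truncates to the left of
    the decimal point."""
    d = Decimal(str(number))
    quantum = Decimal(1).scaleb(-scale)  # e.g. scale=2 -> Decimal('0.01')
    # ROUND_DOWN is truncation toward zero, for both signs.
    return d.quantize(quantum, rounding=ROUND_DOWN)
```

For example, trunc(1234.567, 2) yields 1234.56 and trunc(1234.567, -2) yields 1200.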
[jira] [Commented] (SPARK-25303) A DStream that is checkpointed should allow its parent(s) to be removed and not persisted
[ https://issues.apache.org/jira/browse/SPARK-25303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617829#comment-16617829 ] Apache Spark commented on SPARK-25303: -- User 'nikunjb' has created a pull request for this issue: https://github.com/apache/spark/pull/22424 > A DStream that is checkpointed should allow its parent(s) to be removed and > not persisted > - > > Key: SPARK-25303 > URL: https://issues.apache.org/jira/browse/SPARK-25303 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, > 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Nikunj Bansal >Priority: Major > Labels: Streaming, streaming > > A checkpointed DStream is supposed to cut the lineage to its parent(s) such > that any persisted RDDs for the parent(s) are removed. However, combined with > the issue in SPARK-25302, they result in the Input Stream RDDs being > persisted a lot longer than they are actually required. > See also related bug SPARK-25302. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25430) Add map parameter for withColumnRenamed
[ https://issues.apache.org/jira/browse/SPARK-25430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617824#comment-16617824 ] Apache Spark commented on SPARK-25430: -- User 'goungoun' has created a pull request for this issue: https://github.com/apache/spark/pull/22428 > Add map parameter for withColumnRenamed > --- > > Key: SPARK-25430 > URL: https://issues.apache.org/jira/browse/SPARK-25430 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Goun Na >Priority: Major > > The withColumnRenamed method should accept a Map parameter, which removes code > redundancy. > {code:java} > // example > df.withColumnRenamed(Map( "c1" -> "first_column", "c2" -> "second_column" > )){code} > {code:java} > // from abbreviated columns to descriptive columns > val m = Map( "c1" -> "first_column", "c2" -> "second_column" ) > df1.withColumnRenamed(m) > df2.withColumnRenamed(m) > {code} > It is useful for CJK users working on analysis in notebook > environments such as Zeppelin, Databricks, or Apache Toree. > {code:java} > // CJK users can define a dictionary once as a map, then reuse it to > translate columns whenever report visualization is required > val m = Map( "c1" -> "컬럼_1", "c2" -> "컬럼_2") > df1.withColumnRenamed(m) > df2.withColumnRenamed(m) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
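Until such an API exists, the Map form can be emulated by folding the rename map over the existing one-column-at-a-time withColumnRenamed. A pure-Python sketch of that fold, using a plain list of column names as a stand-in for a DataFrame schema (with a real DataFrame df the same fold is reduce(lambda d, kv: d.withColumnRenamed(*kv), renames.items(), df)):

```python
from functools import reduce

def with_columns_renamed(columns, renames):
    """Apply each (old -> new) rename in turn, exactly as chaining
    df.withColumnRenamed(old, new) would.  Like Spark's method, a rename
    whose old name is absent is a silent no-op, not an error."""
    def rename_one(cols, pair):
        old, new = pair
        return [new if c == old else c for c in cols]
    return reduce(rename_one, renames.items(), columns)
```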
[jira] [Assigned] (SPARK-25303) A DStream that is checkpointed should allow its parent(s) to be removed and not persisted
[ https://issues.apache.org/jira/browse/SPARK-25303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25303: Assignee: Apache Spark > A DStream that is checkpointed should allow its parent(s) to be removed and > not persisted > - > > Key: SPARK-25303 > URL: https://issues.apache.org/jira/browse/SPARK-25303 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, > 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Nikunj Bansal >Assignee: Apache Spark >Priority: Major > Labels: Streaming, streaming > > A checkpointed DStream is supposed to cut the lineage to its parent(s) such > that any persisted RDDs for the parent(s) are removed. However, combined with > the issue in SPARK-25302, they result in the Input Stream RDDs being > persisted a lot longer than they are actually required. > See also related bug SPARK-25302. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25424) Window duration and slide duration with negative values should fail fast
[ https://issues.apache.org/jira/browse/SPARK-25424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25424: Assignee: Apache Spark > Window duration and slide duration with negative values should fail fast > > > Key: SPARK-25424 > URL: https://issues.apache.org/jira/browse/SPARK-25424 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Raghav Kumar Gautam >Assignee: Apache Spark >Priority: Major > Fix For: 2.4.0 > > > In TimeWindow class window duration and slide duration should not be allowed > to take negative values. > Currently this behaviour enforced by catalyst. It can be enforced by > constructor of TimeWindow allowing it to fail fast. > For e.g. the code below throws following error. Note that the error is > produced at the time of count() call instead of window() call. > {code:java} > val df = spark.readStream > .format("rate") > .option("numPartitions", "2") > .option("rowsPerSecond", "10") > .load() > .filter("value % 20 == 0") > .withWatermark("timestamp", "10 seconds") > .groupBy(window($"timestamp", "-10 seconds", "5 seconds")) > .count() > {code} > Error: > {code:java} > cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data > type mismatch: The window duration (-1000) must be greater than 0.;; > 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], > [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, > count(1) AS count#57L] > +- AnalysisBarrier > +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds > +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint)) > +- StreamingRelationV2 > org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, > Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], > StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond > -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L] 
> org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, > -1000, 500, 0)' due to data type mismatch: The window duration > (-1000) must be greater than 0.;; > 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], > [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, > count(1) AS count#57L] > +- AnalysisBarrier > +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds > +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint)) > +- StreamingRelationV2 > org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, > Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], > StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond > -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L] > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118) > at >
[jira] [Commented] (SPARK-25433) Add support for PEX in PySpark
[ https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617823#comment-16617823 ] Apache Spark commented on SPARK-25433: -- User 'fhoering' has created a pull request for this issue: https://github.com/apache/spark/pull/22422 > Add support for PEX in PySpark > -- > > Key: SPARK-25433 > URL: https://issues.apache.org/jira/browse/SPARK-25433 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.2 >Reporter: Fabian Höring >Priority: Minor > > The goal of this ticket is to ship and use custom code inside the Spark > executors using [PEX|https://github.com/pantsbuild/pex] > This currently works fine with > [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html] > (disadvantages are that you have a separate conda package repo and ship the > Python interpreter all the time) > Basically the workflow is > * to zip the local conda environment ([conda > pack|https://github.com/conda/conda-pack] also works) > * ship it to each executor as an archive > * modify PYSPARK_PYTHON to the local conda environment > I think it can work the same way with virtual env. There is the SPARK-13587 > ticket to provide nice entry points to spark-submit and SparkContext, but > zipping your local virtual env and then just changing the PYSPARK_PYTHON env > variable should already work. > I have also seen this > [blog post|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html]. > But recreating the virtual env each time doesn't seem to be a very scalable > solution. If you have hundreds of executors it will retrieve the packages on > each executor and recreate your virtual environment each time. Same problem > with this proposal SPARK-16367 from what I understood. > Another problem with virtual env is that your local environment is not easily > shippable to another machine. 
In particular there is the relocatable option > (see > [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable], > > [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work]) > which makes it very complicated for the user to ship the virtual env and be > sure it works. > And here is where pex comes in. It is a nice way to create a single > executable zip file with all dependencies included. You have the pex command > line tool to build your package, and once it is built you are sure it works. > This is in my opinion the most elegant way to ship Python code (better than > virtual env and conda). > The reason it doesn't work out of the box is that there can be only a > single entry point. So just shipping the pex files and setting PYSPARK_PYTHON > to the pex files doesn't work. You can nevertheless tune the env variable > [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables] > at runtime to provide different entry points. > PR: [https://github.com/apache/spark/pull/22422/files] > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
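The "single executable zip with everything included" packaging model that pex implements can be seen in miniature with the standard library's zipapp module. Note the hedge: zipapp does not resolve and vendor third-party dependencies the way pex does, so this only illustrates the shape of the artifact, not pex itself:

```python
import os
import subprocess
import sys
import tempfile
import zipapp

# Build a self-contained executable archive from a small source tree, then
# run it with a plain interpreter.  A .pex file has the same artifact shape,
# with the resolved dependencies bundled in and PEX_MODULE selecting the
# entry point at run time.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "app")
    os.makedirs(src)
    with open(os.path.join(src, "__main__.py"), "w") as f:
        f.write("print('hello from the archive')\n")
    target = os.path.join(tmp, "app.pyz")
    zipapp.create_archive(src, target)  # one shippable file
    result = subprocess.run([sys.executable, target],
                            capture_output=True, text=True)
    print(result.stdout.strip())
```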
[jira] [Assigned] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.metadata
[ https://issues.apache.org/jira/browse/SPARK-25423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25423: Assignee: Apache Spark (was: Yuming Wang) > Output "dataFilters" in DataSourceScanExec.metadata > --- > > Key: SPARK-25423 > URL: https://issues.apache.org/jira/browse/SPARK-25423 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.5.0 >Reporter: Maryann Xue >Assignee: Apache Spark >Priority: Trivial > Labels: starter > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25433) Add support for PEX in PySpark
[ https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25433: Assignee: (was: Apache Spark) > Add support for PEX in PySpark > -- > > Key: SPARK-25433 > URL: https://issues.apache.org/jira/browse/SPARK-25433 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.2 >Reporter: Fabian Höring >Priority: Minor > > The goal of this ticket is to ship and use custom code inside the spark > executors using [PEX|https://github.com/pantsbuild/pex] > This currently works fine with > [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html] > (disadvantages are that you have a separate conda package repo and ship the > python interpreter all the time) > Basically the workflow is > * to zip the local conda environment ([conda > pack|https://github.com/conda/conda-pack] also works) > * ship it to each executor as an archive > * modify PYSPARK_PYTHON to the local conda environment > I think it can work the same way with virtual env. There is the SPARK-13587 > ticket to provide nice entry points to spark-submit and SparkContext but > zipping your local virtual env and then just changing PYSPARK_PYTHON env > variable should already work. > I also have seen this > [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html]. > But recreating the virtual env each time doesn't seem to be a very scalable > solution. If you have hundreds of executors it will retrieve the packages on > each excecutor and recreate your virtual environment each time. Same problem > with this proposal SPARK-16367 from what I understood. > Another problem with virtual env is that your local environment is not easily > shippable to another machine. 
In particular there is the relocatable option > (see > [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable], > > [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)] > which makes it very complicated for the user to ship the virtual env and be > sure it works. > And here is where pex comes in. It is a nice way to create a single > executable zip file with all dependencies included. You have the pex command > line tool to build your package and when it is built you are sure it works. > This is in my opinion the most elegant way to ship python code (better than > virtual env and conda) > The problem why it doesn't work out of the box is that there can be only one > single entry point. So just shipping the pex files and setting PYSPARK_PYTHON > to the pex files doesn't work. You can nevertheless tune the env variable > [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables] > and runtime to provide different entry points. > PR: [https://github.com/apache/spark/pull/22422/files] > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25302) ReducedWindowedDStream not using checkpoints for reduced RDDs
[ https://issues.apache.org/jira/browse/SPARK-25302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25302: Assignee: Apache Spark > ReducedWindowedDStream not using checkpoints for reduced RDDs > - > > Key: SPARK-25302 > URL: https://issues.apache.org/jira/browse/SPARK-25302 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, > 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Nikunj Bansal >Assignee: Apache Spark >Priority: Major > Labels: Streaming, streaming > > When using reduceByKeyAndWindow() using inverse reduce function, it > eventually creates a ReducedWindowedDStream. This class creates a > reducedDStream but only persists it and does not checkpoint it. The result is > that it ends up using cached RDDs and does not cut lineage to the input > DStream resulting in eventually caching the input RDDs for much longer than > they are needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.metadata
[ https://issues.apache.org/jira/browse/SPARK-25423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25423: Assignee: Yuming Wang (was: Apache Spark) > Output "dataFilters" in DataSourceScanExec.metadata > --- > > Key: SPARK-25423 > URL: https://issues.apache.org/jira/browse/SPARK-25423 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.5.0 >Reporter: Maryann Xue >Assignee: Yuming Wang >Priority: Trivial > Labels: starter > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25302) ReducedWindowedDStream not using checkpoints for reduced RDDs
[ https://issues.apache.org/jira/browse/SPARK-25302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617822#comment-16617822 ] Apache Spark commented on SPARK-25302: -- User 'nikunjb' has created a pull request for this issue: https://github.com/apache/spark/pull/22423 > ReducedWindowedDStream not using checkpoints for reduced RDDs > - > > Key: SPARK-25302 > URL: https://issues.apache.org/jira/browse/SPARK-25302 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, > 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Nikunj Bansal >Priority: Major > Labels: Streaming, streaming > > When using reduceByKeyAndWindow() using inverse reduce function, it > eventually creates a ReducedWindowedDStream. This class creates a > reducedDStream but only persists it and does not checkpoint it. The result is > that it ends up using cached RDDs and does not cut lineage to the input > DStream resulting in eventually caching the input RDDs for much longer than > they are needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25433) Add support for PEX in PySpark
[ https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25433: Assignee: Apache Spark > Add support for PEX in PySpark > -- > > Key: SPARK-25433 > URL: https://issues.apache.org/jira/browse/SPARK-25433 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.2 >Reporter: Fabian Höring >Assignee: Apache Spark >Priority: Minor > > The goal of this ticket is to ship and use custom code inside the spark > executors using [PEX|https://github.com/pantsbuild/pex] > This currently works fine with > [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html] > (disadvantages are that you have a separate conda package repo and ship the > python interpreter all the time) > Basically the workflow is > * to zip the local conda environment ([conda > pack|https://github.com/conda/conda-pack] also works) > * ship it to each executor as an archive > * modify PYSPARK_PYTHON to the local conda environment > I think it can work the same way with virtual env. There is the SPARK-13587 > ticket to provide nice entry points to spark-submit and SparkContext but > zipping your local virtual env and then just changing PYSPARK_PYTHON env > variable should already work. > I also have seen this > [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html]. > But recreating the virtual env each time doesn't seem to be a very scalable > solution. If you have hundreds of executors it will retrieve the packages on > each excecutor and recreate your virtual environment each time. Same problem > with this proposal SPARK-16367 from what I understood. > Another problem with virtual env is that your local environment is not easily > shippable to another machine. 
In particular there is the relocatable option > (see > [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable], > > [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)] > which makes it very complicated for the user to ship the virtual env and be > sure it works. > And here is where pex comes in. It is a nice way to create a single > executable zip file with all dependencies included. You have the pex command > line tool to build your package and when it is built you are sure it works. > This is in my opinion the most elegant way to ship python code (better than > virtual env and conda) > The problem why it doesn't work out of the box is that there can be only one > single entry point. So just shipping the pex files and setting PYSPARK_PYTHON > to the pex files doesn't work. You can nevertheless tune the env variable > [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables] > and runtime to provide different entry points. > PR: [https://github.com/apache/spark/pull/22422/files] > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25302) ReducedWindowedDStream not using checkpoints for reduced RDDs
[ https://issues.apache.org/jira/browse/SPARK-25302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25302: Assignee: (was: Apache Spark) > ReducedWindowedDStream not using checkpoints for reduced RDDs > - > > Key: SPARK-25302 > URL: https://issues.apache.org/jira/browse/SPARK-25302 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, > 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Nikunj Bansal >Priority: Major > Labels: Streaming, streaming > > When using reduceByKeyAndWindow() using inverse reduce function, it > eventually creates a ReducedWindowedDStream. This class creates a > reducedDStream but only persists it and does not checkpoint it. The result is > that it ends up using cached RDDs and does not cut lineage to the input > DStream resulting in eventually caching the input RDDs for much longer than > they are needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24800) Refactor Avro Serializer and Deserializer
[ https://issues.apache.org/jira/browse/SPARK-24800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24800: Description: Currently, in the Avro data source module, the Avro Deserializer converts input Avro format data to Row, and then converts the Row to InternalRow. The Avro Serializer converts InternalRow to Row, and then outputs Avro format data. To improve performance, we need to make a direct conversion between InternalRow and Avro format data. > Refactor Avro Serializer and Deserializer > - > > Key: SPARK-24800 > URL: https://issues.apache.org/jira/browse/SPARK-24800 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > > Currently, in the Avro data source module, the Avro Deserializer converts input > Avro format data to Row, and then converts the Row to InternalRow. The Avro > Serializer converts InternalRow to Row, and then outputs Avro format data. To > improve performance, we need to make a direct conversion between > InternalRow and Avro format data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25291) Flakiness of tests in terms of executor memory (SecretsTestSuite)
[ https://issues.apache.org/jira/browse/SPARK-25291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617617#comment-16617617 ] Apache Spark commented on SPARK-25291: -- User 'ifilonenko' has created a pull request for this issue: https://github.com/apache/spark/pull/22415 > Flakiness of tests in terms of executor memory (SecretsTestSuite) > - > > Key: SPARK-25291 > URL: https://issues.apache.org/jira/browse/SPARK-25291 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Ilan Filonenko >Priority: Major > > SecretsTestSuite shows flakiness in terms of correct setting of executor > memory: > Run SparkPi with env and mount secrets. *** FAILED *** > "[884]Mi" did not equal "[1408]Mi" (KubernetesSuite.scala:272) > When run with default settings -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25291) Flakiness of tests in terms of executor memory (SecretsTestSuite)
[ https://issues.apache.org/jira/browse/SPARK-25291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25291: Assignee: (was: Apache Spark) > Flakiness of tests in terms of executor memory (SecretsTestSuite) > - > > Key: SPARK-25291 > URL: https://issues.apache.org/jira/browse/SPARK-25291 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Ilan Filonenko >Priority: Major > > SecretsTestSuite shows flakiness in terms of correct setting of executor > memory: > Run SparkPi with env and mount secrets. *** FAILED *** > "[884]Mi" did not equal "[1408]Mi" (KubernetesSuite.scala:272) > When ran with default settings -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25291) Flakiness of tests in terms of executor memory (SecretsTestSuite)
[ https://issues.apache.org/jira/browse/SPARK-25291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25291: Assignee: Apache Spark > Flakiness of tests in terms of executor memory (SecretsTestSuite) > - > > Key: SPARK-25291 > URL: https://issues.apache.org/jira/browse/SPARK-25291 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Ilan Filonenko >Assignee: Apache Spark >Priority: Major > > SecretsTestSuite shows flakiness in terms of correct setting of executor > memory: > Run SparkPi with env and mount secrets. *** FAILED *** > "[884]Mi" did not equal "[1408]Mi" (KubernetesSuite.scala:272) > When ran with default settings -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23829) spark-sql-kafka source in spark 2.3 causes reading stream failure frequently
[ https://issues.apache.org/jira/browse/SPARK-23829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617594#comment-16617594 ] Gabor Somogyi commented on SPARK-23829: --- In 2.4 it's fixed, as it's using 2.0.0. I think an upgrade will solve this issue (if the description about versions is correct). > spark-sql-kafka source in spark 2.3 causes reading stream failure frequently > > > Key: SPARK-23829 > URL: https://issues.apache.org/jira/browse/SPARK-23829 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Norman Bai >Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > Spark 2.3 provides the source "spark-sql-kafka-0-10_2.11". > > When I wanted to read from my kafka-0.10.2.1 cluster, it frequently threw the error > "*java.util.concurrent.TimeoutException: Cannot fetch record for offset > in 12000 milliseconds*", and the job thus failed. > > I searched on Google & Stack Overflow for a while, and found many other people > who got this exception too, and nobody gave an answer. > > I debugged the source code and found nothing, but I guess it's because of the > kafka-clients dependency that spark-sql-kafka-0-10_2.11 is using. > > {code:xml}
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
>   <version>2.3.0</version>
>   <exclusions>
>     <exclusion>
>       <artifactId>kafka-clients</artifactId>
>       <groupId>org.apache.kafka</groupId>
>     </exclusion>
>   </exclusions>
> </dependency>
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka-clients</artifactId>
>   <version>0.10.2.1</version>
> </dependency>
> {code} > I excluded it in Maven, added another version, reran the code, and > now it works. > > I guess something is wrong with kafka-clients 0.10.0.1 working with > Kafka 0.10.2.1, or other Kafka versions. > > Hope for an explanation. > Here is the error stack. 
> {code:java} > [ERROR] 2018-03-30 13:34:11,404 [stream execution thread for [id = > 83076cf1-4bf0-4c82-a0b3-23d8432f5964, runId = > b3e18aa6-358f-43f6-a077-e34db0822df6]] > org.apache.spark.sql.execution.streaming.MicroBatchExecution logError - Query > [id = 83076cf1-4bf0-4c82-a0b3-23d8432f5964, runId = > b3e18aa6-358f-43f6-a077-e34db0822df6] terminated with error > org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in > stage 0.0 failed 1 times, most recent failure: Lost task 6.0 in stage 0.0 > (TID 6, localhost, executor driver): java.util.concurrent.TimeoutException: > Cannot fetch record for offset 6481521 in 12 milliseconds > at > org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:230) > at > org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:122) > at > org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68) > at > org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106) > at > org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:157) > at > org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:148) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) > at > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:107) > at > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:105) > at >
[jira] [Commented] (SPARK-25426) Remove the duplicate fallback logic in UnsafeProjection
[ https://issues.apache.org/jira/browse/SPARK-25426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617587#comment-16617587 ] Li Yuanjian commented on SPARK-25426: - Resolved by https://github.com/apache/spark/pull/22417. > Remove the duplicate fallback logic in UnsafeProjection > --- > > Key: SPARK-25426 > URL: https://issues.apache.org/jira/browse/SPARK-25426 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 2.5.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25448) [Spark Job History] Job Staged page shows 1000 Jobs only
ABHISHEK KUMAR GUPTA created SPARK-25448: Summary: [Spark Job History] Job Staged page shows 1000 Jobs only Key: SPARK-25448 URL: https://issues.apache.org/jira/browse/SPARK-25448 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.3.1 Environment: Server OS: SUSE 11 No. of Cluster Nodes: 6 Spark Version: 2.3.1 Reporter: ABHISHEK KUMAR GUPTA 1. Configure spark.ui.retainedJobs = 10 in the spark-defaults.conf file of Job History. 2. Submit 1 lakh (100,000) jobs from Beeline. 3. Go to the application ID from the Job History page's "Incomplete Applications" link. 4. The Jobs tab lists at most 1000 jobs under the application. Actual output: "Completed Jobs: 24952, only showing 952". The page should list all completed jobs, in this case 24952. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
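For context on step 1, the setting mentioned in the report lives in spark-defaults.conf; the 1000-job cap observed above matches the documented default of spark.ui.retainedJobs. The value below is illustrative, not taken from the report:

{code}
# spark-defaults.conf -- illustrative value; the documented default
# for spark.ui.retainedJobs is 1000, which matches the cap seen above
spark.ui.retainedJobs   100000
{code}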
[jira] [Comment Edited] (SPARK-23580) Interpreted mode fallback should be implemented for all expressions & projections
[ https://issues.apache.org/jira/browse/SPARK-23580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617473#comment-16617473 ] Wenchen Fan edited comment on SPARK-23580 at 9/17/18 12:57 PM: --- I'm re-targeting to `2.5.0`. There are more tickets coming: SafeProjection with fallback, Predicate with fallback, Ordering with fallback, etc. was (Author: cloud_fan): I'm adding `2.5.0` as a target version. There are more tickets coming: SafeProjection with fallback, Predicate with fallback, Ordering with fallback, etc. > Interpreted mode fallback should be implemented for all expressions & > projections > - > > Key: SPARK-23580 > URL: https://issues.apache.org/jira/browse/SPARK-23580 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > Labels: release-notes > > Spark SQL currently does not support interpreted mode for all expressions and > projections. This is a problem for scenarios where code generation does > not work, or blows past the JVM class limits. We currently cannot gracefully > fall back. > This ticket is an umbrella to fix this class of problem in Spark SQL. This > work can be divided into two main areas: > - Add interpreted versions for all dataset-related expressions. > - Add an interpreted version of {{GenerateUnsafeProjection}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
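The fallback pattern this umbrella describes can be summarized with a schematic Scala sketch: try the codegen path first, and construct an interpreted implementation when compilation fails. The names below mirror Spark internals but are illustrative stand-ins, not the exact internal API:

{code:scala}
// Illustrative sketch of codegen-with-interpreted-fallback (names are stand-ins):
def createProjection(exprs: Seq[Expression]): UnsafeProjection = {
  try {
    GenerateUnsafeProjection.generate(exprs)   // codegen path
  } catch {
    case _: Exception =>
      // e.g. generated method blows past the JVM 64KB limit, or Janino fails
      InterpretedUnsafeProjection.createProjection(exprs)
  }
}
{code}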
[jira] [Updated] (SPARK-23580) Interpreted mode fallback should be implemented for all expressions & projections
[ https://issues.apache.org/jira/browse/SPARK-23580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-23580: Target Version/s: 2.5.0 (was: 2.4.0, 2.5.0) > Interpreted mode fallback should be implemented for all expressions & > projections > - > > Key: SPARK-23580 > URL: https://issues.apache.org/jira/browse/SPARK-23580 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > Labels: release-notes > > Spark SQL currently does not support interpreted mode for all expressions and > projections. This is a problem for scenarios where code generation does > not work, or blows past the JVM class limits. We currently cannot gracefully > fall back. > This ticket is an umbrella to fix this class of problem in Spark SQL. This > work can be divided into two main areas: > - Add interpreted versions for all dataset-related expressions. > - Add an interpreted version of {{GenerateUnsafeProjection}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23580) Interpreted mode fallback should be implemented for all expressions & projections
[ https://issues.apache.org/jira/browse/SPARK-23580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617473#comment-16617473 ] Wenchen Fan commented on SPARK-23580: - I'm adding `2.5.0` as a target version. There are more tickets coming: SafeProjection with fallback, Predicate with fallback, Ordering with fallback, etc. > Interpreted mode fallback should be implemented for all expressions & > projections > - > > Key: SPARK-23580 > URL: https://issues.apache.org/jira/browse/SPARK-23580 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > Labels: release-notes > > Spark SQL currently does not support interpreted mode for all expressions and > projections. This is a problem for scenarios where code generation does > not work, or blows past the JVM class limits. We currently cannot gracefully > fall back. > This ticket is an umbrella to fix this class of problem in Spark SQL. This > work can be divided into two main areas: > - Add interpreted versions for all dataset-related expressions. > - Add an interpreted version of {{GenerateUnsafeProjection}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23580) Interpreted mode fallback should be implemented for all expressions & projections
[ https://issues.apache.org/jira/browse/SPARK-23580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-23580: Target Version/s: 2.4.0, 2.5.0 (was: 2.4.0, 3.0.0) > Interpreted mode fallback should be implemented for all expressions & > projections > - > > Key: SPARK-23580 > URL: https://issues.apache.org/jira/browse/SPARK-23580 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > Labels: release-notes > > Spark SQL currently does not support interpreted mode for all expressions and > projections. This is a problem for scenarios where code generation does > not work, or blows past the JVM class limits. We currently cannot gracefully > fall back. > This ticket is an umbrella to fix this class of problem in Spark SQL. This > work can be divided into two main areas: > - Add interpreted versions for all dataset-related expressions. > - Add an interpreted version of {{GenerateUnsafeProjection}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23580) Interpreted mode fallback should be implemented for all expressions & projections
[ https://issues.apache.org/jira/browse/SPARK-23580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-23580: Target Version/s: 2.4.0, 3.0.0 (was: 3.0.0) > Interpreted mode fallback should be implemented for all expressions & > projections > - > > Key: SPARK-23580 > URL: https://issues.apache.org/jira/browse/SPARK-23580 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > Labels: release-notes > > Spark SQL currently does not support interpreted mode for all expressions and > projections. This is a problem for scenarios where code generation does > not work, or blows past the JVM class limits. We currently cannot gracefully > fall back. > This ticket is an umbrella to fix this class of problem in Spark SQL. This > work can be divided into two main areas: > - Add interpreted versions for all dataset-related expressions. > - Add an interpreted version of {{GenerateUnsafeProjection}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23580) Interpreted mode fallback should be implemented for all expressions & projections
[ https://issues.apache.org/jira/browse/SPARK-23580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-23580: Target Version/s: 3.0.0 (was: 2.4.0) > Interpreted mode fallback should be implemented for all expressions & > projections > - > > Key: SPARK-23580 > URL: https://issues.apache.org/jira/browse/SPARK-23580 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > Labels: release-notes > > Spark SQL currently does not support interpreted mode for all expressions and > projections. This is a problem for scenarios where code generation does > not work, or blows past the JVM class limits. We currently cannot gracefully > fall back. > This ticket is an umbrella to fix this class of problem in Spark SQL. This > work can be divided into two main areas: > - Add interpreted versions for all dataset-related expressions. > - Add an interpreted version of {{GenerateUnsafeProjection}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25447) Support JSON options by schema_of_json
[ https://issues.apache.org/jira/browse/SPARK-25447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617462#comment-16617462 ] Apache Spark commented on SPARK-25447: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/22442 > Support JSON options by schema_of_json > -- > > Key: SPARK-25447 > URL: https://issues.apache.org/jira/browse/SPARK-25447 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > The function schema_of_json doesn't currently accept any options, but options > can impact schema inference. It needs to support the same options that > from_json() uses for schema inference. Here are examples of options that > could impact schema inference: > * primitivesAsString > * prefersDecimal > * allowComments > * allowUnquotedFieldNames > * allowSingleQuotes > * allowNumericLeadingZeros > * allowNonNumericNumbers > * allowBackslashEscapingAnyCharacter > * allowUnquotedControlChars > Below is a possible signature: > {code:scala} > def schema_of_json(e: Column, options: java.util.Map[String, String]): Column > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25447) Support JSON options by schema_of_json
[ https://issues.apache.org/jira/browse/SPARK-25447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617461#comment-16617461 ] Apache Spark commented on SPARK-25447: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/22442 > Support JSON options by schema_of_json > -- > > Key: SPARK-25447 > URL: https://issues.apache.org/jira/browse/SPARK-25447 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > The function schema_of_json doesn't currently accept any options, but options > can impact schema inference. It needs to support the same options that > from_json() uses for schema inference. Here are examples of options that > could impact schema inference: > * primitivesAsString > * prefersDecimal > * allowComments > * allowUnquotedFieldNames > * allowSingleQuotes > * allowNumericLeadingZeros > * allowNonNumericNumbers > * allowBackslashEscapingAnyCharacter > * allowUnquotedControlChars > Below is a possible signature: > {code:scala} > def schema_of_json(e: Column, options: java.util.Map[String, String]): Column > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25447) Support JSON options by schema_of_json
[ https://issues.apache.org/jira/browse/SPARK-25447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25447: Assignee: Apache Spark > Support JSON options by schema_of_json > -- > > Key: SPARK-25447 > URL: https://issues.apache.org/jira/browse/SPARK-25447 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > The function schema_of_json doesn't currently accept any options, but options > can impact schema inference. It needs to support the same options that > from_json() uses for schema inference. Here are examples of options that > could impact schema inference: > * primitivesAsString > * prefersDecimal > * allowComments > * allowUnquotedFieldNames > * allowSingleQuotes > * allowNumericLeadingZeros > * allowNonNumericNumbers > * allowBackslashEscapingAnyCharacter > * allowUnquotedControlChars > Below is a possible signature: > {code:scala} > def schema_of_json(e: Column, options: java.util.Map[String, String]): Column > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25447) Support JSON options by schema_of_json
[ https://issues.apache.org/jira/browse/SPARK-25447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25447: Assignee: (was: Apache Spark) > Support JSON options by schema_of_json > -- > > Key: SPARK-25447 > URL: https://issues.apache.org/jira/browse/SPARK-25447 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > The function schema_of_json doesn't currently accept any options, but options > can impact schema inference. It needs to support the same options that > from_json() uses for schema inference. Here are examples of options that > could impact schema inference: > * primitivesAsString > * prefersDecimal > * allowComments > * allowUnquotedFieldNames > * allowSingleQuotes > * allowNumericLeadingZeros > * allowNonNumericNumbers > * allowBackslashEscapingAnyCharacter > * allowUnquotedControlChars > Below is a possible signature: > {code:scala} > def schema_of_json(e: Column, options: java.util.Map[String, String]): Column > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25447) Support JSON options by schema_of_json
Maxim Gekk created SPARK-25447: -- Summary: Support JSON options by schema_of_json Key: SPARK-25447 URL: https://issues.apache.org/jira/browse/SPARK-25447 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk The function schema_of_json doesn't currently accept any options, but options can impact schema inference. It needs to support the same options that from_json() uses for schema inference. Here are examples of options that could impact schema inference: * primitivesAsString * prefersDecimal * allowComments * allowUnquotedFieldNames * allowSingleQuotes * allowNumericLeadingZeros * allowNonNumericNumbers * allowBackslashEscapingAnyCharacter * allowUnquotedControlChars Below is a possible signature: {code:scala} def schema_of_json(e: Column, options: java.util.Map[String, String]): Column {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
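A hedged sketch of how the proposed overload might be called, assuming it is added with the signature above. This API does not exist yet, so the snippet is illustrative rather than runnable against current Spark, and the effect of the chosen option is assumed from from_json's option handling:

{code:scala}
import org.apache.spark.sql.functions.{lit, schema_of_json}
import scala.collection.JavaConverters._

// Pass the same JSON options that from_json accepts, so that they
// influence the inferred schema (allowNumericLeadingZeros is one of
// the options listed above that changes how "01" is typed).
val options = Map("allowNumericLeadingZeros" -> "true").asJava
val inferred = df.select(schema_of_json(lit("""{"id": 01}"""), options))
{code}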
[jira] [Updated] (SPARK-25431) Fix function examples and unify the format of the example results.
[ https://issues.apache.org/jira/browse/SPARK-25431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-25431: - Fix Version/s: 2.4.1 > Fix function examples and unify the format of the example results. > -- > > Key: SPARK-25431 > URL: https://issues.apache.org/jira/browse/SPARK-25431 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Minor > Fix For: 3.0.0, 2.4.1 > > > There are some mistakes in the examples of newly added functions. Also, the format > of the example results is not unified. We should fix and unify them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25431) Fix function examples and unify the format of the example results.
[ https://issues.apache.org/jira/browse/SPARK-25431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25431. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22437 [https://github.com/apache/spark/pull/22437] > Fix function examples and unify the format of the example results. > -- > > Key: SPARK-25431 > URL: https://issues.apache.org/jira/browse/SPARK-25431 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Minor > Fix For: 3.0.0 > > > There are some mistakes in the examples of newly added functions. Also, the format > of the example results is not unified. We should fix and unify them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25374) SafeProjection supports fallback to an interpreted mode
[ https://issues.apache.org/jira/browse/SPARK-25374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617453#comment-16617453 ] Liang-Chi Hsieh commented on SPARK-25374: - I do think so. > SafeProjection supports fallback to an interpreted mode > --- > > Key: SPARK-25374 > URL: https://issues.apache.org/jira/browse/SPARK-25374 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > In SPARK-23711, UnsafeProjection supports fallback to an interpreted mode. > SafeProjection needs to support this, too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25446) Add schema_of_json() to R
Maxim Gekk created SPARK-25446: -- Summary: Add schema_of_json() to R Key: SPARK-25446 URL: https://issues.apache.org/jira/browse/SPARK-25446 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk The function schema_of_json() is exposed in Scala/Java and Python but not in R. It needs to be added to R too. The function declaration can be found here: https://github.com/apache/spark/blob/d749d034a80f528932f613ac97f13cfb99acd207/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3612 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25374) SafeProjection supports fallback to an interpreted mode
[ https://issues.apache.org/jira/browse/SPARK-25374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617437#comment-16617437 ] Takeshi Yamamuro commented on SPARK-25374: -- Though I do not have a strong opinion, I feel it is too late to push this into 2.4. > SafeProjection supports fallback to an interpreted mode > --- > > Key: SPARK-25374 > URL: https://issues.apache.org/jira/browse/SPARK-25374 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > In SPARK-23711, UnsafeProjection supports fallback to an interpreted mode. > SafeProjection needs to support this, too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25443) fix issues when building docs with release scripts in docker
[ https://issues.apache.org/jira/browse/SPARK-25443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617428#comment-16617428 ] Apache Spark commented on SPARK-25443: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/22438 > fix issues when building docs with release scripts in docker > > > Key: SPARK-25443 > URL: https://issues.apache.org/jira/browse/SPARK-25443 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25443) fix issues when building docs with release scripts in docker
[ https://issues.apache.org/jira/browse/SPARK-25443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25443: Assignee: Wenchen Fan (was: Apache Spark) > fix issues when building docs with release scripts in docker > > > Key: SPARK-25443 > URL: https://issues.apache.org/jira/browse/SPARK-25443 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25443) fix issues when building docs with release scripts in docker
[ https://issues.apache.org/jira/browse/SPARK-25443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25443: Assignee: Apache Spark (was: Wenchen Fan) > fix issues when building docs with release scripts in docker > > > Key: SPARK-25443 > URL: https://issues.apache.org/jira/browse/SPARK-25443 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25443) fix issues when building docs with release scripts in docker
[ https://issues.apache.org/jira/browse/SPARK-25443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617427#comment-16617427 ] Apache Spark commented on SPARK-25443: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/22438 > fix issues when building docs with release scripts in docker > > > Key: SPARK-25443 > URL: https://issues.apache.org/jira/browse/SPARK-25443 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25433) Add support for PEX in PySpark
[ https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fabian Höring updated SPARK-25433: -- Description: The goal of this ticket is to ship and use custom code inside the Spark executors. This currently works fine with [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html] (disadvantages are that you have a separate conda package repo and ship the Python interpreter all the time). Basically the workflow is: * zip the local conda environment ([conda pack|https://github.com/conda/conda-pack] also works) * ship it to each executor as an archive * modify PYSPARK_PYTHON to the local conda environment I think it can work the same way with virtual env. There is the SPARK-13587 ticket to provide nice entry points to spark-submit and SparkContext, but zipping your local virtual env and then just changing the PYSPARK_PYTHON env variable should already work. I have also seen this [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html]. But recreating the virtual env each time doesn't seem to be a very scalable solution. If you have hundreds of executors, it will retrieve the packages on each executor and recreate your virtual environment each time. Same problem with this proposal SPARK-16367, from what I understood. Another problem with virtual env is that your local environment is not easily shippable to another machine. In particular, there is the relocatable option (see [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable], [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work]) which makes it very complicated for the user to ship the virtual env and be sure it works. And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a nice way to create a single executable zip file with all dependencies included.
You have the pex command line tool to build your package, and when it is built you are sure it works. This is, in my opinion, the most elegant way to ship Python code (better than virtual env and conda). The reason it doesn't work out of the box is that there can be only a single entry point. So just shipping the pex files and setting PYSPARK_PYTHON to the pex files doesn't work. You can nevertheless tune the env variable [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables] at runtime to provide different entry points. PR: [https://github.com/apache/spark/pull/22422/files] was: The goal of this ticket is to ship and use custom code inside the Spark executors. This currently works fine with [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html] (disadvantages are that you have a separate conda package repo and ship the Python interpreter all the time). Basically the workflow is: * zip the local conda environment ([conda pack|https://github.com/conda/conda-pack] also works) * ship it to each executor as an archive * modify PYSPARK_PYTHON to the local conda environment I think it can work the same way with virtual env. There is the SPARK-13587 ticket to provide nice entry points to spark-submit and SparkContext, but zipping your local virtual env and then just changing PYSPARK_PYTHON should already work. I have also seen this [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html]. But recreating the virtual env each time doesn't seem to be a very scalable solution. If you have hundreds of executors, it will retrieve the packages on each executor and recreate your virtual environment each time. Same problem with this proposal SPARK-16367, from what I understood. Another problem with virtual env is that your local environment is not easily shippable to another machine.
In particular, there is the relocatable option (see [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable], [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work]) which makes it very complicated for the user to ship the virtual env and be sure it works. And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a nice way to create a single executable zip file with all dependencies included. You have the pex command line tool to build your package, and when it is built you are sure it works. This is, in my opinion, the most elegant way to ship Python code (better than virtual env and conda). The reason it doesn't work out of the box is that there can be only a single entry point. So just shipping the pex files and setting PYSPARK_PYTHON to the pex files doesn't work. You can nevertheless tune the env variable [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables] at runtime to
[jira] [Updated] (SPARK-25433) Add support for PEX in PySpark
[ https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fabian Höring updated SPARK-25433: -- Description: The goal of this ticket is to ship and use custom code inside the Spark executors using [PEX|https://github.com/pantsbuild/pex]. This currently works fine with [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html] (disadvantages are that you have a separate conda package repo and ship the Python interpreter all the time). Basically the workflow is: * zip the local conda environment ([conda pack|https://github.com/conda/conda-pack] also works) * ship it to each executor as an archive * modify PYSPARK_PYTHON to the local conda environment I think it can work the same way with virtual env. There is the SPARK-13587 ticket to provide nice entry points to spark-submit and SparkContext, but zipping your local virtual env and then just changing the PYSPARK_PYTHON env variable should already work. I have also seen this [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html]. But recreating the virtual env each time doesn't seem to be a very scalable solution. If you have hundreds of executors, it will retrieve the packages on each executor and recreate your virtual environment each time. Same problem with this proposal SPARK-16367, from what I understood. Another problem with virtual env is that your local environment is not easily shippable to another machine. In particular, there is the relocatable option (see [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable], [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work]) which makes it very complicated for the user to ship the virtual env and be sure it works. And here is where pex comes in. It is a nice way to create a single executable zip file with all dependencies included.
You have the pex command line tool to build your package and when it is built you are sure it works. This is in my opinion the most elegant way to ship python code (better than virtual env and conda) The problem why it doesn't work out of the box is that there can be only one single entry point. So just shipping the pex files and setting PYSPARK_PYTHON to the pex files doesn't work. You can nevertheless tune the env variable [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables] and runtime to provide different entry points. PR: [https://github.com/apache/spark/pull/22422/files] was: The goal of this ticket is to ship and use custom code inside the spark executors. This currently works fine with [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html] (disadvantages are that you have a separate conda package repo and ship the python interpreter all the time) Basically the workflow is * to zip the local conda environment ([conda pack|https://github.com/conda/conda-pack] also works) * ship it to each executor as an archive * modify PYSPARK_PYTHON to the local conda environment I think it can work the same way with virtual env. There is the SPARK-13587 ticket to provide nice entry points to spark-submit and SparkContext but zipping your local virtual env and then just changing PYSPARK_PYTHON env variable should already work. I also have seen this [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html]. But recreating the virtual env each time doesn't seem to be a very scalable solution. If you have hundreds of executors it will retrieve the packages on each excecutor and recreate your virtual environment each time. Same problem with this proposal SPARK-16367 from what I understood. Another problem with virtual env is that your local environment is not easily shippable to another machine. 
In particular there is the relocatable option (see [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable], [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)] which makes it very complicated for the user to ship the virtual env and be sure it works. And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a nice way to create a single executable zip file with all dependencies included. You have the pex command line tool to build your package and when it is built you are sure it works. This is in my opinion the most elegant way to ship python code (better than virtual env and conda) The problem why it doesn't work out of the box is that there can be only one single entry point. So just shipping the pex files and setting PYSPARK_PYTHON to the pex files doesn't work. You can nevertheless tune the env variable [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
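The conda workflow described above can be sketched with spark-submit's --archives option. This is a rough sketch, not taken from the ticket's PR: the environment name, archive alias, and script name are placeholders.

```shell
# Pack the local conda environment into a relocatable archive
# (requires conda-pack: pip install conda-pack)
conda pack -n my_env -o my_env.tar.gz

# Ship the archive to every executor and point PYSPARK_PYTHON at the
# Python interpreter inside the unpacked archive.
PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  --archives my_env.tar.gz#environment \
  my_script.py
```

On YARN the archive is unpacked into each executor's working directory under the alias given after the `#`, which is why a relative PYSPARK_PYTHON path works.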
[jira] [Commented] (SPARK-25431) Fix function examples and unify the format of the example results.
[ https://issues.apache.org/jira/browse/SPARK-25431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617418#comment-16617418 ] Apache Spark commented on SPARK-25431: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/22437 > Fix function examples and unify the format of the example results. > -- > > Key: SPARK-25431 > URL: https://issues.apache.org/jira/browse/SPARK-25431 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Minor > > There are some mistakes in the examples of newly added functions, and the format of the example results is not unified. We should fix and unify them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25431) Fix function examples and unify the format of the example results.
[ https://issues.apache.org/jira/browse/SPARK-25431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25431: Assignee: Apache Spark (was: Takuya Ueshin) > Fix function examples and unify the format of the example results. > -- > > Key: SPARK-25431 > URL: https://issues.apache.org/jira/browse/SPARK-25431 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Minor > > There are some mistakes in the examples of newly added functions, and the format of the example results is not unified. We should fix and unify them.
[jira] [Assigned] (SPARK-25431) Fix function examples and unify the format of the example results.
[ https://issues.apache.org/jira/browse/SPARK-25431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25431: Assignee: Takuya Ueshin (was: Apache Spark) > Fix function examples and unify the format of the example results. > -- > > Key: SPARK-25431 > URL: https://issues.apache.org/jira/browse/SPARK-25431 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Minor > > There are some mistakes in the examples of newly added functions, and the format of the example results is not unified. We should fix and unify them.
[jira] [Comment Edited] (SPARK-25433) Add support for PEX in PySpark
[ https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617399#comment-16617399 ] Fabian Höring edited comment on SPARK-25433 at 9/17/18 11:40 AM: - [~hyukjin.kwon] I changed the description of the ticket including links to existing attempts. was (Author: fhoering): [~hyukjin.kwon] I changed the description of the ticket including link to existing attempts.
> Add support for PEX in PySpark
> --
>
> Key: SPARK-25433
> URL: https://issues.apache.org/jira/browse/SPARK-25433
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 2.2.2
> Reporter: Fabian Höring
> Priority: Minor
>
> The goal of this ticket is to ship and use custom code inside the Spark executors.
> This currently works fine with [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html] (disadvantages are that you have a separate conda package repo and ship the Python interpreter all the time).
> Basically the workflow is
> * to zip the local conda environment ([conda pack|https://github.com/conda/conda-pack] also works)
> * ship it to each executor as an archive
> * modify PYSPARK_PYTHON to the local conda environment
> I think it can work the same way with virtualenv. There is the SPARK-13587 ticket to provide nice entry points to spark-submit and SparkContext, but zipping your local virtualenv and then just changing PYSPARK_PYTHON should already work.
> I have also seen this [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html], but recreating the virtualenv each time doesn't seem to be a very scalable solution. If you have hundreds of executors, each executor will retrieve the packages and recreate your virtual environment every time. Same problem with the proposal in SPARK-16367, from what I understood.
> Another problem with virtualenv is that your local environment is not easily shippable to another machine. In particular there is the relocatable option (see [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable], [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work]), which makes it very complicated for the user to ship the virtualenv and be sure it works.
> And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a nice way to create a single executable zip file with all dependencies included. You have the pex command line tool to build your package, and once it is built you are sure it works. This is in my opinion the most elegant way to ship Python code (better than virtualenv and conda).
> The reason it doesn't work out of the box is that a pex file has only one single entry point. So just shipping the pex files and setting PYSPARK_PYTHON to them doesn't work. You can nevertheless tune the env variable [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables] at runtime to provide different entry points.
> PR: [https://github.com/apache/spark/pull/22422/files]
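The PEX_MODULE workaround mentioned in the ticket might look roughly as follows. This is a hedged sketch, not the PR's actual implementation: the package list, pex file name, script name, and the module used for PEX_MODULE are all illustrative.

```shell
# Build a single self-contained executable zip with all dependencies
pex numpy pandas -o my_env.pex

# A pex has a single entry point, so pointing PYSPARK_PYTHON at it is
# not enough on its own; the PEX_MODULE env variable can override the
# entry point at runtime so the pex starts the module PySpark expects
# on the executors (the module name below is illustrative).
PYSPARK_PYTHON=./my_env.pex \
spark-submit \
  --master yarn \
  --deploy-mode client \
  --files my_env.pex \
  --conf spark.executorEnv.PEX_MODULE=pyspark.daemon \
  my_script.py
```

Unlike the conda archive, the pex file ships no separate interpreter directory: the single file is both the environment and the executable.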
[jira] [Updated] (SPARK-25433) Add support for PEX in PySpark
[ https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fabian Höring updated SPARK-25433: -- Description: The goal of this ticket is to ship and use custom code inside the spark executors. This currently works fine with [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html] (disadvantages are that you have a separate conda package repo and ship the python interpreter all the time.) Basically the workflow is * to zip the local conda environment ([conda pack|https://github.com/conda/conda-pack] also works) * ship it to each executor as an archive * modify PYSPARK_PYTHON to the local conda environment I think it can work the same way with virtual env. There is the SPARK-13587 ticket to provide nice entry points to spark-submit and SparkContext but zipping your local virtual env and then just changing the PYSPARK_PYTHON should already work. I also have seen this [blogpost.|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html]. But recreating the virtual env each time doesn't seem to be a very scalable solution. If you have hundreds of executors it will retrieve the package from each excecutor and recreate your virtual environment each time. Same problem with this proposal SPARK-16367 from what I understood. Another problem with virtual env is that your local environment is not easily shippable to another machine. In particular there is the relocatable option (see [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable], [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)] which makes it very complicated for the user to ship the virtual env and be sure it works. And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a nice way to create a single executable zip file with all dependencies included. 
You have the pex command line tool to build your package and when it is built you are sure it works. This is in my opinion the most elegant way to ship python code (better than virtual env and conda) The problem why it doesn't work out of the box is that there can be only one single entry point. So just shipping the pex files and setting PYSPARK_PYTHON to the pex files doesn't work. You can nevertheless tune the env variable [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables] and runtime to provide different entry points. PR: [https://github.com/apache/spark/pull/22422/files] was: The goal of this ticket is to ship and use custom code inside the spark executors. This currently works fine with [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]: Basically the workflow is * to zip the local conda environment ([conda pack|https://github.com/conda/conda-pack] also works) * ship it to each executor as an archive * modify PYSPARK_PYTHON to the local conda environment I think it can work the same way with virtual env. There is the SPARK-13587 ticket to provide nice entry points to spark-submit and SparkContext but zipping your local virtual env and then just changing the PYSPARK_PYTHON should already work. I also have seen this [blogpost.|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html]. But recreating the virtual env each time doesn't seem to be a very scalable solution. If you have hundreds of executors it will retrieve the package from each excecutor and recreate your virtual environment each time. Same problem with this proposal SPARK-16367 from what I understood. Another problem with virtual env is that your local environment is not easily shippable to another machine. 
In particular there is the relocatable option (see [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable], [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)] which makes it very complicated for the user to ship the virtual env and be sure it works. And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a nice way to create a single executable zip file with all dependencies included. You have the pex command line tool to build your package and when it is built you are sure it works. This is in my opinion the most elegant way to ship python code (better than virtual env and conda) The problem why it doesn't work out of the box is that there can be only one single entry point. So just shipping the pex files and setting PYSPARK_PYTHON to the pex files doesn't work. You can nevertheless tune the env variable [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables] and runtime to provide different entry points. PR: [https://github.com/apache/spark/pull/22422/files] > Add support for
[jira] [Updated] (SPARK-25433) Add support for PEX in PySpark
[ https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fabian Höring updated SPARK-25433: -- Description: The goal of this ticket is to ship and use custom code inside the Spark executors.

This currently works fine with [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]. Basically the workflow is:
* zip the local conda environment ([conda pack|https://github.com/conda/conda-pack] also works)
* ship it to each executor as an archive
* point PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtualenv. There is the SPARK-13587 ticket to provide nice entry points to spark-submit and SparkContext, but zipping your local virtualenv and then just changing PYSPARK_PYTHON should already work. I have also seen this [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html], but recreating the virtualenv each time doesn't seem to be a very scalable solution: with hundreds of executors, every executor retrieves the packages and recreates the virtual environment each time. The proposal in SPARK-16367 has the same problem, from what I understood.

Another problem with virtualenv is that your local environment is not easily shippable to another machine. In particular, the relocatable option (see [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable] and [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work]) makes it very complicated for the user to ship the virtualenv and be sure it works.

This is where [pex|https://github.com/pantsbuild/pex] comes in. It is a nice way to create a single executable zip file with all dependencies included. You have the pex command line tool to build your package, and once it is built you are sure it works. This is, in my opinion, the most elegant way to ship Python code (better than virtualenv and conda).

The reason it doesn't work out of the box is that a pex file has only one single entry point, so just shipping the pex file and setting PYSPARK_PYTHON to it doesn't work. You can nevertheless set the [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables] environment variable at runtime to provide different entry points.

PR: [https://github.com/apache/spark/pull/22422/files]
> Add support for PEX in PySpark > -- > > Key: SPARK-25433 > URL:
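The two workflows discussed in this ticket can be sketched as shell commands. This is only a rough illustration: all file names, environment names, package names, and the PEX_MODULE value below are illustrative assumptions, not taken from the ticket or the PR.

```shell
# --- Conda workflow (the one that works today) ---
# Pack the local conda environment into a relocatable archive.
conda pack -n my_env -o my_env.tar.gz

# Ship the archive to every executor and point PYSPARK_PYTHON at the
# interpreter inside the unpacked archive.
PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
  --master yarn \
  --archives my_env.tar.gz#environment \
  my_app.py

# --- PEX workflow (what this ticket proposes to support) ---
# Build a single self-contained executable zip with all dependencies.
pex numpy pandas -o my_app.pex

# A pex file has a single entry point, so setting PYSPARK_PYTHON alone
# is not enough; the PEX_MODULE environment variable can select a
# different entry point at runtime (module name here is illustrative).
PEX_MODULE=pyspark.daemon \
PYSPARK_PYTHON=./my_app.pex \
spark-submit \
  --master yarn \
  --files my_app.pex \
  my_app.py
```

The conda half mirrors the workflow the description lists (pack, ship as archive, redirect PYSPARK_PYTHON); the pex half shows why an extra knob like PEX_MODULE is needed at all.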
[jira] [Resolved] (SPARK-25427) Add BloomFilter creation test cases
[ https://issues.apache.org/jira/browse/SPARK-25427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25427. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22418 [https://github.com/apache/spark/pull/22418] > Add BloomFilter creation test cases > --- > > Key: SPARK-25427 > URL: https://issues.apache.org/jira/browse/SPARK-25427 > Project: Spark > Issue Type: Bug > Components: SQL, Tests > Affects Versions: 2.3.2, 2.4.0 > Reporter: Dongjoon Hyun > Assignee: Dongjoon Hyun > Priority: Major > Fix For: 2.4.0 > > > Spark supports BloomFilter creation for ORC files. This issue aims to add > test coverage to prevent regressions like SPARK-12417. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25433) Add support for PEX in PySpark
[ https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fabian Höring updated SPARK-25433: -- Description: The goal of this ticket is to ship and use custom code inside the Spark executors.

This currently works fine with [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]. Basically the workflow is:
* zip the local conda environment ([conda pack|https://github.com/conda/conda-pack] also works)
* ship it to each executor as an archive
* point PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtualenv. There is the SPARK-13587 ticket to provide nice entry points to spark-submit and SparkContext, but zipping your local virtualenv and then just changing PYSPARK_PYTHON should already work. I have also seen this [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html], but recreating the virtualenv each time doesn't seem to be a very scalable solution: with hundreds of executors, every executor retrieves the packages and recreates the virtual environment each time. Same problem with [this proposal|https://issues.apache.org/jira/browse/SPARK-16367], from what I understood.

Another problem with virtualenv is that your local environment is not easily shippable to another machine. In particular, the relocatable option (see [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable] and [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work]) makes it very complicated for the user to ship the virtualenv and be sure it works.

This is where [pex|https://github.com/pantsbuild/pex] comes in. It is a nice way to create a single executable zip file with all dependencies included. You have the pex command line tool to build your package, and once it is built you are sure it works. This is, in my opinion, the most elegant way to ship Python code (better than virtualenv and conda).

The reason it doesn't work out of the box is that a pex file has only one single entry point, so just shipping the pex file and setting PYSPARK_PYTHON to it doesn't work. You can nevertheless set the [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables] environment variable at runtime to provide different entry points.

PR: [https://github.com/apache/spark/pull/22422/files]

was: This has been partly discussed in SPARK-13587. I would like to provision the executors with a PEX package, and I created a PR with the minimal necessary changes in PythonWorkerFactory. PR: [https://github.com/apache/spark/pull/22422/files] To run it, one needs to set the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON variables to the pex file and upload the pex file to the executors via sparkContext.addFile or via the spark.yarn.dist.files/spark.files config properties. It is also necessary to set the PEX_ROOT environment variable: by default the executors try to access /home/.pex, and this fails. Ideally, as this configuration is quite cumbersome, it would be interesting to also add a --pexFile parameter to SparkContext and spark-submit in order to directly provide a pex file and have everything else handled. Please tell me what you think of this.
> Add support for PEX in PySpark > -- > > Key: SPARK-25433 > URL: https://issues.apache.org/jira/browse/SPARK-25433 > Project: Spark > Issue Type: Improvement > Components: PySpark > Affects Versions: 2.2.2 > Reporter: Fabian Höring > Priority: Minor > > The goal of this ticket is to ship and use custom code inside the spark executors.
> This currently works fine with [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]. Basically the workflow is:
> * zip the local conda environment ([conda pack|https://github.com/conda/conda-pack] also works)
> * ship it to each executor as an archive
> * point PYSPARK_PYTHON to the local conda environment
> I think it can work the same way with virtualenv. There is the SPARK-13587 ticket to provide nice entry points to spark-submit and SparkContext, but zipping your local virtualenv and then just changing PYSPARK_PYTHON should already work.
> I have also seen this [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html], but recreating the virtualenv each time doesn't seem to be a very scalable solution. If you have hundreds of executors, every executor retrieves the packages and recreates the virtual environment each time. Same problem with [this proposal|https://issues.apache.org/jira/browse/SPARK-16367]