[jira] [Updated] (SPARK-39817) Missing sbin scripts in PySpark packages
[ https://issues.apache.org/jira/browse/SPARK-39817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

F. H. updated SPARK-39817:
--------------------------
    Description:

In the PySpark setup.py, only a subset of all scripts is included. In particular, I'm missing the `submit-all.sh` script:

{code:python}
package_data={
    'pyspark.jars': ['*.jar'],
    'pyspark.bin': ['*'],
    'pyspark.sbin': ['spark-config.sh', 'spark-daemon.sh',
                     'start-history-server.sh',
                     'stop-history-server.sh', ],
    [...]
},
{code}

The solution is super simple: just change 'pyspark.sbin' to:

{code:python}
'pyspark.sbin': ['*'],
{code}

I would happily submit a PR on GitHub, but I have no clue about the organizational details.
It would be great to get this backported to PySpark 3.2.x as well as 3.3.x soon.

> Missing sbin scripts in PySpark packages
> -----------------------------------------
>
>                 Key: SPARK-39817
>                 URL: https://issues.apache.org/jira/browse/SPARK-39817
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>            Reporter: F. H.
>            Priority: Major
>              Labels: easyfix
>   Original Estimate: 5m
>  Remaining Estimate: 5m
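For anyone wanting to verify the report, a minimal sketch of how to check which sbin scripts a pip-installed PySpark actually ships; it assumes the sbin/ directory sits next to the pyspark package, as implied by the package_data entries above.

{code:python}
# Minimal sketch (assumption: sbin/ is bundled next to the pyspark package,
# per the package_data configuration quoted in the issue).
import os
import pyspark

sbin_dir = os.path.join(os.path.dirname(pyspark.__file__), "sbin")
print(sorted(os.listdir(sbin_dir)))  # should list far fewer scripts than a full Spark distribution
{code}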
[jira] [Created] (SPARK-39817) Missing sbin scripts in PySpark packages
F. H. created SPARK-39817:
--------------------------

             Summary: Missing sbin scripts in PySpark packages
                 Key: SPARK-39817
                 URL: https://issues.apache.org/jira/browse/SPARK-39817
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.2.2, 3.3.0, 3.2.1, 3.2.0
            Reporter: F. H.

In the PySpark setup.py, only a subset of all scripts is included. In particular, I'm missing the `submit-all.sh` script:

{code:python}
package_data={
    'pyspark.jars': ['*.jar'],
    'pyspark.bin': ['*'],
    'pyspark.sbin': ['spark-config.sh', 'spark-daemon.sh',
                     'start-history-server.sh',
                     'stop-history-server.sh', ],
    [...]
},
{code}

The solution is super simple: just change 'pyspark.sbin' to:

{code:python}
'pyspark.sbin': ['*'],
{code}

I would happily submit a PR on GitHub, but I have no clue about the organizational details.
It would be great to get this backported to PySpark 3.2.x as well as 3.3.x soon.
[jira] [Commented] (SPARK-18591) Replace hash-based aggregates with sort-based ones if inputs already sorted
[ https://issues.apache.org/jira/browse/SPARK-18591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476233#comment-17476233 ]

F. H. commented on SPARK-18591:
-------------------------------

I just ran into the same behavior. Even if I put `.sort(*group_keys)` directly before the groupBy, Spark still hash-repartitions. Is there a solution to this?

> Replace hash-based aggregates with sort-based ones if inputs already sorted
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-18591
>                 URL: https://issues.apache.org/jira/browse/SPARK-18591
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Takeshi Yamamuro
>            Priority: Major
>              Labels: bulk-closed
>
> Spark currently uses sort-based aggregates only under a limited condition: the cases where Spark cannot use partial aggregates and hash-based ones. However, if the input ordering already satisfies the requirements of sort-based aggregates, it seems the sort-based ones are faster.
> {code}
> ./bin/spark-shell --conf spark.sql.shuffle.partitions=1
> val df = spark.range(1000).selectExpr("id AS key", "id % 10 AS value").sort($"key").cache
> def timer[R](block: => R): R = {
>   val t0 = System.nanoTime()
>   val result = block
>   val t1 = System.nanoTime()
>   println("Elapsed time: " + ((t1 - t0 + 0.0) / 1000000000.0) + "s")
>   result
> }
> timer {
>   df.groupBy("key").count().count
> }
> // codegen'd hash aggregate
> Elapsed time: 7.116962977s
> // non-codegen'd sort aggregate
> Elapsed time: 3.088816662s
> {code}
> If codegen'd sort-based aggregates are supported in SPARK-16844, this seems to make the performance gap bigger:
> {code}
> - codegen'd sort aggregate
> Elapsed time: 1.645234684s
> {code}
> Therefore, it'd be better to use sort-based ones in this case.
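For context on the comment above, a minimal PySpark sketch of the pattern being described (column names are assumptions, not taken from the report): even with an explicit sort on the grouping keys, the physical plan still shows a hash aggregate and a hash-partitioning exchange.

{code:python}
# Sketch only, with assumed column names: pre-sorting by the group keys does
# not change the aggregate Spark plans; explain() still shows HashAggregate
# plus Exchange hashpartitioning.
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()
group_keys = ["key"]

df = (spark.range(1_000_000)
      .withColumn("key", f.col("id") % 10)
      .sort(*group_keys))          # input already sorted by the group keys

df.groupBy(*group_keys).count().explain()
{code}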
[jira] [Created] (SPARK-35717) pandas_udf crashes in conjunction with .filter()
F. H. created SPARK-35717:
--------------------------

             Summary: pandas_udf crashes in conjunction with .filter()
                 Key: SPARK-35717
                 URL: https://issues.apache.org/jira/browse/SPARK-35717
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.1.2, 3.1.1, 3.0.0
         Environment: CentOS 8 with PySpark from conda
            Reporter: F. H.

I wrote the following UDF that always returns a "byte"-typed column:

{code:python}
from typing import Iterator

import pandas as pd
import pyspark.sql.functions as f
import pyspark.sql.types as t


@f.pandas_udf(returnType=t.ByteType())
def spark_gt_mapping_fn(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
    mapping = dict()
    mapping[(-1, -1)] = -1
    mapping[(0, 0)] = 0
    mapping[(0, 1)] = 1
    mapping[(1, 0)] = 1
    mapping[(1, 1)] = 2

    def gt_mapping_fn(v):
        if len(v) != 2:
            return -3
        else:
            a, b = v
            return mapping.get((a, b), -2)

    for x in batch_iter:
        yield x.apply(gt_mapping_fn).astype("int8")
{code}

However, every time I'd like to filter on the resulting column, I get the following error:

{code:python}
# works:
(
    df
    .select(spark_gt_mapping_fn(f.col("genotype.calls")).alias("GT"))
    .limit(10).toPandas()
)

# fails:
(
    df
    .select(spark_gt_mapping_fn(f.col("genotype.calls")).alias("GT"))
    .filter("GT > 0")
    .limit(10).toPandas()
)
{code}

{code:java}
Py4JJavaError: An error occurred while calling o672.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 125) (ouga05.cmm.in.tum.de executor driver): org.apache.spark.util.TaskCompletionListenerException: Memory was leaked by query. Memory leaked: (16384) Allocator(stdin reader for python3) 0/16384/34816/9223372036854775807 (res/actual/peak/limit)
	at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:145)
	at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:124)
	at org.apache.spark.scheduler.Task.run(Task.scala:147)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:472)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:425)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
	at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3519)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at
{code}
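For reference, a self-contained variant of the reproduction can be sketched roughly as below; the inline `calls` array column is an assumption standing in for the reporter's `genotype.calls` data, which is not part of the report.

{code:python}
# Sketch of a self-contained reproduction (assumed data; the report's
# `genotype.calls` column is replaced by a small inline array column).
from typing import Iterator

import pandas as pd
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([0, 0],), ([0, 1],), ([1, 1],)], ["calls"])


@f.pandas_udf(returnType=t.ByteType())
def spark_gt_mapping_fn(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
    mapping = {(-1, -1): -1, (0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 2}

    def gt_mapping_fn(v):
        if len(v) != 2:
            return -3
        a, b = v
        return mapping.get((a, b), -2)

    for x in batch_iter:
        yield x.apply(gt_mapping_fn).astype("int8")


# The select alone works; per the report, adding .filter("GT > 0") is what
# triggers the error on the affected versions.
df.select(spark_gt_mapping_fn(f.col("calls")).alias("GT")).filter("GT > 0").show()
{code}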
[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle
[ https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332011#comment-17332011 ]

F. H. commented on SPARK-5997:
------------------------------

Is there any solution to this issue? I'd like to split my dataset by one key and then further split each partition such that I reach a certain number of partitions.

> Increase partition count without performing a shuffle
> ------------------------------------------------------
>
>                 Key: SPARK-5997
>                 URL: https://issues.apache.org/jira/browse/SPARK-5997
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Andrew Ash
>            Priority: Major
>
> When decreasing partition count with rdd.repartition() or rdd.coalesce(), the user has the ability to choose whether or not to perform a shuffle. However, when increasing partition count there is no option of whether to perform a shuffle or not -- a shuffle always occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call that performs a repartition to a higher partition count without a shuffle.
> The motivating use case is to decrease the size of an individual partition enough that .toLocalIterator has significantly reduced memory pressure on the driver, as it loads a partition at a time into the driver.
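A brief PySpark sketch of the current behavior the issue describes (illustrative only; assumes a SparkSession named `spark`): decreasing the partition count can avoid a shuffle, but increasing it cannot, which is exactly the gap this issue asks to close.

{code:python}
# Sketch of today's API surface (assumes an existing SparkSession `spark`).
rdd = spark.sparkContext.parallelize(range(1000), 4)

fewer = rdd.coalesce(2)        # decrease partition count, no shuffle
more = rdd.repartition(16)     # increase partition count, always shuffles

# The API proposed in this issue would look roughly like
#   rdd.repartition(16, shuffle=False)
# which does not exist today; coalesce() to a larger number simply keeps
# the current partition count.
print(fewer.getNumPartitions(), more.getNumPartitions())
{code}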
[jira] [Created] (SPARK-31374) Returning complex types in Pandas UDF
F. H. created SPARK-31374:
--------------------------

             Summary: Returning complex types in Pandas UDF
                 Key: SPARK-31374
                 URL: https://issues.apache.org/jira/browse/SPARK-31374
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 3.0.0
            Reporter: F. H.

I would like to return a complex type in a GROUPED_AGG operation:

{code:python}
window_overlap_schema = t.StructType([
    t.StructField("counts", t.ArrayType(t.LongType())),
    t.StructField("starts", t.ArrayType(t.LongType())),
    t.StructField("ends", t.ArrayType(t.LongType())),
])


@f.pandas_udf(window_overlap_schema, f.PandasUDFType.GROUPED_AGG)
def spark_window_overlap([...]):
    [...]
{code}

However, I get the following error when trying to run this:

{code:python}
NotImplementedError: Invalid returnType with grouped aggregate Pandas UDFs:
StructType(List(StructField(counts,ArrayType(LongType,true),true),StructField(starts,ArrayType(LongType,true),true),StructField(ends,ArrayType(LongType,true),true))) is not supported
{code}
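One workaround commonly used for this limitation (not taken from the report itself) is groupBy(...).applyInPandas, available since Spark 3.0, which accepts a full output schema and so can emit the three array columns per group. In the sketch below, `df`, the "key" grouping column, and the "start"/"end" input columns are assumptions for illustration.

{code:python}
# Sketch of a possible GROUPED_MAP-style workaround via applyInPandas.
# `df`, "key", "start" and "end" are assumed names, not from the report.
import pandas as pd
import pyspark.sql.types as t

result_schema = t.StructType([
    t.StructField("key", t.LongType()),
    t.StructField("counts", t.ArrayType(t.LongType())),
    t.StructField("starts", t.ArrayType(t.LongType())),
    t.StructField("ends", t.ArrayType(t.LongType())),
])


def window_overlap(pdf: pd.DataFrame) -> pd.DataFrame:
    # One output row per group, with the per-group values collected into arrays.
    return pd.DataFrame({
        "key": [pdf["key"].iloc[0]],
        "counts": [pdf["count"].tolist()],
        "starts": [pdf["start"].tolist()],
        "ends": [pdf["end"].tolist()],
    })


result = df.groupBy("key").applyInPandas(window_overlap, schema=result_schema)
{code}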