[jira] [Updated] (SPARK-39817) Missing sbin scripts in PySpark packages

2022-07-19 Thread F. H. (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

F. H. updated SPARK-39817:
--
Description: 
In the PySpark setup.py, only a subset of the sbin scripts is included.
In particular, I'm missing the `submit-all.sh` script:
{code:python}
package_data={
    'pyspark.jars': ['*.jar'],
    'pyspark.bin': ['*'],
    'pyspark.sbin': ['spark-config.sh', 'spark-daemon.sh',
                     'start-history-server.sh',
                     'stop-history-server.sh'],

    [...]
},
{code}
 

The fix is simple: change 'pyspark.sbin' to:
{code:python}
'pyspark.sbin': ['*'],
{code}
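
For anyone who wants to check which scripts actually ship in an installed wheel, here is a minimal sketch (assuming a pip-installed pyspark; not part of the original report) that lists the bundled sbin files:

{code:python}
# List the sbin scripts bundled with an installed pyspark package.
import os
import pyspark

sbin_dir = os.path.join(os.path.dirname(pyspark.__file__), "sbin")
print(sorted(os.listdir(sbin_dir)))
{code}

With the package_data shown above, only the explicitly listed sbin scripts end up in the wheel; after changing 'pyspark.sbin' to ['*'], the whole sbin directory would be packaged.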
 

I would happily submit a PR on GitHub, but I'm not familiar with the contribution process.

It would be great to have this backported to PySpark 3.2.x and 3.3.x soon.



> Missing sbin scripts in PySpark packages
> 
>
> Key: SPARK-39817
> URL: https://issues.apache.org/jira/browse/SPARK-39817
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: F. H.
>Priority: Major
>  Labels: easyfix
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> In the PySpark setup.py, only a subset of the sbin scripts is included.
> In particular, I'm missing the `submit-all.sh` script:
> {code:python}
> package_data={
>     'pyspark.jars': ['*.jar'],
>     'pyspark.bin': ['*'],
>     'pyspark.sbin': ['spark-config.sh', 'spark-daemon.sh',
>                      'start-history-server.sh',
>                      'stop-history-server.sh'],
>     [...]
> },
> {code}
>  
> The fix is simple: change 'pyspark.sbin' to:
> {code:python}
> 'pyspark.sbin': ['*'],
> {code}
>  
> I would happily submit a PR on GitHub, but I'm not familiar with the
> contribution process.
> It would be great to have this backported to PySpark 3.2.x and 3.3.x soon.






[jira] [Created] (SPARK-39817) Missing sbin scripts in PySpark packages

2022-07-19 Thread F. H. (Jira)
F. H. created SPARK-39817:
-

 Summary: Missing sbin scripts in PySpark packages
 Key: SPARK-39817
 URL: https://issues.apache.org/jira/browse/SPARK-39817
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.2, 3.3.0, 3.2.1, 3.2.0
Reporter: F. H.


In the PySpark setup.py, only a subset of the sbin scripts is included.
In particular, I'm missing the `submit-all.sh` script:
{code:python}
package_data={
    'pyspark.jars': ['*.jar'],
    'pyspark.bin': ['*'],
    'pyspark.sbin': ['spark-config.sh', 'spark-daemon.sh',
                     'start-history-server.sh',
                     'stop-history-server.sh'],

    [...]
},
{code}

The fix is simple: change 'pyspark.sbin' to:
{code:python}
'pyspark.sbin': ['*'],
{code}

I would happily submit a PR on GitHub, but I'm not familiar with the contribution process.

It would be great to have this backported to PySpark 3.2.x and 3.3.x soon.






[jira] [Commented] (SPARK-18591) Replace hash-based aggregates with sort-based ones if inputs already sorted

2022-01-14 Thread F. H. (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476233#comment-17476233
 ] 

F. H. commented on SPARK-18591:
---

I just found the same behavior. Even if I put ".sort(*group_keys)" directly 
before the groupBy, Spark still hash-repartitions the data. Is there a solution to this?
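
For reference, a minimal sketch of the pattern I mean (df and group_keys are placeholders, not taken from this issue): even with an explicit sort on the grouping keys, the physical plan still contains a hash aggregate over a hash-partitioned exchange.

{code:python}
# Placeholder reproduction sketch; 'df' is assumed to be an existing DataFrame.
group_keys = ["key"]

agg_df = (
    df
    .sort(*group_keys)      # pre-sort on the grouping keys
    .groupBy(*group_keys)   # still planned as HashAggregate
    .count()
)
agg_df.explain()            # plan shows HashAggregate / Exchange hashpartitioning
{code}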

> Replace hash-based aggregates with sort-based ones if inputs already sorted
> ---
>
> Key: SPARK-18591
> URL: https://issues.apache.org/jira/browse/SPARK-18591
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Takeshi Yamamuro
>Priority: Major
>  Labels: bulk-closed
>
> Spark currently uses sort-based aggregates only under a limited condition: the
> cases where Spark cannot use partial aggregates and hash-based ones.
> However, if the input ordering already satisfies the requirements of
> sort-based aggregates, sort-based ones seem to be faster than hash-based ones.
> {code}
> ./bin/spark-shell --conf spark.sql.shuffle.partitions=1
> val df = spark.range(1000).selectExpr("id AS key", "id % 10 AS 
> value").sort($"key").cache
> def timer[R](block: => R): R = {
>   val t0 = System.nanoTime()
>   val result = block
>   val t1 = System.nanoTime()
>   println("Elapsed time: " + ((t1 - t0 + 0.0) / 10.0)+ "s")
>   result
> }
> timer {
>   df.groupBy("key").count().count
> }
> // codegen'd hash aggregate
> Elapsed time: 7.116962977s
> // non-codegen'd sort aggregate
> Elapsed time: 3.088816662s
> {code}
> If codegen'd sort-based aggregates are supported in SPARK-16844, this seems
> to make the performance gap even bigger:
> {code}
> - codegen'd sort aggregate
> Elapsed time: 1.645234684s
> {code} 
> Therefore, it'd be better to use sort-based ones in this case.






[jira] [Created] (SPARK-35717) pandas_udf crashes in conjunction with .filter()

2021-06-10 Thread F. H. (Jira)
F. H. created SPARK-35717:
-

 Summary: pandas_udf crashes in conjunction with .filter()
 Key: SPARK-35717
 URL: https://issues.apache.org/jira/browse/SPARK-35717
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.1.2, 3.1.1, 3.0.0
 Environment: Centos 8 with PySpark from conda
Reporter: F. H.


I wrote the following Pandas UDF, which always returns a "byte"-typed column:

 
{code:python}
from typing import Iterator

import pandas as pd
import pyspark.sql.functions as f
import pyspark.sql.types as t


@f.pandas_udf(returnType=t.ByteType())
def spark_gt_mapping_fn(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # map genotype call pairs to a single byte code
    mapping = dict()
    mapping[(-1, -1)] = -1
    mapping[(0, 0)] = 0
    mapping[(0, 1)] = 1
    mapping[(1, 0)] = 1
    mapping[(1, 1)] = 2

    def gt_mapping_fn(v):
        if len(v) != 2:
            return -3
        else:
            a, b = v
            return mapping.get((a, b), -2)

    for x in batch_iter:
        yield x.apply(gt_mapping_fn).astype("int8")
{code}
 

However, whenever I filter on the resulting column, I get the following error:
{code:python}
# works:
(
    df
    .select(spark_gt_mapping_fn(f.col("genotype.calls")).alias("GT"))
    .limit(10).toPandas()
)

# fails:
(
    df
    .select(spark_gt_mapping_fn(f.col("genotype.calls")).alias("GT"))
    .filter("GT > 0")
    .limit(10).toPandas()
)
{code}
{code:java}
Py4JJavaError: An error occurred while calling o672.collectToPython. : 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 
125) (ouga05.cmm.in.tum.de executor driver): 
org.apache.spark.util.TaskCompletionListenerException: Memory was leaked by 
query. Memory leaked: (16384) Allocator(stdin reader for python3) 
0/16384/34816/9223372036854775807 (res/actual/peak/limit) at 
org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:145) at 
org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:124) 
at org.apache.spark.scheduler.Task.run(Task.scala:147) at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
 at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
 at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
 at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206) at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
 at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
 at scala.Option.foreach(Option.scala:407) at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868) at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:2196) at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:2217) at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:2236) at 
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:472) at 
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:425) at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47) 
at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3519) 
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687) at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
 at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at 

[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle

2021-04-26 Thread F. H. (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332011#comment-17332011
 ] 

F. H. commented on SPARK-5997:
--

Is there any solution to this issue?

I'd like to split my dataset by one key and then further split each partition
so that I reach a certain number of partitions.
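
For context, a minimal sketch of the current RDD behaviour this issue is about ('sc' is an existing SparkContext; the numbers are placeholders): coalesce without a shuffle can only reduce the partition count, while repartitioning to a higher count always shuffles.

{code:python}
rdd = sc.parallelize(range(1000), 4)

rdd.coalesce(2).getNumPartitions()      # 2, no shuffle
rdd.coalesce(16).getNumPartitions()     # still 4: cannot grow without a shuffle
rdd.repartition(16).getNumPartitions()  # 16, but a full shuffle happens
{code}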

> Increase partition count without performing a shuffle
> -
>
> Key: SPARK-5997
> URL: https://issues.apache.org/jira/browse/SPARK-5997
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Andrew Ash
>Priority: Major
>
> When decreasing the partition count with rdd.repartition() or rdd.coalesce(),
> the user can choose whether or not to perform a shuffle. However, when
> increasing the partition count there is no such choice -- a shuffle always
> occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call 
> that performs a repartition to a higher partition count without a shuffle.
> The motivating use case is to decrease the size of an individual partition 
> enough that the .toLocalIterator has significantly reduced memory pressure on 
> the driver, as it loads a partition at a time into the driver.






[jira] [Created] (SPARK-31374) Returning complex types in Pandas UDF

2020-04-07 Thread F. H. (Jira)
F. H. created SPARK-31374:
-

 Summary: Returning complex types in Pandas UDF
 Key: SPARK-31374
 URL: https://issues.apache.org/jira/browse/SPARK-31374
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
Reporter: F. H.


I would like to return a complex type from a GROUPED_AGG operation:

{code:python}
import pyspark.sql.functions as f
import pyspark.sql.types as t

window_overlap_schema = t.StructType([
    t.StructField("counts", t.ArrayType(t.LongType())),
    t.StructField("starts", t.ArrayType(t.LongType())),
    t.StructField("ends", t.ArrayType(t.LongType())),
])

@f.pandas_udf(window_overlap_schema, f.PandasUDFType.GROUPED_AGG)
def spark_window_overlap([...]):
    [...]
{code}


However, I get the following error when trying to run this:

{code:python}
NotImplementedError: Invalid returnType with grouped aggregate Pandas UDFs: 
StructType(List(StructField(counts,ArrayType(LongType,true),true),StructField(starts,ArrayType(LongType,true),true),StructField(ends,ArrayType(LongType,true),true)))
 is not supported
{code}
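
A possible alternative, as an untested sketch (the column names "group", "count", "start" and "end" are assumptions, not from this issue): a grouped-map applyInPandas can return the arrays as ordinary columns of a one-row-per-group DataFrame, which avoids the struct return type restriction of GROUPED_AGG UDFs.

{code:python}
# Untested sketch: emit one row per group with array columns instead of a struct.
import pandas as pd
import pyspark.sql.types as t

result_schema = t.StructType([
    t.StructField("group", t.StringType()),
    t.StructField("counts", t.ArrayType(t.LongType())),
    t.StructField("starts", t.ArrayType(t.LongType())),
    t.StructField("ends", t.ArrayType(t.LongType())),
])

def window_overlap(pdf: pd.DataFrame) -> pd.DataFrame:
    # Placeholder aggregation; the real logic is elided in the issue ([...]).
    return pd.DataFrame({
        "group":  [pdf["group"].iloc[0]],
        "counts": [pdf["count"].tolist()],
        "starts": [pdf["start"].tolist()],
        "ends":   [pdf["end"].tolist()],
    })

df.groupBy("group").applyInPandas(window_overlap, schema=result_schema)
{code}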


