[ 
https://issues.apache.org/jira/browse/SPARK-46636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shubham Patil updated SPARK-46636:
----------------------------------
    Affects Version/s: 3.4.1
                           (was: 3.3.4)

> Pyspark throwing TypeError while collecting a RDD
> -------------------------------------------------
>
>                 Key: SPARK-46636
>                 URL: https://issues.apache.org/jira/browse/SPARK-46636
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.4.1
>         Environment: Running this in anaconda jupyter notebook
> Python== 3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) 
> [MSC v.1916 64 bit (AMD64)]
> Spark== 3.3.4
> pyspark== 3.4.1
>            Reporter: Shubham Patil
>            Priority: Major
>
> I'm trying to collect an RDD after applying a filter on it, but it is throwing an error.
>  
> The error can be reproduced with the code below:
> {code:python}
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.master("local[*]").appName("Practice").getOrCreate()
> sc = spark.sparkContext
>
> data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
> dataRdd = sc.parallelize(data)
> dataRdd = dataRdd.filter(lambda a: a % 2 == 0)
> dataRdd.collect()
> {code}
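>  
> For comparison, on a setup where the pyspark package and the Spark JVM agree, the same filter-and-collect would be expected to return the even elements. This expected result follows from the lambda itself and is not stated in the report; a minimal plain-Python sketch of the equivalent filtering:
> {code:python}
> # Same filtering logic, expected behaviour on a working installation (assumption)
> data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
> evens = [a for a in data if a % 2 == 0]
> print(evens)  # [2, 4, 6, 8, 10, 12] -- what dataRdd.collect() should produce
> {code}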
> Below is the error that it throws:
>  
> {code:python}
> ---------------------------------------------------------------------------
> TypeError                                 Traceback (most recent call last)
> Cell In[18], line 1
> ----> 1 dataRdd.collect()
>
> File ~\anaconda3\envs\spark_latest\Lib\site-packages\pyspark\rdd.py:1814, in RDD.collect(self)
>    1812 with SCCallSiteSync(self.context):
>    1813     assert self.ctx._jvm is not None
> -> 1814     sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
>    1815 return list(_load_from_socket(sock_info, self._jrdd_deserializer))
>
> File ~\anaconda3\envs\spark_latest\Lib\site-packages\pyspark\rdd.py:5441, in PipelinedRDD._jrdd(self)
>    5438 else:
>    5439     profiler = None
> -> 5441 wrapped_func = _wrap_function(
>    5442     self.ctx, self.func, self._prev_jrdd_deserializer, self._jrdd_deserializer, profiler
>    5443 )
>    5445 assert self.ctx._jvm is not None
>    5446 python_rdd = self.ctx._jvm.PythonRDD(
>    5447     self._prev_jrdd.rdd(), wrapped_func, self.preservesPartitioning, self.is_barrier
>    5448 )
>
> File ~\anaconda3\envs\spark_latest\Lib\site-packages\pyspark\rdd.py:5243, in _wrap_function(sc, func, deserializer, serializer, profiler)
>    5241 pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
>    5242 assert sc._jvm is not None
> -> 5243 return sc._jvm.SimplePythonFunction(
>    5244     bytearray(pickled_command),
>    5245     env,
>    5246     includes,
>    5247     sc.pythonExec,
>    5248     sc.pythonVer,
>    5249     broadcast_vars,
>    5250     sc._javaAccumulator,
>    5251 )
>
> TypeError: 'JavaPackage' object is not callable
> {code}
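>  
> The environment above lists Spark 3.3.4 alongside pyspark 3.4.1. A minimal sketch for checking which versions actually meet inside the failing session, using only the standard pyspark.__version__ and spark.version attributes; treating a mismatch as the trigger for this error is an assumption, not something confirmed in the report:
> {code:python}
> import pyspark
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.master("local[*]").appName("VersionCheck").getOrCreate()
>
> # Version of the pyspark Python package installed in the environment
> python_side = pyspark.__version__
> # Version reported by the Spark JVM the session is connected to
> jvm_side = spark.version
>
> print("pyspark package:", python_side)
> print("Spark JVM:      ", jvm_side)
>
> # Assumption (not confirmed in the report): if these two differ, e.g. 3.4.1 vs
> # 3.3.4, the JVM may lack classes the Python side expects, and a missing JVM
> # class surfaces as "'JavaPackage' object is not callable".
> if python_side != jvm_side:
>     print("pyspark / Spark JVM version mismatch")
> {code}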
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
