[ https://issues.apache.org/jira/browse/SPARK-46636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shubham Patil updated SPARK-46636:
----------------------------------
    Affects Version/s: 3.4.1
                       (was: 3.3.4)

> PySpark throwing TypeError while collecting an RDD
> --------------------------------------------------
>
>                 Key: SPARK-46636
>                 URL: https://issues.apache.org/jira/browse/SPARK-46636
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.4.1
>         Environment: Running this in an Anaconda Jupyter notebook
> Python == 3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)]
> Spark == 3.3.4
> pyspark == 3.4.1
>            Reporter: Shubham Patil
>            Priority: Major
>
> I'm trying to collect an RDD after applying a filter to it, but it throws an error.
>
> The error can be reproduced with the code below:
> {code:python}
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.master("local[*]").appName("Practice").getOrCreate()
> sc = spark.sparkContext
>
> data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
> dataRdd = sc.parallelize(data)
> dataRdd = dataRdd.filter(lambda a: a % 2 == 0)
> dataRdd.collect()
> {code}
> Below is the error it throws:
> {code:none}
> ---------------------------------------------------------------------------
> TypeError                                 Traceback (most recent call last)
> Cell In[18], line 1
> ----> 1 dataRdd.collect()
>
> File ~\anaconda3\envs\spark_latest\Lib\site-packages\pyspark\rdd.py:1814, in RDD.collect(self)
>    1812 with SCCallSiteSync(self.context):
>    1813     assert self.ctx._jvm is not None
> -> 1814     sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
>    1815 return list(_load_from_socket(sock_info, self._jrdd_deserializer))
>
> File ~\anaconda3\envs\spark_latest\Lib\site-packages\pyspark\rdd.py:5441, in PipelinedRDD._jrdd(self)
>    5438 else:
>    5439     profiler = None
> -> 5441 wrapped_func = _wrap_function(
>    5442     self.ctx, self.func, self._prev_jrdd_deserializer, self._jrdd_deserializer, profiler
>    5443 )
>    5445 assert self.ctx._jvm is not None
>    5446 python_rdd = self.ctx._jvm.PythonRDD(
>    5447     self._prev_jrdd.rdd(), wrapped_func, self.preservesPartitioning, self.is_barrier
>    5448 )
>
> File ~\anaconda3\envs\spark_latest\Lib\site-packages\pyspark\rdd.py:5243, in _wrap_function(sc, func, deserializer, serializer, profiler)
>    5241 pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
>    5242 assert sc._jvm is not None
> -> 5243 return sc._jvm.SimplePythonFunction(
>    5244     bytearray(pickled_command),
>    5245     env,
>    5246     includes,
>    5247     sc.pythonExec,
>    5248     sc.pythonVer,
>    5249     broadcast_vars,
>    5250     sc._javaAccumulator,
>    5251 )
> TypeError: 'JavaPackage' object is not callable
> {code}
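>
> This may be related to the version mismatch visible in the environment above (a Spark 3.3.4 installation alongside the pyspark 3.4.1 package): a "'JavaPackage' object is not callable" error typically means py4j could not find the JVM class being constructed (here SimplePythonFunction, which the 3.4.x Python code calls) on the classpath of the running JVM. A minimal sketch for checking this, assuming a plain local session with no extra configuration:
> {code:python}
> import pyspark
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.master("local[*]").appName("VersionCheck").getOrCreate()
> sc = spark.sparkContext
>
> # Version of the Python-side pyspark package.
> print("pyspark package:", pyspark.__version__)
>
> # Version of the Spark JVM the driver actually started.
> print("Spark JVM:      ", sc.version)
>
> # If these differ (e.g. 3.4.1 vs. 3.3.4), the Python code may reference
> # JVM classes that do not exist in that Spark build; py4j then returns a
> # non-callable JavaPackage placeholder, matching the error above.
> {code}
> If the two versions differ, aligning them (installing the pyspark package that matches the Spark installation, or vice versa) would be the usual first step before treating this as a bug.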