[ https://issues.apache.org/jira/browse/SPARK-24447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-24447.
----------------------------------
    Resolution: Incomplete

> Pyspark RowMatrix.columnSimilarities() loses spark context
> ----------------------------------------------------------
>
>                 Key: SPARK-24447
>                 URL: https://issues.apache.org/jira/browse/SPARK-24447
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, PySpark
>    Affects Versions: 2.3.0
>            Reporter: Perry Chu
>            Priority: Minor
>              Labels: bulk-closed
>
> The RDD behind the CoordinateMatrix returned by RowMatrix.columnSimilarities() appears to be losing track of the spark context if spark is stopped and restarted in pyspark.
> I'm pretty new to spark - not sure if the problem is on the python side or the scala side - would appreciate someone more experienced taking a look.
> This snippet should reproduce the error:
> {code:java}
> import pyspark
> from pyspark.mllib.linalg.distributed import RowMatrix
> spark.stop()
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]])
> matrix = RowMatrix(rows)
> sims = matrix.columnSimilarities()
> ## This works, prints "3 3" as expected (3 columns = 3x3 matrix)
> print(sims.numRows(), sims.numCols())
> ## This throws an error (stack trace below)
> print(sims.entries.first())
> ## Later I tried this
> print(rows.context)          # <SparkContext master=yarn appName=Spark ML Pipeline>
> print(sims.entries.context)  # <SparkContext master=yarn appName=PySparkShell>, then throws an error
> {code}
> Error stack trace:
> {code:java}
> ---------------------------------------------------------------------------
> AttributeError                            Traceback (most recent call last)
> <ipython-input-47-50f83a6cf449> in <module>()
> ----> 1 sims.entries.first()
>
> /usr/lib/spark/python/pyspark/rdd.py in first(self)
>    1374             ValueError: RDD is empty
>    1375         """
> -> 1376         rs = self.take(1)
>    1377         if rs:
>    1378             return rs[0]
>
> /usr/lib/spark/python/pyspark/rdd.py in take(self, num)
>    1356
>    1357             p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
> -> 1358             res = self.context.runJob(self, takeUpToNumLeft, p)
>    1359
>    1360             items += res
>
> /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
>     999         # SparkContext#runJob.
>    1000         mappedRDD = rdd.mapPartitions(partitionFunc)
> -> 1001         port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
>    1002         return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
>    1003
>
> AttributeError: 'NoneType' object has no attribute 'sc'
> {code}
> PySpark columnSimilarities documentation:
> [http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org