Re: Spark 1.3.1 Dataframe breaking ALS.train?
The patch was merged and will be included in 1.3.2 and 1.4.0. Thanks for reporting the bug! -Xiangrui

On Tue, Apr 21, 2015 at 2:51 PM, ayan guha wrote:
> Thank you all.
>
> On 22 Apr 2015 04:29, "Xiangrui Meng" wrote:
>> SchemaRDD subclasses RDD in 1.2, but DataFrame is no longer an RDD in
>> 1.3. We should allow DataFrames in ALS.train. I will submit a patch.
>> You can use `ALS.train(training.rdd, ...)` for now as a workaround.
>> -Xiangrui
>>
>> On Tue, Apr 21, 2015 at 10:51 AM, Joseph Bradley wrote:
>>> Hi Ayan,
>>>
>>> If you want to use DataFrame, then you should use the Pipelines API
>>> (org.apache.spark.ml.*), which takes DataFrames:
>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.recommendation.ALS
>>>
>>> In the examples/ directory for ml/, you can find a MovieLensALS example.
>>>
>>> Good luck!
>>> Joseph
>>>
>>> On Tue, Apr 21, 2015 at 4:58 AM, ayan guha wrote:
>>>> Hi
>>>>
>>>> I am getting an error in the mllib ALS.train function when passing a
>>>> dataframe (do I need to convert the DF to an RDD?)
>>>>
>>>> Code:
>>>> training = ssc.sql("select userId,movieId,rating from ratings where
>>>> partitionKey < 6").cache()
>>>> print type(training)
>>>> model = ALS.train(training,rank,numIter,lmbda)
>>>>
>>>> Error:
>>>>
>>>> Traceback (most recent call last):
>>>>   File "D:\Project\Spark\code\movie_sql.py", line 109, in
>>>>     bestConf = getBestModel(sc,ssc,training,validation,validationNoRating)
>>>>   File "D:\Project\Spark\code\movie_sql.py", line 54, in getBestModel
>>>>     model = ALS.train(trainingRDD,rank,numIter,lmbda)
>>>>   File "D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\mllib\recommendation.py", line 139, in train
>>>>     model = callMLlibFunc("trainALSModel", cls._prepare(ratings), rank, iterations,
>>>>   File "D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\mllib\recommendation.py", line 127, in _prepare
>>>>     assert isinstance(ratings, RDD), "ratings should be RDD"
>>>> AssertionError: ratings should be RDD
>>>>
>>>> It was working fine in 1.2.0 (till last night :))
>>>>
>>>> Any solution? I am thinking to map the training dataframe back to an
>>>> RDD, but will lose the schema information.
>>>>
>>>> Best
>>>> Ayan
>>>>
>>>> On Mon, Apr 20, 2015 at 10:23 PM, ayan guha wrote:
>>>>> Hi
>>>>> Just upgraded to Spark 1.3.1.
>>>>>
>>>>> I am getting a warning:
>>>>>
>>>>> Warning (from warnings module):
>>>>>   File "D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\sql\context.py", line 191
>>>>>     warnings.warn("inferSchema is deprecated, please use createDataFrame instead")
>>>>> UserWarning: inferSchema is deprecated, please use createDataFrame instead
>>>>>
>>>>> However, the documentation still says to use inferSchema.
>>>>> Here: http://spark.apache.org/docs/latest/sql-programming-guide.htm in section
>>>>>
>>>>> Also, I am getting an error in the mllib ALS.train function when passing a
>>>>> dataframe (do I need to convert the DF to an RDD?)
>>>>>
>>>>> Code:
>>>>> training = ssc.sql("select userId,movieId,rating from ratings where
>>>>> partitionKey < 6").cache()
>>>>> print type(training)
>>>>> model = ALS.train(training,rank,numIter,lmbda)
>>>>>
>>>>> Error:
>>>>>
>>>>> Rank:8 Lmbda:1.0 iteration:10
>>>>>
>>>>> Traceback (most recent call last):
>>>>>   File "D:\Project\Spark\code\movie_sql.py", line 109, in
>>>>>     bestConf = getBestModel(sc,ssc,training,validation,validationNoRating)
>>>>>   File "D:\Project\Spark\code\movie_sql.py", line 54, in getBestModel
>>>>>     model = ALS.train(trainingRDD,rank,numIter,lmbda)
>>>>>   File "D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\mllib\recommendation.py", line 139, in train
>>>>>     model = callMLlibFunc("trainALSModel", cls._prepare(ratings), rank, iterations,
>>>>>   File "D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\mllib\recommendation.py", line 127, in _prepare
>>>>>     assert isinstance(ratings, RDD), "ratings should be RDD"
>>>>> AssertionError: ratings should be RDD
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Ayan Guha

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
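Xiangrui's workaround (`ALS.train(training.rdd, ...)`) works because `_prepare` only checks the type of the object it receives. A minimal pure-Python sketch, using hypothetical stand-in classes rather than the real pyspark ones, shows why a 1.3 DataFrame trips the assertion in the traceback above while its underlying `.rdd` passes:

```python
# Hypothetical stand-ins for pyspark's classes (illustration only).
# In Spark 1.3, DataFrame no longer subclasses RDD, but it exposes
# the underlying RDD via the .rdd property.
class RDD(object):
    def __init__(self, rows):
        self.rows = rows

class DataFrame(object):
    def __init__(self, rows):
        self._rdd = RDD(rows)

    @property
    def rdd(self):
        return self._rdd

def _prepare(ratings):
    # Mirrors the check in pyspark/mllib/recommendation.py (1.3.1, line 127).
    assert isinstance(ratings, RDD), "ratings should be RDD"
    return ratings

training = DataFrame([(1, 10, 4.0)])

# Passing the DataFrame itself trips the assertion, as in the traceback.
try:
    _prepare(training)
    failed = False
except AssertionError:
    failed = True

# The workaround: hand over the underlying RDD instead.
prepared = _prepare(training.rdd)
```

Once the merged patch ships in 1.3.2/1.4.0, `ALS.train` accepts the DataFrame directly; until then `training.rdd` bridges the gap.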
Re: Spark 1.3.1 Dataframe breaking ALS.train?
Thank you all.

On 22 Apr 2015 04:29, "Xiangrui Meng" wrote:
> SchemaRDD subclasses RDD in 1.2, but DataFrame is no longer an RDD in
> 1.3. We should allow DataFrames in ALS.train. I will submit a patch.
> You can use `ALS.train(training.rdd, ...)` for now as a workaround.
> -Xiangrui
Re: Spark 1.3.1 Dataframe breaking ALS.train?
SchemaRDD subclasses RDD in 1.2, but DataFrame is no longer an RDD in
1.3. We should allow DataFrames in ALS.train. I will submit a patch.
You can use `ALS.train(training.rdd, ...)` for now as a workaround.

-Xiangrui
Re: Spark 1.3.1 Dataframe breaking ALS.train?
Hi Ayan,

If you want to use DataFrame, then you should use the Pipelines API
(org.apache.spark.ml.*), which takes DataFrames:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.recommendation.ALS

In the examples/ directory for ml/, you can find a MovieLensALS example.

Good luck!
Joseph