Re: Spark 1.3.1 Dataframe breaking ALS.train?

2015-04-22 Thread Xiangrui Meng
The patch was merged and will be included in 1.3.2 and 1.4.0.
Thanks for reporting the bug! -Xiangrui
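
As a concrete illustration of the workaround quoted below, a minimal pyspark
sketch reusing the variable names (ssc, training, rank, numIter, lmbda) from
the code in this thread:

from pyspark.mllib.recommendation import ALS

# training is the DataFrame returned by ssc.sql(...) as in the code below.
# DataFrame.rdd exposes the underlying RDD of Row objects; Rows behave like
# tuples, so ALS.train can convert them to Ratings as long as the columns
# are ordered (user, item, rating).
model = ALS.train(training.rdd, rank, numIter, lmbda)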

On Tue, Apr 21, 2015 at 2:51 PM, ayan guha  wrote:
> Thank you all.
>
> On 22 Apr 2015 04:29, "Xiangrui Meng"  wrote:
>>
>> SchemaRDD subclasses RDD in 1.2, but DataFrame is no longer an RDD in
>> 1.3. We should allow DataFrames in ALS.train. I will submit a patch.
>> You can use `ALS.train(training.rdd, ...)` for now as a workaround.
>> -Xiangrui
>>
>> On Tue, Apr 21, 2015 at 10:51 AM, Joseph Bradley 
>> wrote:
>> > Hi Ayan,
>> >
>> > If you want to use DataFrame, then you should use the Pipelines API
>> > (org.apache.spark.ml.*) which will take DataFrames:
>> >
>> > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.recommendation.ALS
>> >
>> > In the examples/ directory for ml/, you can find a MovieLensALS example.
>> >
>> > Good luck!
>> > Joseph
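
A minimal pyspark sketch of the DataFrame-based ALS that Joseph describes,
assuming Spark 1.4+ (where pyspark.ml.recommendation.ALS is available) and
the userId/movieId/rating columns and rank/iteration/lambda values used
elsewhere in this thread:

from pyspark.ml.recommendation import ALS

# Build the estimator against DataFrame columns rather than an RDD of Ratings.
als = ALS(rank=8, maxIter=10, regParam=1.0,
          userCol="userId", itemCol="movieId", ratingCol="rating")
model = als.fit(training)                  # training is a DataFrame
predictions = model.transform(validation)  # validation is also a DataFrame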
>> >
>> > On Tue, Apr 21, 2015 at 4:58 AM, ayan guha  wrote:
>> >>
>> >> Hi
>> >>
>> >> I am getting an error in the mllib ALS.train function when passing a
>> >> DataFrame (do I need to convert the DF to an RDD?)
>> >>
>> >> Code:
>> >> training = ssc.sql("select userId,movieId,rating from ratings where
>> >> partitionKey < 6").cache()
>> >> print type(training)
>> >> model = ALS.train(training,rank,numIter,lmbda)
>> >>
>> >> Error:
>> >> 
>> >>
>> >> Traceback (most recent call last):
>> >>   File "D:\Project\Spark\code\movie_sql.py", line 109, in 
>> >> bestConf =
>> >> getBestModel(sc,ssc,training,validation,validationNoRating)
>> >>   File "D:\Project\Spark\code\movie_sql.py", line 54, in getBestModel
>> >> model = ALS.train(trainingRDD,rank,numIter,lmbda)
>> >>   File
>> >>
>> >> "D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\mllib\recommendation.py",
>> >> line 139, in train
>> >> model = callMLlibFunc("trainALSModel", cls._prepare(ratings), rank,
>> >> iterations,
>> >>   File
>> >>
>> >> "D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\mllib\recommendation.py",
>> >> line 127, in _prepare
>> >> assert isinstance(ratings, RDD), "ratings should be RDD"
>> >> AssertionError: ratings should be RDD
>> >>
>> >> It was working fine in 1.2.0 (till last night :))
>> >>
>> >> Any solution? I am thinking of mapping the training DataFrame back to
>> >> an RDD, but then I will lose the schema information.
>> >>
>> >> Best
>> >> Ayan
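
If the conversion is done explicitly, a minimal sketch (using the
userId/movieId/rating columns from the query above) that maps the DataFrame
to an RDD of Ratings by field name, so the schema is at least used for the
mapping even though the resulting RDD no longer carries it:

from pyspark.mllib.recommendation import ALS, Rating

# Row objects keep their field names, so select the columns by name rather
# than by position before handing the data to mllib ALS.
ratings_rdd = training.rdd.map(
    lambda r: Rating(int(r.userId), int(r.movieId), float(r.rating)))
model = ALS.train(ratings_rdd, rank, numIter, lmbda)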
>> >>
>> >> On Mon, Apr 20, 2015 at 10:23 PM, ayan guha 
>> >> wrote:
>> >>>
>> >>> Hi
>> >>> Just upgraded to Spark 1.3.1.
>> >>>
>> >>> I am getting a warning
>> >>>
>> >>> Warning (from warnings module):
>> >>>   File
>> >>>
>> >>> "D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\sql\context.py",
>> >>> line 191
>> >>> warnings.warn("inferSchema is deprecated, please use
>> >>> createDataFrame
>> >>> instead")
>> >>> UserWarning: inferSchema is deprecated, please use createDataFrame
>> >>> instead
>> >>>
>> >>> However, the documentation still says to use inferSchema; see
>> >>> http://spark.apache.org/docs/latest/sql-programming-guide.htm
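
For the deprecation warning above, a minimal sketch of the 1.3-style
replacement; the file name and parsing are illustrative only, and sc and ssc
are assumed to be the SparkContext and SQLContext used in this thread's code:

from pyspark.sql import Row

rows = sc.textFile("ratings.dat") \
    .map(lambda line: line.split(",")) \
    .map(lambda p: Row(userId=int(p[0]), movieId=int(p[1]), rating=float(p[2])))

# Spark 1.2 style (deprecated): ratings_df = ssc.inferSchema(rows)
ratings_df = ssc.createDataFrame(rows)   # Spark 1.3+
ratings_df.registerTempTable("ratings")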
>> >>>
>> >>> Also, I am getting an error in the mllib ALS.train function when
>> >>> passing a DataFrame (do I need to convert the DF to an RDD?)
>> >>>
>> >>> Code:
>> >>> training = ssc.sql("select userId,movieId,rating from ratings where
>> >>> partitionKey < 6").cache()
>> >>> print type(training)
>> >>> model = ALS.train(training,rank,numIter,lmbda)
>> >>>
>> >>> Error:
>> >>> 
>> >>> Rank:8 Lmbda:1.0 iteration:10
>> >>>
>> >>> Traceback (most recent call last):
>> >>>   File "D:\Project\Spark\code\movie_sql.py", line 109, in 
>> >>> bestConf =
>> >>> getBestModel(sc,ssc,training,validation,validationNoRating)
>> >>>   File "D:\Project\Spark\code\movie_sql.py", line 54, in getBestModel
>> >>> model = ALS.train(trainingRDD,rank,numIter,lmbda)
>> >>>   File
>> >>>
>> >>> "D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\mllib\recommendation.py",
>> >>> line 139, in train
>> >>> model = callMLlibFunc("trainALSModel", cls._prepare(ratings),
>> >>> rank,
>> >>> iterations,
>> >>>   File
>> >>>
>> >>> "D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\mllib\recommendation.py",
>> >>> line 127, in _prepare
>> >>> assert isinstance(ratings, RDD), "ratings should be RDD"
>> >>> AssertionError: ratings should be RDD
>> >>>
>> >>> --
>> >>> Best Regards,
>> >>> Ayan Guha
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Best Regards,
>> >> Ayan Guha
>> >
>> >

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


