Re: I Want to Help with MLlib Migration

2018-02-16 Thread Yacine Mazari
Thanks for the clarification and suggestion @weichen.
I will try to benchmark it and share the results for discussion.

Regards,
Yacine.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: I Want to Help with MLlib Migration

2018-02-16 Thread Weichen Xu
>>The goal is to have these algorithms implemented using the Dataset API.
Currently, the implementation of these classes/algorithms uses RDDs by
wrapping the old (mllib) classes, which will eventually be deprecated (and
deleted).

Each algorithm needs discussion and testing before that is done. Simply
migrating to a DataFrame implementation may introduce a performance
regression.
If you have already implemented some algorithms on the DataFrame API and
found that this improves performance, you can create a JIRA and I will join
the discussion.
Thanks!
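
To make the "benchmark before migrating" suggestion concrete, here is a
minimal, generic timing harness in plain Python. It is only a sketch and not
Spark code: the names benchmark, compare, sum_squares_loop and
sum_squares_builtin are hypothetical stand-ins for an existing RDD-based
implementation and a DataFrame-based candidate. It first checks that both
implementations agree and only then compares their best wall-clock times.

```python
import time


def benchmark(fn, data, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` runs of fn(data)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        best = min(best, time.perf_counter() - start)
    return best


def compare(baseline, candidate, data, repeats=5):
    """Check that both implementations agree, then time each of them."""
    if baseline(data) != candidate(data):
        raise ValueError("implementations disagree; a timing comparison is meaningless")
    base_t = benchmark(baseline, data, repeats)
    cand_t = benchmark(candidate, data, repeats)
    # Guard against a zero reading on clocks with coarse resolution.
    speedup = base_t / cand_t if cand_t > 0 else float("inf")
    return {"baseline_s": base_t, "candidate_s": cand_t, "speedup": speedup}


# Hypothetical stand-ins for the "old" and "new" implementations.
def sum_squares_loop(xs):
    total = 0
    for x in xs:
        total += x * x
    return total


def sum_squares_builtin(xs):
    return sum(x * x for x in xs)


data = list(range(10_000))
report = compare(sum_squares_loop, sum_squares_builtin, data)
print(report)
```

A real comparison for a Spark algorithm would of course need to run on a
cluster with realistic data sizes and partitioning; the point here is only
the shape of the check: same result first, timing second.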


On Thu, Feb 15, 2018 at 10:39 PM, Yacine Mazari  wrote:

> Thanks for the reply @srowen.
>
> >>I don't think you can move or alter the class APIs.
> Agreed. That's not my intention at all.
>
> >>There also isn't much value in copying the code. Maybe there are
> opportunities for moving some internal code.
> There will probably be some copying and moving of internal code, but that
> is not the main purpose.
> The goal is to have these algorithms implemented using the Dataset API.
> Currently, the implementation of these classes/algorithms uses RDDs by
> wrapping the old (mllib) classes, which will eventually be deprecated (and
> deleted).
>
> >>But in general I think all this has to wait.
> Do you have any schedule or plan in mind? If deprecation is targeted for
> 3.0, then we roughly have 1.5 years.
> On the other hand, the current situation prevents us from making
> improvements to the existing classes. For example, I'd like to add
> maxDocFreq to ml.feature.IDF to make it similar to scikit-learn, but that's
> hard to do because it's just a wrapper around mllib.feature.IDF.
>
>
> Thank you for the discussion.
> Yacine.


Re: I Want to Help with MLlib Migration

2018-02-15 Thread Yacine Mazari
Thanks for the reply @srowen.

>>I don't think you can move or alter the class APIs.
Agreed. That's not my intention at all.

>>There also isn't much value in copying the code. Maybe there are
opportunities for moving some internal code.
There will probably be some copying and moving of internal code, but that is
not the main purpose.
The goal is to have these algorithms implemented using the Dataset API. 
Currently, the implementation of these classes/algorithms uses RDDs by
wrapping the old (mllib) classes, which will eventually be deprecated (and
deleted).

>>But in general I think all this has to wait. 
Do you have any schedule or plan in mind? If deprecation is targeted for
3.0, then we roughly have 1.5 years.
On the other hand, the current situation prevents us from making
improvements to the existing classes. For example, I'd like to add maxDocFreq
to ml.feature.IDF to make it similar to scikit-learn, but that's hard to do
because it's just a wrapper around mllib.feature.IDF.
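
As a rough illustration of the maxDocFreq idea, here is a minimal pure-Python
sketch, not Spark code. It assumes the smoothed formula
idf = log((m + 1) / (df + 1)) that the Spark IDF documentation describes,
where m is the number of documents and df is the number of documents
containing the term, and it assumes that filtered terms get a weight of 0,
as minDocFreq does today. The max_doc_freq parameter is hypothetical (it is
the proposed addition, similar in spirit to scikit-learn's max_df), and the
function names are made up for the example.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # set(): count each term at most once per document
    return df


def idf_weights(docs, min_doc_freq=0, max_doc_freq=None):
    """Smoothed IDF per term; terms outside the document-frequency bounds get 0.0.

    max_doc_freq is a hypothetical parameter, not part of the current API.
    """
    m = len(docs)
    weights = {}
    for term, d in document_frequencies(docs).items():
        too_rare = d < min_doc_freq
        too_common = max_doc_freq is not None and d > max_doc_freq
        if too_rare or too_common:
            weights[term] = 0.0
        else:
            weights[term] = math.log((m + 1) / (d + 1))
    return weights


docs = [
    ["spark", "ml", "idf"],
    ["spark", "mllib", "idf"],
    ["spark", "dataset"],
    ["rdd", "dataset"],
]

# "spark" appears in 3 of 4 documents, so a cap of 2 zeroes its weight.
weights = idf_weights(docs, min_doc_freq=1, max_doc_freq=2)
print(weights)
```

In a wrapper-based design, a parameter like this has to be threaded through
the wrapped implementation as well, which is the difficulty described above.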


Thank you for the discussion.
Yacine.






Re: I Want to Help with MLlib Migration

2018-02-15 Thread Sean Owen
I don't think you can move or alter the class APIs. There also isn't much
value in copying the code. Maybe there are opportunities for moving some
internal code. But in general I think all this has to wait.

On Thu, Feb 15, 2018, 5:17 AM Yacine Mazari  wrote:

> Hi,
>
> I see that many classes under "org.apache.spark.ml" are still referring to
> the "org.apache.spark.mllib" implementation.
>
> While there is still time before the deprecation deadline of version 3.0,
> having these dependencies makes it impossible or difficult to make
> improvements to these classes.
>
> I see that IDF and HashingTF are being migrated in SPARK-22531 and
> SPARK-21748.
>
> 1) Should I go ahead and start with one of the following: BisectingKMeans,
> KMeans, LDA, PCA, Word2Vec, FPGrowth? I don't see any ongoing work on them.
>
> 2) I suggest creating an umbrella ticket to track this migration. (I haven't
> seen one.)
>
> What do you think?
>
> Best Regards,
> Yacine.


I Want to Help with MLlib Migration

2018-02-15 Thread Yacine Mazari
Hi,

I see that many classes under "org.apache.spark.ml" are still referring to
the "org.apache.spark.mllib" implementation.

While there is still time before the deprecation deadline of version 3.0,
having these dependencies makes it impossible or difficult to make
improvements to these classes.

I see that IDF and HashingTF are being migrated in SPARK-22531 and
SPARK-21748.

1) Should I go ahead and start with one of the following: BisectingKMeans,
KMeans, LDA, PCA, Word2Vec, FPGrowth? I don't see any ongoing work on them.

2) I suggest creating an umbrella ticket to track this migration. (I haven't
seen one.)

What do you think?

Best Regards,
Yacine.


