Usually Spark ML models specify the columns they use for training, i.e. you
would select only your feature columns (X) for model training, but metadata
such as target labels (y) or your date column would still be present for each row.
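
A minimal sketch of what I mean, using the DataFrame-based spark.ml API
instead of the RDD-based MLlib one. The column names (ts, x1, x2, features,
scaled, projected, cluster), the toy data and the parameter values are all
made up for illustration, and it is written script-style as you would paste
it into spark-shell; treat it as a starting point rather than a drop-in
replacement for your job:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{PCA, StandardScaler, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("keep-metadata-sketch")
  .getOrCreate()
import spark.implicits._

// Toy data: a timestamp plus two numeric features (hypothetical names and values).
val df = Seq(
  (1505376000L, 1.0, 2.0),
  (1505376060L, 1.1, 2.1),
  (1505376120L, 9.0, 8.0),
  (1505376180L, 9.2, 8.1)
).toDF("ts", "x1", "x2")

// Only x1 and x2 are assembled into the feature vector; ts is not passed to any stage.
val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")

// Scale, project and cluster, each stage reading the previous stage's output column.
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaled")
  .setWithMean(true)
  .setWithStd(true)
val pca = new PCA()
  .setInputCol("scaled")
  .setOutputCol("projected")
  .setK(1)
val kmeans = new KMeans()
  .setFeaturesCol("projected")
  .setPredictionCol("cluster")
  .setK(2)
  .setSeed(1L)

val pipeline = new Pipeline()
  .setStages(Array[PipelineStage](assembler, scaler, pca, kmeans))

// transform() appends columns to each row, so ts stays next to its prediction.
val result = pipeline.fit(df).transform(df)
result.select("ts", "cluster").show()
```

Since every stage only appends columns, each output row still carries its
own ts alongside its cluster, so there is nothing left to zip and the
question of order does not come up.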

<johan.grande....@orange.com> wrote on Thu, 14 Sep 2017 at 10:42:

> In several situations I would like to zip RDDs knowing that their order
> matches. In particular I’m using an MLlib KMeansModel on an RDD of Vectors
> so I would like to do:
>
>
>
> myData.zip(myModel.predict(myData))
>
>
>
> Also the first column in my RDD is a timestamp which I don’t want to be a
> part of the model, so in fact I would like to split the first column out of
> my RDD, then do:
>
>
>
> myData.zip(myModel.predict(myData.map(dropTimestamp)))
>
>
>
> Moreover I’d like my data to be scaled and go through a principal
> component analysis first, so the main steps would be like:
>
>
>
> val noTs = myData.map(dropTimestamp)
>
> val scaled = scaler.transform(noTs)
>
> val projected = (new RowMatrix(scaled)).multiply(principalComponents).rows
>
> val clusters = myModel.predict(projected)
>
> val result = myData.zip(clusters)
>
>
>
> Do you think there’s a chance that the 4 transformations above would
> preserve order so the zip at the end would be correct?
>
>
>
>
>
> On 2017-09-13 19:51 CEST, lucas.g...@gmail.com wrote :
>
>
>
> I'm wondering why you need order preserved. We've had situations where
> keeping the source as an artificial field in the dataset was important, and
> I had to go through contortions to inject that (in this case the data source
> had no unique key).
>
>
>
> Is this similar?
>
>
>
> On 13 September 2017 at 10:46, Suzen, Mehmet <su...@acm.org> wrote:
>
> But what happens if one of the partitions fails? How does fault tolerance
> recover elements in other partitions?
>
>
>
> On 13 Sep 2017 18:39, "Ankit Maloo" <ankitmaloo1...@gmail.com> wrote:
>
> AFAIK, the order of an RDD is maintained within a partition for map
> operations. There is no way a map operation can change the sequence within a
> partition, as a partition is local and computation happens one record at a
> time.
>
>
>
> On 13-Sep-2017 9:54 PM, "Suzen, Mehmet" <su...@acm.org> wrote:
>
> I think order has no meaning in RDDs; see this post, especially regarding
> the zip methods:
> https://stackoverflow.com/questions/29268210/mind-blown-rdd-zip-method
>
>
