Take a look at the examples here:
https://spark.apache.org/docs/latest/ml-guide.html

Mohammed

From: rishikesh thakur [mailto:rishikeshtha...@hotmail.com]
Sent: Saturday, July 4, 2015 10:49 PM
To: ayan guha; Michal Čizmazia
Cc: user
Subject: RE: Feature Generation On Spark

I have one document per file and each file is to be converted to a feature 
vector. Pretty much like standard feature construction for document 
classification.

Thanks
Rishi
________________________________
Date: Sun, 5 Jul 2015 01:44:04 +1000
Subject: Re: Feature Generation On Spark
From: guha.a...@gmail.com<mailto:guha.a...@gmail.com>
To: mici...@gmail.com<mailto:mici...@gmail.com>
CC: rishikeshtha...@hotmail.com<mailto:rishikeshtha...@hotmail.com>; 
user@spark.apache.org<mailto:user@spark.apache.org>
Do you have one document per file or multiple document in the file?
On 4 Jul 2015 23:38, "Michal Čizmazia" 
<mici...@gmail.com<mailto:mici...@gmail.com>> wrote:
Spark Context has a method wholeTextFiles. Is that what you need?

On 4 July 2015 at 07:04, rishikesh 
<rishikeshtha...@hotmail.com<mailto:rishikeshtha...@hotmail.com>> wrote:
> Hi
>
> I am new to Spark and am working on document classification. Before model
> fitting I need to do feature generation. Each document is to be converted to
> a feature vector. However I am not sure how to do that. While testing
> locally I have a static list of tokens and when I parse a file I do a lookup
> and increment counters.
>
> In the case of Spark I can create an RDD which loads all the documents
> however I am not sure if one files goes to one executor or multiple. If the
> file is split then the feature vectors needs to be merged. But I am not able
> to figure out how to do that.
>
> Thanks
> Rishi
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Feature-Generation-On-Spark-tp23617.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: 
> user-unsubscr...@spark.apache.org<mailto:user-unsubscr...@spark.apache.org>
> For additional commands, e-mail: 
> user-h...@spark.apache.org<mailto:user-h...@spark.apache.org>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org<mailto:user-unsubscr...@spark.apache.org>
For additional commands, e-mail: 
user-h...@spark.apache.org<mailto:user-h...@spark.apache.org>

Reply via email to