hokam chauhan wrote
> So how the string value of categorical variable can be converted into
> double values for forming the features vector ?
Well, the key characteristic of the variables is that their values are not ordered. So the representation you choose has to honor that. If the model is doing some arithmetic on the inputs (e.g. a logistic regression model computes a weighted sum of the inputs) or otherwise assuming an ordering of values, then the appropriate representation is the so-called "one hot" representation, in which a categorical variable of n possible values is represented as a vector of length n, in which exactly one element is 1 and the rest are 0.

Depending on the models you are using, other representations might be possible. But a one-hot representation is widely applicable.

> Also how the weight for individual categories can be calculated for
> models. Like we have Gender as variable with categories as Male and Female
> and we want to give more weight to female category, then how this can be
> accomplished?

Well, it probably depends on exactly what you mean by "more weight". If you mean that one category is under-represented in the available data, and you want to assume, let's say, that each datum in that category ought to count the same as two data in another category, you could just create a data set with an extra copy of those data.

An equivalent method is to allow for weighting the log-likelihood or other goodness of fit function. That's more convenient and flexible (it allows for noninteger weights), but I don't remember if Spark supports that.

If you mean some other kind of weighting, you'll have to explain more about what you're trying to achieve.

> Also is there a way through which string values from raw text can be
> converted to features vector(Apart from the HashingTF-IDF transformation)
> ?

I don't know any other method. Maybe someone else can suggest something.
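
For what it's worth, here is a minimal sketch of the one-hot representation and of the two weighting approaches described above. It is plain Python, not Spark code, and everything in it is an assumption made for illustration: the helper names `one_hot` and `weighted_log_likelihood`, the toy rows, and the predicted probabilities are all made up. It only shows that giving a datum weight 2 in a Bernoulli log-likelihood yields the same value as including an extra copy of that datum.

```python
import math


def one_hot(value, categories):
    """Encode `value` as a vector of len(categories) with exactly one 1.0."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if c == value else 0.0 for c in categories]


def weighted_log_likelihood(probs, labels, weights):
    """Bernoulli log-likelihood with per-datum weights: sum of w * log p(y)."""
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        total += w * math.log(p if y == 1 else 1.0 - p)
    return total


# Hypothetical category list; the order is arbitrary and carries no meaning.
categories = ["Male", "Female"]
male_vec = one_hot("Male", categories)
female_vec = one_hot("Female", categories)

# Made-up toy data: (gender, observed label, model's predicted P(label = 1)).
rows = [("Male", 1, 0.8), ("Female", 0, 0.3), ("Female", 1, 0.6)]

# Approach 1: weight each "Female" datum twice as much in the log-likelihood.
weights = [2.0 if gender == "Female" else 1.0 for gender, _, _ in rows]
ll_weighted = weighted_log_likelihood(
    [p for _, _, p in rows],
    [y for _, y, _ in rows],
    weights,
)

# Approach 2: add an extra copy of each "Female" datum, all with weight 1.
duplicated = rows + [r for r in rows if r[0] == "Female"]
ll_duplicated = weighted_log_likelihood(
    [p for _, _, p in duplicated],
    [y for _, y, _ in duplicated],
    [1.0] * len(duplicated),
)

print(male_vec, female_vec)
print(ll_weighted, ll_duplicated)
```

The weighted form also accepts noninteger weights such as 1.5, which copying rows cannot express.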
best,
Robert Dodier