Pravesh,

Correct: the logistic regression engine is set up for classification tasks. It takes feature vectors (arrays of real-valued numbers), each given a class label, and learns a linear combination of those features that divides the classes. As the commenters above have mentioned, there are lots of different ways to turn string data into feature vectors.
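To make "a linear combination of those features" concrete, here is a minimal pure-Python sketch of how a fitted logistic regression model scores one feature vector. It is not Spark's API: the function name `predict_proba` and the weight values are made up for illustration, and training (finding the weights) is left out entirely.

```python
import math

def predict_proba(weights, bias, features):
    """Probability of class 1 for one feature vector under a logistic model."""
    # Linear combination of the real-valued features.
    z = bias + sum(w * x for w, x in zip(weights, features))
    # The logistic (sigmoid) function maps the score into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights, as if they had already been learned from labelled data.
weights = [2.0, -1.0]
bias = 0.0

p = predict_proba(weights, bias, [1.0, 0.5])
label = 1 if p >= 0.5 else 0  # threshold the probability to get a class
```

The point is that the model only ever sees numbers, so strings have to be turned into vectors like `[1.0, 0.5]` before any of this applies.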
For instance, if you're classifying documents as, say, spam or valid email, you may want to start with a bag-of-words model (http://en.wikipedia.org/wiki/Bag-of-words_model) or the rescaled variant TF-IDF (http://en.wikipedia.org/wiki/Tf%E2%80%93idf). You'd turn a single document into a single high-dimensional, sparse vector whose element j encodes the number of appearances of term j (or, for TF-IDF, a rescaled version of that count). Maybe you also want to try featurizing on bigrams, trigrams, etc.

Or, if you're just trying to tell "English language tweets" from "non-English language tweets", the bag of words might be overkill: you could instead try featurizing on just the counts of each pair of consecutive characters. E.g., the first element counts appearances of "aa", the second "ab", and so on through "zy" and "zz". Those will be smaller feature vectors capturing less information, but that's probably sufficient for the simpler task, and you'll be able to fit the model with less data than a whole-word-based model would need.

Different applications are going to need more or less context from your strings (whole words? n-grams? just characters? treat them as enums, as in the days-of-week example?), so it might not make sense for Spark to come with "a direct way" to turn a string attribute into a vector for use in logistic regression. You'll have to settle on the featurization approach that's right for your domain before you try training the logistic regression classifier on your labelled feature vectors.

Best,
-Brian

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-String-Dataset-for-Logistic-Regression-tp5523p5882.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
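P.S. A rough sketch of the two featurizations mentioned above, in plain Python rather than Spark. The function names (`char_bigram_counts`, `bag_of_words`) are hypothetical, the character version assumes only the 26 lowercase ASCII letters matter, and the word version assumes a fixed vocabulary and naive whitespace tokenization. A real pipeline would likely want sparse vectors and a better tokenizer.

```python
from itertools import product
from string import ascii_lowercase

# Fixed vocabulary of character pairs: "aa", "ab", ..., "zy", "zz" (26 * 26 = 676).
BIGRAMS = ["".join(pair) for pair in product(ascii_lowercase, repeat=2)]
BIGRAM_INDEX = {bigram: i for i, bigram in enumerate(BIGRAMS)}

def char_bigram_counts(text):
    """Return a dense 676-element vector of consecutive-letter-pair counts."""
    vector = [0.0] * len(BIGRAMS)
    lowered = text.lower()
    for first, second in zip(lowered, lowered[1:]):
        index = BIGRAM_INDEX.get(first + second)
        if index is not None:  # skip pairs with spaces, digits, punctuation, etc.
            vector[index] += 1.0
    return vector

def bag_of_words(text, vocabulary):
    """Return a vector whose element j counts appearances of vocabulary[j]."""
    tokens = text.lower().split()  # assumption: whitespace tokenization is good enough
    return [float(tokens.count(term)) for term in vocabulary]

tweet_vector = char_bigram_counts("Hello all")
email_vector = bag_of_words("free money free", ["free", "money", "meeting"])
```

Either way, each string becomes a fixed-length numeric vector, which you would then pair with its class label before handing the data to the classifier.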