I have a dataset that's relatively big but easily fits in memory. I want to generate many additional features for this dataset and then run L1-regularized logistic regression on the feature-enhanced data.
The combined features would easily exhaust memory, so I was hoping there was a way to generate the features on the fly for stochastic gradient descent. That is, every time the SGD routine samples from the original dataset, it would compute the new features and use those as the input. With Spark ML it seems you can add transformations to a pipeline, which would work if everything fit into memory. But is it possible to do something like what I'm proposing -- a sort of lazy evaluation within the current library? Or do I need to somehow change GradientDescent.scala myself for this to work?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-ML-MLib-newbie-question-tp25129.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
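[Editor's note: one possible answer, sketched for illustration.] RDD transformations are lazy by default, so a `map` that expands the features is re-evaluated on each pass rather than materialized, as long as the expanded RDD is never cached. A rough sketch of that idea using the old `spark.mllib` API (here `base` is a hypothetical `RDD[(Double, Array[Double])]` of labels and raw features, and the squared-term expansion and parameter values are placeholders, not a recommendation):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.L1Updater
import org.apache.spark.mllib.regression.LabeledPoint

// Lazily expand each example; nothing is materialized because we never
// cache `expanded`, so features are recomputed each time SGD touches them,
// trading CPU for memory.
val expanded = base.map { case (label, raw) =>
  val feats = raw ++ raw.map(x => x * x) // e.g. add squared terms
  LabeledPoint(label, Vectors.dense(feats))
}

// Configure SGD with L1 regularization via the exposed optimizer.
val lr = new LogisticRegressionWithSGD()
lr.optimizer
  .setUpdater(new L1Updater)
  .setRegParam(0.1)
  .setNumIterations(100)
val model = lr.run(expanded)
```

The caveat is that every SGD iteration recomputes the feature expansion, so this only pays off when the expansion is cheap relative to the memory it would otherwise consume; caching `base` (the small original dataset) keeps the recomputation local.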