[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993742#comment-13993742 ]
Erik J. Erlandson commented on SPARK-1473:
------------------------------------------

I'm fairly new to Spark, and hopefully what follows isn't old news...

Feature subsetting ought (imo) to be considered as part of a larger picture that involves various ETL-like tasks such as
*) data assessment -- examining data columns to assess data types (real, integer, categorical/binary), identify noise in the data (empty/missing values, bad values), and suggest possible quantizations
*) data quantization -- mapping values into byte encodings, sparse binary, etc.
*) dataset transposition -- moving from sample-wise to feature-wise orientation (e.g. decision tree training can work more efficiently when data can be traversed by feature)
*) feature extraction, augmentation, reduction

I don't yet have a strong feel for how these tasks should best work in Spark, but in my previous lives I've found they are common and closely integrated tasks when preparing for the care and feeding of ML models.

> Feature selection for high dimensional datasets
> -----------------------------------------------
>
>                 Key: SPARK-1473
>                 URL: https://issues.apache.org/jira/browse/SPARK-1473
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Ignacio Zendejas
>            Priority: Minor
>              Labels: features
>             Fix For: 1.1.0
>
>
> For classification tasks involving large feature spaces on the order of tens
> of thousands of features or higher (e.g., text classification with n-grams,
> where n > 1), it is often useful to rank features and filter out the
> irrelevant ones, thereby reducing the feature space by at least one or two
> orders of magnitude without impacting performance on key evaluation metrics
> (accuracy/precision/recall).
> A flexible feature evaluation interface needs to be designed, and at least
> two methods should be implemented, with Information Gain being a priority as
> it has been shown to be amongst the most reliable.
> Special consideration should be given in the design to wrapper methods (see
> research papers below), which are more practical for lower dimensional data.
> Relevant research:
> * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
> likelihood maximisation: a unifying framework for information theoretic
> feature selection. The Journal of Machine Learning Research, 13, 27-66.
> * Forman, G. (2003). An extensive empirical study of feature selection
> metrics for text classification. The Journal of Machine Learning Research,
> 3, 1289-1305.
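
For illustration only: the issue description names Information Gain as the priority ranking method, so here is a minimal single-machine sketch of what ranking and filtering by information gain could look like. It is plain Python rather than Spark or MLlib code, it assumes discrete (e.g. binary term-presence) features, and the function names `entropy`, `information_gain` and `rank_features` are hypothetical, not part of any existing API.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for one discrete feature column."""
    total = len(labels)
    # Group the class labels by the value the feature takes on each sample.
    groups = {}
    for x, y in zip(feature_values, labels):
        groups.setdefault(x, []).append(y)
    # H(Y | X): entropy within each group, weighted by group size.
    conditional = sum(len(ys) / total * entropy(ys) for ys in groups.values())
    return entropy(labels) - conditional


def rank_features(columns, labels, top_k):
    """Return the names of the top_k features, highest information gain first."""
    scores = {name: information_gain(values, labels)
              for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data (made up): one feature predicts the class, the other does not.
labels = ["spam", "spam", "ham", "ham"]
columns = {
    "has_offer": [1, 1, 0, 0],  # perfectly separates the two classes
    "has_the": [1, 0, 1, 0],    # independent of the class
}
selected = rank_features(columns, labels, top_k=1)
```

Note that the columns are stored feature-wise, which is the orientation the dataset transposition item in the comment refers to: each score needs only one feature column plus the labels. A distributed version would presumably compute the per-feature counts in parallel, but that design is not specified in this thread.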