[GitHub] incubator-hivemall issue #141: [HIVEMALL-117][SPARK] Update the installation...
Github user maropu commented on the issue: https://github.com/apache/incubator-hivemall/pull/141 I'll create a new GitHub account for this purpose and then move the repo there. So, this is pending until the move is finished. ---
[GitHub] incubator-hivemall issue #141: [HIVEMALL-117][SPARK] Update the installation...
Github user myui commented on the issue: https://github.com/apache/incubator-hivemall/pull/141 LGTM. @maropu Could you merge this PR into master? ---
[jira] [Updated] (HIVEMALL-186) UDAF to collect Descriptive Statistics
[ https://issues.apache.org/jira/browse/HIVEMALL-186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Makoto Yui updated HIVEMALL-186: Description: UDAF to show descriptive statistics and frequency distributions by just calling a UDAF would be useful for understanding data. [http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistic] [http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.3_Frequency_distributions] was: UDAF to show descriptive statistics and frequency distributions by just calling a UDAF would be useful for understanding data.[ http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistic|http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics] [http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.3_Frequency_distributions] > UDAF to collect Descriptive Statistics > -- > > Key: HIVEMALL-186 > URL: https://issues.apache.org/jira/browse/HIVEMALL-186 > Project: Hivemall > Issue Type: Improvement >Reporter: Makoto Yui >Priority: Minor > Fix For: 0.6.0 > > > UDAF to show descriptive statistics and frequency distributions by just > calling a UDAF would be useful for understanding data. > [http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistic] > [http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.3_Frequency_distributions] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVEMALL-186) UDAF to collect Descriptive Statistics
Makoto Yui created HIVEMALL-186: --- Summary: UDAF to collect Descriptive Statistics Key: HIVEMALL-186 URL: https://issues.apache.org/jira/browse/HIVEMALL-186 Project: Hivemall Issue Type: Improvement Reporter: Makoto Yui Fix For: 0.6.0 UDAF to show descriptive statistics and frequency distributions by just calling a UDAF would be useful for understanding data. [http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics] [http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.3_Frequency_distributions]
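A Hive UDAF aggregates through partial buffers that are iterated, merged across partitions, and finalized. As a rough illustration of what the proposed descriptive-statistics UDAF would have to maintain, here is a minimal Python sketch of such a mergeable buffer (the class and field names are hypothetical, not Hivemall code; the merge step follows the standard parallel-variance formula so partial aggregates combine correctly):

```python
# Sketch of the partial-aggregation contract (iterate/merge/terminate) a
# descriptive-statistics UDAF would follow. Hypothetical names, not Hivemall code.
from dataclasses import dataclass, field
import math

@dataclass
class StatsBuffer:
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0          # sum of squared deviations (for variance)
    minv: float = math.inf
    maxv: float = -math.inf
    freq: dict = field(default_factory=dict)  # frequency distribution

    def iterate(self, x: float) -> None:
        # Welford's online update: numerically stable single-pass statistics.
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)
        self.minv = min(self.minv, x)
        self.maxv = max(self.maxv, x)
        self.freq[x] = self.freq.get(x, 0) + 1

    def merge(self, other: "StatsBuffer") -> None:
        # Parallel merge (Chan et al.), so partial aggregates combine correctly.
        if other.n == 0:
            return
        n = self.n + other.n
        d = other.mean - self.mean
        self.m2 += other.m2 + d * d * self.n * other.n / n
        self.mean = (self.n * self.mean + other.n * other.mean) / n
        self.n = n
        self.minv = min(self.minv, other.minv)
        self.maxv = max(self.maxv, other.maxv)
        for k, v in other.freq.items():
            self.freq[k] = self.freq.get(k, 0) + v

    def terminate(self) -> dict:
        var = self.m2 / self.n if self.n > 0 else float("nan")
        return {"count": self.n, "mean": self.mean, "variance": var,
                "min": self.minv, "max": self.maxv, "freq": self.freq}
```

In Hive terms, `iterate` corresponds to consuming one row, `merge` to combining partial results from mappers, and `terminate` to emitting the final statistics row.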
[jira] [Commented] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16424912#comment-16424912 ] Makoto Yui commented on HIVEMALL-181: - [~takuti] is working on this kind of feature selection mechanism in our company. It's a feature named GUESS that selects meaningful columns. It uses the [Chain of Responsibility|https://en.wikipedia.org/wiki/Chain-of-responsibility_pattern] pattern for filtering rules. There are a lot of rules, including heuristics to filter out ID columns from explanatory variables. Using standard deviation would be the most beneficial filtering rule. > Plan rewriting rules to filter meaningful training data before feature > selections > > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is one of useful > techniques to choose a subset of relevant data in model construction for > simplification of models and shorter training times. scikit-learn has some > APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this > selection is too time-consuming process if training data have a large number > of columns (the number could frequently go over 1,000 in business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > meaningful training data before feature selection. As a pretty simple > example, Spark might be able to filter out columns with low variances (This > process is corresponding to `VarianceThreshold` in scikit-learn) by > implicitly adding a `Project` node in the top of an user plan. Then, the > Spark optimizer might push down this `Project` node into leaf nodes (e.g., > `LogicalRelation`) and the plan execution could be significantly faster. 
> Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers > and other OSS functionalities) in this ticket to track them. > References: > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. > [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe > to avoid when learning high-capacity classifiers?, Proceedings of the VLDB > Endowment, Volume 11 Issue 3, Pages 366-379, 2017.
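The Chain-of-Responsibility filtering described in the comment above can be sketched as follows. This is a hypothetical illustration, not the actual GUESS code: each rule inspects a column and either rejects it or passes it along the chain, and the rule names and heuristics (all-distinct values look ID-like, zero standard deviation is uninformative) are assumptions for the example.

```python
# Hypothetical Chain-of-Responsibility sketch for column-filtering rules.
from statistics import pstdev

class Rule:
    def __init__(self, nxt=None):
        self.nxt = nxt                      # next rule in the chain, or None

    def accept(self, name, values):
        # Reject here, or delegate to the rest of the chain.
        if not self.check(name, values):
            return False
        return self.nxt.accept(name, values) if self.nxt else True

    def check(self, name, values):
        raise NotImplementedError

class RejectIdLike(Rule):
    # Heuristic: a column whose values are all distinct looks like an ID column.
    def check(self, name, values):
        return len(set(values)) < len(values)

class RejectLowStddev(Rule):
    # Heuristic: a column with (near-)zero standard deviation carries no signal.
    def __init__(self, threshold, nxt=None):
        super().__init__(nxt)
        self.threshold = threshold

    def check(self, name, values):
        return pstdev(values) > self.threshold

chain = RejectIdLike(RejectLowStddev(threshold=0.0))
```

New heuristics are added by prepending another `Rule` subclass to the chain, which is the main appeal of the pattern here.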
[jira] [Created] (HIVEMALL-185) Add an optimizer rule to push down a Sample plan node into fact tables
Takeshi Yamamuro created HIVEMALL-185: - Summary: Add an optimizer rule to push down a Sample plan node into fact tables Key: HIVEMALL-185 URL: https://issues.apache.org/jira/browse/HIVEMALL-185 Project: Hivemall Issue Type: Sub-task Reporter: Takeshi Yamamuro Assignee: Takeshi Yamamuro Sampling is a common technique to extract a part of data in joined relations (fact tables and dimension tables) for training data. The optimizer in Spark cannot push down a Sample plan node into larger fact tables because this node is non-deterministic. But, by using RI constraints, we could push down this node into fact tables in some cases.
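The intuition behind the pushdown can be shown with a small, purely illustrative example (plain Python, not Spark code; the table and column names are made up). When an RI constraint guarantees that every fact row joins exactly one dimension row, sampling the fact table before the join selects the same joined rows as sampling after it:

```python
# Illustration of why Sample can be pushed below a key-foreign-key join:
# under the RI constraint each fact row matches exactly one dimension row,
# so the join neither drops nor duplicates rows, and the two plans agree.
import random

facts = [{"fk": i % 3, "v": i} for i in range(100)]
dims = {0: "a", 1: "b", 2: "c"}   # RI constraint: every fact.fk is a dim key

def join(rows):
    return [{**r, "d": dims[r["fk"]]} for r in rows]

def sample(rows, fraction, seed):
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

pushed = join(sample(facts, 0.2, seed=42))    # Sample pushed below the join
unpushed = sample(join(facts), 0.2, seed=42)  # Sample above the join
```

If the join could drop or duplicate fact rows (no RI guarantee), the two plans would sample from different row multisets, which is exactly why the optimizer cannot perform this rewrite unconditionally.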
[jira] [Updated] (HIVEMALL-184) Add an optimizer rule to filter out columns by using Mutual Information
[ https://issues.apache.org/jira/browse/HIVEMALL-184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-184: -- Labels: spark (was: ) > Add an optimizer rule to filter out columns by using Mutual Information > --- > > Key: HIVEMALL-184 > URL: https://issues.apache.org/jira/browse/HIVEMALL-184 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > Mutual Information (MI) is an indicator to find and quantify dependencies > between variables, so the indicator is useful to filter out columns in > feature selection. Nearest-neighbor distances are frequently used to estimate > MI [1], so we could use the distances to compute MI between columns for each > relation when running an ANALYZE command. Then, we could filter out "similar" > columns in the optimizer phase by referring a new threshold (e.g. > `spark.sql.optimizer.featureSelection.mutualInfoThreshold`). > In another story, we need to consider a light-weight way to update MI when > re-running an ANALYZE command. A recent study [2] proposed a sophisticated > technique to compute MI for dynamic data. > [1] Dafydd Evans, A computationally efficient estimator for mutual > information. > In Proceedings of the Royal Society of London A: Mathematical, Physical > and Engineering Sciences, Vol. 464. The Royal Society, 1203–1215, 2008. > [2] Michael Vollmer et al., On Complexity and Efficiency of Mutual Information > Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018.
[jira] [Updated] (HIVEMALL-184) Add an optimizer rule to filter out columns by using Mutual Information
[ https://issues.apache.org/jira/browse/HIVEMALL-184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-184: -- Description: Mutual Information (MI) is an indicator to find and quantify dependencies between variables, so the indicator is useful to filter out columns in feature selection. Nearest-neighbor distances are frequently used to estimate MI [1], so we could use the distances to compute MI between columns for each relation when running an ANALYZE command. Then, we could filter out "similar" columns in the optimizer phase by referring a new threshold (e.g. `spark.sql.optimizer.featureSelection.mutualInfoThreshold`). In another story, we need to consider a light-weight way to update MI when re-running an ANALYZE command. A recent study [2] proposed a sophisticated technique to compute MI for dynamic data. [1] Dafydd Evans, A computationally efficient estimator for mutual information. In Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, Vol. 464. The Royal Society, 1203–1215, 2008. [2] Michael Vollmer et al., On Complexity and Efficiency of Mutual Information Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018. was: Mutual Information (MI) is an indicator to find and quantify dependencies between variables, so the indicator is useful to filter out columns in feature selection. Nearest-neighbor distances are frequently used to estimate MI [1], so we could use the distances to compute MI between columns for each relation when running an ANALYZE command. Then, we could filter out "similar" columns in the optimizer phase by referring a new threshold (e.g. `spark.sql.optimizer.featureSelection.mutualInfoThreshold`). In another story, we need to consider a light-weight way to update MI when re-running an ANALYZE command. A recent study [2] proposed a sophisticated technique to compute MI for dynamic data. 
[1] Dafydd Evans, A computationally efficient estimator for mutual information. In Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, Vol. 464. The Royal Society, 1203–1215, 2008. [2] Michael Vollmer et al., On Complexity and Efficiency of Mutual Information Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018. > Add an optimizer rule to filter out columns by using Mutual Information > --- > > Key: HIVEMALL-184 > URL: https://issues.apache.org/jira/browse/HIVEMALL-184 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > Mutual Information (MI) is an indicator to find and quantify dependencies > between variables, so the indicator is useful to filter out columns in > feature selection. Nearest-neighbor distances are frequently used to estimate > MI [1], so we could use the distances to compute MI between columns for each > relation when running an ANALYZE command. Then, we could filter out "similar" > columns in the optimizer phase by referring a new threshold (e.g. > `spark.sql.optimizer.featureSelection.mutualInfoThreshold`). > In another story, we need to consider a light-weight way to update MI when > re-running an ANALYZE command. A recent study [2] proposed a sophisticated > technique to compute MI for dynamic data. > [1] Dafydd Evans, A computationally efficient estimator for mutual > information. In Proceedings of the Royal Society of London A: Mathematical, > Physical > and Engineering Sciences, Vol. 464. The Royal Society, 1203–1215, 2008. > [2] Michael Vollmer et al., On Complexity and Efficiency of Mutual > Information Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018.
[jira] [Created] (HIVEMALL-184) Add an optimizer rule to filter out columns by using Mutual Information
Takeshi Yamamuro created HIVEMALL-184: - Summary: Add an optimizer rule to filter out columns by using Mutual Information Key: HIVEMALL-184 URL: https://issues.apache.org/jira/browse/HIVEMALL-184 Project: Hivemall Issue Type: Sub-task Reporter: Takeshi Yamamuro Assignee: Takeshi Yamamuro Mutual Information (MI) is an indicator to find and quantify dependencies between variables, so the indicator is useful to filter out columns in feature selection. Nearest-neighbor distances are frequently used to estimate MI [1], so we could use the distances to compute MI between columns for each relation when running an ANALYZE command. Then, we could filter out "similar" columns in the optimizer phase by referring a new threshold (e.g. `spark.sql.optimizer.featureSelection.mutualInfoThreshold`). In another story, we need to consider a light-weight way to update MI when re-running an ANALYZE command. A recent study [2] proposed a sophisticated technique to compute MI for dynamic data. [1] Dafydd Evans, A computationally efficient estimator for mutual information. In Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, Vol. 464. The Royal Society, 1203–1215, 2008. [2] Michael Vollmer et al., On Complexity and Efficiency of Mutual Information Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018.
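The ticket proposes nearest-neighbor estimators for MI over continuous columns; for intuition, here is the exact discrete-case computation of the same quantity, I(X;Y) = Σ p(x,y) log( p(x,y) / (p(x)p(y)) ), from joint counts (a stdlib Python sketch, not the estimator the ticket would use):

```python
# Exact mutual information for two discrete columns, from joint counts.
# MI is 0 for independent columns and grows with dependency, which is
# what the proposed optimizer rule would threshold on.
import math
from collections import Counter

def mutual_information(xs, ys):
    n = len(xs)
    pxy = Counter(zip(xs, ys))   # joint counts
    px = Counter(xs)             # marginal counts for X
    py = Counter(ys)             # marginal counts for Y
    mi = 0.0
    for (x, y), c in pxy.items():
        # (c/n) * log( (c/n) / ((px/n)*(py/n)) ), with n's cancelled:
        mi += (c / n) * math.log((c * n) / (px[x] * py[y]))
    return mi
```

For identical columns MI equals the column's entropy; for independent columns it is exactly zero, so a threshold such as the proposed `spark.sql.optimizer.featureSelection.mutualInfoThreshold` would separate "similar" column pairs from unrelated ones.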
[GitHub] incubator-hivemall pull request #141: [HIVEMALL-117][SPARK] Update the insta...
GitHub user maropu opened a pull request: https://github.com/apache/incubator-hivemall/pull/141 [HIVEMALL-117][SPARK] Update the installation guide for Spark ## What changes were proposed in this pull request? This PR updates the installation guide for Spark. ## What type of PR is it? Documentation ## What is the Jira issue? https://issues.apache.org/jira/browse/HIVEMALL-117 ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/maropu/incubator-hivemall HIVEMALL-117 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-hivemall/pull/141.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #141 commit 1c0eb11b3095f8891d95ba84a84019c2e0142d47 Author: Takeshi Yamamuro Date: 2018-04-04T01:27:27Z Update the installation guide for Spark ---
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Summary: Plan rewrting rules to filter meaningful training data before feature selections (was: Plan rewrting rules to filter meaningful training data before future selections) > Plan rewrting rules to filter meaningful training data before feature > selections > > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is one of useful > techniques to choose a subset of relevant data in model construction for > simplification of models and shorter training times. scikit-learn has some > APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this > selection is too time-consuming process if training data have a large number > of columns (the number could frequently go over 1,000 in business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > meaningful training data before feature selection. As a pretty simple > example, Spark might be able to filter out columns with low variances (This > process is corresponding to `VarianceThreshold` in scikit-learn) by > implicitly adding a `Project` node in the top of an user plan. Then, the > Spark optimizer might push down this `Project` node into leaf nodes (e.g., > `LogicalRelation`) and the plan execution could be significantly faster. > Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers > and other OSS functionalities) in this ticket to track them. > References: > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. 
Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. > [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe > to avoid when learning high-capacity classifiers?, Proceedings of the VLDB > Endowment, Volume 11 Issue 3, Pages 366-379, 2017.
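The `VarianceThreshold`-style pruning described in this ticket can be sketched in a few lines. This is a plain-Python illustration under assumed names (in Spark it would be an implicit `Project` node over the surviving columns, not these functions):

```python
# Minimal sketch of variance-threshold column pruning: drop columns whose
# population variance is at or below a threshold before feature selection.
from statistics import pvariance

def low_variance_columns(table, threshold):
    # table: dict of column name -> list of numeric values
    return [name for name, col in table.items() if pvariance(col) <= threshold]

def project_out(table, drop):
    # Analogue of adding a Project node that keeps only the surviving columns.
    return {name: col for name, col in table.items() if name not in drop}
```

In the optimizer setting, the benefit comes from pushing this projection down to the scan (`LogicalRelation`), so low-variance columns are never read at all.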
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before future selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Description: In machine learning and statistics, feature selection is one of useful techniques to choose a subset of relevant data in model construction for simplification of models and shorter training times. scikit-learn has some APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this selection is too time-consuming process if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter meaningful training data before feature selection. As a pretty simple example, Spark might be able to filter out columns with low variances (This process is corresponding to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node in the top of an user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. was: In machine learning and statistics, feature selection is one of useful techniques to choose a subset of relevant data in model construction for simplification of models and shorter training times. 
scikit-learn has some APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this selection is too time-consuming process if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter meaningful training data before feature selection. As a simple example, Spark might be able to filter out columns with low variances (This process is corresponding to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node in the top of an user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophicated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functinalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. > Plan rewrting rules to filter meaningful training data before future > selections > --- > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is one of useful > techniques to choose a subset of relevant data in model construction for > simplification of models and shorter training times. 
scikit-learn has some > APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this > selection is too time-consuming process if training data have a large number > of columns (the number could frequently go over 1,000 in business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > meaningful training data before feature selection. As a pretty simple > example, Spark might be able to filter out columns with low variances (This > process is corresponding to `VarianceThreshold` in scikit-learn) by > implicitly adding a `Project` node in the top of an user plan. Then, the > Spark optimizer might push down this `Project` node into leaf nodes (e.g., > `LogicalRelation`) and the plan execution could be significantly faster. > Moreover, more sophisticated techniques
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before future selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Description: In machine learning and statistics, feature selection is one of useful techniques to choose a subset of relevant data in model construction for simplification of models and shorter training times. scikit-learn has some APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this selection is too time-consuming process if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter meaningful training data before feature selection. As a simple example, Spark might be able to filter out columns with low variances (This process is corresponding to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node in the top of an user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophicated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functinalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. was: In machine learning and statistics, feature selection is one of useful techniques to choose a subset of relevant data in model construction for simplification of models and shorter training times. 
scikit-learn has some APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this selection is too time-consuming process if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter out meaningless columns before feature selection. As a simple example, Spark might be able to filter out columns with low variances (This process is corresponding to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node in the top of an user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophicated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functinalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. > Plan rewrting rules to filter meaningful training data before future > selections > --- > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is one of useful > techniques to choose a subset of relevant data in model construction for > simplification of models and shorter training times. 
scikit-learn has some > APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this > selection is too time-consuming process if training data have a large number > of columns (the number could frequently go over 1,000 in business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > meaningful training data before feature selection. As a simple example, Spark > might be able to filter out columns with low variances (This process is > corresponding to `VarianceThreshold` in scikit-learn) by implicitly adding a > `Project` node in the top of an user plan. > Then, the Spark optimizer might push down this `Project` node into leaf > nodes (e.g., `LogicalRelation`) and the plan execution could be significantly > faster. Moreover, more sophicated techniques have been proposed in
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter out meaningful training data before future selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Description: In machine learning and statistics, feature selection is one of useful techniques to choose a subset of relevant data in model construction for simplification of models and shorter training times. scikit-learn has some APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this selection is too time-consuming process if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter out meaningless columns before feature selection. As a simple example, Spark might be able to filter out columns with low variances (This process is corresponding to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node in the top of an user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophicated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functinalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. was: In machine learning and statistics, feature selection is a useful techniqe to choose a subset of relevant features in model construction for simplification of models and shorter training times. 
scikit-learn has some APIs for feature selection (http://scikit-learn.org/stable/modules/feature_selection.html), but this selection is too time-consuming process if training data have a large number of columns (the number could frequently go over 1,000 in bisiness use cases). An objective of this ticket is to add new optimizer rules in Spark to filter out meaningless columns before feature selection. As a simple example, Spark might be able to filter out columns with low variances (This process is corresponding to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node in the top of an user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophicated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functinalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. > Plan rewrting rules to filter out meaningful training data before future > selections > --- > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is one of useful > techniques to choose a subset of relevant data in model construction for > simplification of models and shorter training times. 
scikit-learn has some > APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this > selection is too time-consuming process if training data have a large number > of columns (the number could frequently go over 1,000 in business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > out meaningless columns before feature selection. As a simple example, Spark > might be able to filter out columns with low variances (This process is > corresponding to `VarianceThreshold` in scikit-learn) by implicitly adding a > `Project` node in the top of an user plan. > Then, the Spark optimizer might push down this `Project` node into leaf > nodes (e.g., `LogicalRelation`) and the plan execution could be significantly > faster. Moreover, more sophicated techniques have been proposed in
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before future selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Summary: Plan rewriting rules to filter meaningful training data before feature selection (was: Plan rewriting rules to filter out meaningful training data before feature selection)
> Plan rewriting rules to filter meaningful training data before feature selection
> ---
>
> Key: HIVEMALL-181
> URL: https://issues.apache.org/jira/browse/HIVEMALL-181
> Project: Hivemall
> Issue Type: Improvement
> Reporter: Takeshi Yamamuro
> Assignee: Takeshi Yamamuro
> Priority: Major
> Labels: spark
>
> In machine learning and statistics, feature selection is a useful technique for choosing a subset of relevant features during model construction, simplifying models and shortening training times. scikit-learn has some APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this selection is a very time-consuming process if the training data have a large number of columns (the number frequently exceeds 1,000 in business use cases).
> An objective of this ticket is to add new optimizer rules in Spark to filter out meaningless columns before feature selection. As a simple example, Spark might be able to filter out columns with low variances (this corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node on top of a user plan.
> Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2].
> I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them.
> References:
> [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016.
> [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
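The low-variance filtering the ticket describes is the idea behind scikit-learn's `VarianceThreshold`. A minimal sketch of that column-filtering step, in pure Python on toy data (the proposed Spark rule would operate on Catalyst plans rather than in-memory lists; the column names and data here are made up for illustration):

```python
def variance(values):
    """Population variance of a numeric column."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def filter_low_variance(columns, threshold=0.0):
    """Keep only columns whose variance exceeds `threshold`
    (mirrors sklearn.feature_selection.VarianceThreshold)."""
    return {name: vals for name, vals in columns.items()
            if variance(vals) > threshold}

# Toy training data: column "b" is constant, so it carries no signal.
data = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [5.0, 5.0, 5.0, 5.0],
    "c": [0.0, 1.0, 0.0, 1.0],
}
kept = filter_low_variance(data, threshold=0.0)
# Column "b" (variance 0) is dropped; "a" and "c" survive.
```

In Spark terms, the rewrite would compute (or estimate) per-column variances and insert a `Project` node keeping only the surviving columns, which the optimizer could then push down toward `LogicalRelation` to prune reads at the source.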
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter out meaningful training data before feature selection
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Summary: Plan rewriting rules to filter out meaningful training data before feature selection (was: Plan rewriting rules to filter out meaningless columns before feature selection)
> Plan rewriting rules to filter out meaningful training data before feature selection
> ---
>
> Key: HIVEMALL-181
> URL: https://issues.apache.org/jira/browse/HIVEMALL-181
> Project: Hivemall
> Issue Type: Improvement
> Reporter: Takeshi Yamamuro
> Assignee: Takeshi Yamamuro
> Priority: Major
> Labels: spark
>
> In machine learning and statistics, feature selection is a useful technique for choosing a subset of relevant features during model construction, simplifying models and shortening training times. scikit-learn has some APIs for feature selection (http://scikit-learn.org/stable/modules/feature_selection.html), but this selection is a very time-consuming process if the training data have a large number of columns (the number frequently exceeds 1,000 in business use cases).
> An objective of this ticket is to add new optimizer rules in Spark to filter out meaningless columns before feature selection. As a simple example, Spark might be able to filter out columns with low variances (this corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node on top of a user plan.
> Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2].
> I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them.
> References:
> [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016.
> [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017.