[GitHub] incubator-hivemall issue #141: [HIVEMALL-117][SPARK] Update the installation...

2018-04-03 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/incubator-hivemall/pull/141
  
I'll create a new GitHub account for this purpose and then move the repo 
there. So, this is pending until the move is finished.


---


[GitHub] incubator-hivemall issue #141: [HIVEMALL-117][SPARK] Update the installation...

2018-04-03 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/141
  
LGTM. 

@maropu 
Could you merge this PR into master?


---


[jira] [Updated] (HIVEMALL-186) UDAF to collect Descriptive Statistics

2018-04-03 Thread Makoto Yui (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVEMALL-186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Makoto Yui updated HIVEMALL-186:

Description: 
A UDAF that shows descriptive statistics and frequency distributions with a 
single call would be useful for understanding data.

[http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistic]

[http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.3_Frequency_distributions]
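To make the proposal concrete, here is a minimal, dependency-free Python sketch of the kind of statistics such a UDAF might aggregate, loosely mirroring the Commons Math `DescriptiveStatistics` and `Frequency` classes linked above. The function names and the exact set of statistics are illustrative assumptions, not Hivemall's actual API:

```python
import math
from collections import Counter

def describe(values):
    """Descriptive statistics over a list of numbers, similar in spirit
    to Commons Math's DescriptiveStatistics."""
    n = len(values)
    mean = sum(values) / n
    # Sample variance (n - 1 denominator), as Commons Math uses by default.
    var = sum((v - mean) ** 2 for v in values) / (n - 1) if n > 1 else 0.0
    ordered = sorted(values)
    median = (ordered[(n - 1) // 2] + ordered[n // 2]) / 2
    return {
        "count": n, "min": ordered[0], "max": ordered[-1],
        "mean": mean, "stddev": math.sqrt(var), "median": median,
    }

def frequency(values):
    """Frequency distribution: value -> (count, cumulative proportion)."""
    counts = Counter(values)
    total = len(values)
    cum = 0
    table = {}
    for v in sorted(counts):
        cum += counts[v]
        table[v] = (counts[v], cum / total)
    return table

stats = describe([1.0, 2.0, 3.0, 4.0])
freq = frequency(["a", "b", "a", "c"])
```

In a real UDAF these aggregates would be computed incrementally over rows rather than from a materialized list.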

  was:
UDAF to show descriptive statistics and frequency distributions by just calling 
a UDAF would be useful for understanding data.[

http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistic|http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics]

[http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.3_Frequency_distributions]


> UDAF to collect Descriptive Statistics
> --
>
> Key: HIVEMALL-186
> URL: https://issues.apache.org/jira/browse/HIVEMALL-186
> Project: Hivemall
>  Issue Type: Improvement
>Reporter: Makoto Yui
>Priority: Minor
> Fix For: 0.6.0
>
>
> A UDAF that shows descriptive statistics and frequency distributions with a 
> single call would be useful for understanding data.
> [http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistic]
> [http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.3_Frequency_distributions]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVEMALL-186) UDAF to collect Descriptive Statistics

2018-04-03 Thread Makoto Yui (JIRA)
Makoto Yui created HIVEMALL-186:
---

 Summary: UDAF to collect Descriptive Statistics
 Key: HIVEMALL-186
 URL: https://issues.apache.org/jira/browse/HIVEMALL-186
 Project: Hivemall
  Issue Type: Improvement
Reporter: Makoto Yui
 Fix For: 0.6.0


A UDAF that shows descriptive statistics and frequency distributions with a 
single call would be useful for understanding data.

[http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistic]

[http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.3_Frequency_distributions]





[jira] [Commented] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections

2018-04-03 Thread Makoto Yui (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16424912#comment-16424912
 ] 

Makoto Yui commented on HIVEMALL-181:
-

[~takuti] is working on this kind of feature selection mechanism in our company.

It's a feature named GUESS that selects meaningful columns. It uses the [Chain of 
Responsibility|https://en.wikipedia.org/wiki/Chain-of-responsibility_pattern] 
pattern for its filtering rules.

There are many rules, including heuristics to filter out ID columns from 
explanatory variables. Using the standard deviation would be the most 
beneficial filtering rule.
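GUESS itself is not public, so as an illustration only, the Chain of Responsibility structure described above can be sketched in Python. The rule classes and the two heuristics below (ID-column detection, low standard deviation) are hypothetical, not the actual implementation:

```python
class FilterRule:
    """One link in a Chain of Responsibility: each rule may reject a
    column; otherwise it delegates to the next rule in the chain."""
    def __init__(self, next_rule=None):
        self.next = next_rule

    def accepts(self, name, values):
        if self.rejects(name, values):
            return False
        return self.next.accepts(name, values) if self.next else True

    def rejects(self, name, values):
        return False  # base rule rejects nothing

class IdColumnRule(FilterRule):
    # Heuristic: drop ID-like columns (name hint plus all-unique values).
    def rejects(self, name, values):
        return name.lower().endswith("id") and len(set(values)) == len(values)

class LowStddevRule(FilterRule):
    # Drop near-constant columns whose standard deviation is tiny.
    def __init__(self, threshold=1e-9, next_rule=None):
        super().__init__(next_rule)
        self.threshold = threshold

    def rejects(self, name, values):
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        return var ** 0.5 < self.threshold

chain = IdColumnRule(next_rule=LowStddevRule())
table = {
    "user_id": [1, 2, 3, 4],   # rejected: looks like an ID column
    "constant": [5, 5, 5, 5],  # rejected: zero standard deviation
    "age": [23, 35, 31, 40],   # kept
}
kept = [c for c, vals in table.items() if chain.accepts(c, vals)]
```

The pattern's benefit is that new heuristics can be appended to the chain without touching existing rules.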

> Plan rewriting rules to filter meaningful training data before feature 
> selections
> 
>
> Key: HIVEMALL-181
> URL: https://issues.apache.org/jira/browse/HIVEMALL-181
> Project: Hivemall
>  Issue Type: Improvement
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
>  Labels: spark
>
> In machine learning and statistics, feature selection is a useful technique 
> for choosing a subset of relevant data in model construction, for simpler 
> models and shorter training times. scikit-learn has some APIs for feature 
> selection ([http://scikit-learn.org/stable/modules/feature_selection.html]), 
> but this selection is a very time-consuming process if the training data have 
> a large number of columns (the number can frequently go over 1,000 in 
> business use cases).
> An objective of this ticket is to add new optimizer rules in Spark to filter 
> meaningful training data before feature selection. As a pretty simple 
> example, Spark might be able to filter out columns with low variances (this 
> process corresponds to `VarianceThreshold` in scikit-learn) by implicitly 
> adding a `Project` node on top of a user plan. Then, the Spark optimizer 
> might push down this `Project` node into leaf nodes (e.g., 
> `LogicalRelation`), and the plan execution could be significantly faster. 
> Moreover, more sophisticated techniques have been proposed in [1, 2].
> I will make pull requests as sub-tasks and put relevant activities (papers 
> and other OSS functionalities) in this ticket to track them.
> References:
>  [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join 
> or Not to Join?: Thinking Twice about Joins before Feature Selection, 
> Proceedings of SIGMOD, 2016.
>  [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe 
> to avoid when learning high-capacity classifiers?, Proceedings of the VLDB 
> Endowment, Volume 11 Issue 3, Pages 366-379, 2017.
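The `VarianceThreshold` analogy in the description can be sketched in plain Python. scikit-learn's actual transformer operates on numeric arrays; this dependency-free version only illustrates the idea of projecting away near-constant columns before feature selection:

```python
def variance(values):
    # Population variance, matching scikit-learn's VarianceThreshold.
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def select_columns(columns, threshold=0.0):
    """Keep only columns whose variance exceeds the threshold, i.e. the
    implicit `Project` the ticket proposes to add over a user plan."""
    return [name for name, vals in columns.items()
            if variance(vals) > threshold]

columns = {
    "c0": [0, 0, 0, 0],      # zero variance: dropped
    "c1": [1, 2, 1, 2],      # variance 0.25: kept
    "c2": [10, 20, 30, 40],  # variance 125: kept
}
projected = select_columns(columns, threshold=0.1)
```

In the proposed rule, the surviving column list would become a `Project` node that the optimizer can then push toward `LogicalRelation` leaves.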





[jira] [Created] (HIVEMALL-185) Add an optimizer rule to push down a Sample plan node into fact tables

2018-04-03 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created HIVEMALL-185:
-

 Summary: Add an optimizer rule to push down a Sample plan node 
into fact tables
 Key: HIVEMALL-185
 URL: https://issues.apache.org/jira/browse/HIVEMALL-185
 Project: Hivemall
  Issue Type: Sub-task
Reporter: Takeshi Yamamuro
Assignee: Takeshi Yamamuro


Sampling is a common technique to extract a part of the data in joined relations 
(fact tables and dimension tables) for training data. The optimizer in Spark 
cannot push a Sample plan node down into larger fact tables because this node 
is non-deterministic. But, by using RI (referential integrity) constraints, we 
could push this node down into fact tables in some cases.
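Why RI constraints make the pushdown safe can be illustrated in Python: under a key-foreign-key join, every fact row matches exactly one dimension row, so sampling the fact table before the join yields the same rows as sampling the joined relation. The tables and the sampling scheme below are illustrative, not Spark's implementation:

```python
import random

def join(fact, dim):
    # Key-foreign-key join: every fact row matches exactly one dim row.
    index = {row["id"]: row for row in dim}
    return [{**f, **index[f["dim_id"]]} for f in fact]

def sample_after_join(fact, dim, k, seed):
    # Naive plan: materialize the full join, then sample it.
    random.seed(seed)
    return random.sample(join(fact, dim), k)

def sample_pushed_down(fact, dim, k, seed):
    # Pushed-down plan: sample the fact table first, then join the
    # (much smaller) sample. Valid when RI guarantees a 1:1 match.
    random.seed(seed)
    return join(random.sample(fact, k), dim)

dim = [{"id": i, "region": "r%d" % (i % 3)} for i in range(5)]
fact = [{"row": j, "dim_id": j % 5} for j in range(100)]
a = sample_after_join(fact, dim, 10, seed=42)
b = sample_pushed_down(fact, dim, 10, seed=42)
```

Here both plans pick the same fact rows, but the pushed-down plan joins 10 rows instead of 100, which is the speedup the ticket is after.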





[jira] [Updated] (HIVEMALL-184) Add an optimizer rule to filter out columns by using Mutual Information

2018-04-03 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVEMALL-184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated HIVEMALL-184:
--
Labels: spark  (was: )

> Add an optimizer rule to filter out columns by using Mutual Information
> ---
>
> Key: HIVEMALL-184
> URL: https://issues.apache.org/jira/browse/HIVEMALL-184
> Project: Hivemall
>  Issue Type: Sub-task
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
>  Labels: spark
>
> Mutual Information (MI) is an indicator to find and quantify dependencies 
> between variables, so the indicator is useful to filter out columns in 
> feature selection. Nearest-neighbor distances are frequently used to estimate 
> MI [1], so we could use the distances to compute MI between columns for each 
> relation when running an ANALYZE command. Then, we could filter out "similar" 
> columns in the optimizer phase by referring to a new threshold (e.g. 
> `spark.sql.optimizer.featureSelection.mutualInfoThreshold`).
> In another story, we need to consider a light-weight way to update MI when 
> re-running an ANALYZE command. A recent study [2] proposed a sophisticated 
> technique to compute MI for dynamic data.
> [1] Dafydd Evans, A computationally efficient estimator for mutual 
> information.
> In Proceedings of the Royal Society of London A: Mathematical, Physical
> and Engineering Sciences, Vol. 464. The Royal Society, 1203–1215, 2008.
> [2] Michael Vollmer et al., On Complexity and Efficiency of Mutual Information
> Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018.
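For discrete columns, MI can be estimated directly from a joint histogram; the nearest-neighbor estimator in [1] covers the continuous case. A minimal Python sketch of the discrete plug-in estimate (illustrative only, in nats):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in MI estimate from empirical joint frequencies.
    The kNN estimator in [1] generalizes this to continuous columns."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in pxy.items():
        # c/n is p(x,y); px[x]/n and py[y]/n are the marginals.
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

# A column paired with itself carries maximal MI (its entropy), so such
# a pair would exceed any reasonable threshold, while an independent
# pair scores ~0.
xs = [0, 0, 1, 1]
ind = [0, 1, 0, 1]  # independent of xs
mi_self = mutual_information(xs, xs)
mi_ind = mutual_information(xs, ind)
```

A rule like the proposed `spark.sql.optimizer.featureSelection.mutualInfoThreshold` would drop one column of a pair whose MI exceeds the threshold.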





[jira] [Updated] (HIVEMALL-184) Add an optimizer rule to filter out columns by using Mutual Information

2018-04-03 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVEMALL-184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated HIVEMALL-184:
--
Description: 
Mutual Information (MI) is an indicator to find and quantify dependencies 
between variables, so the indicator is useful to filter out columns in feature 
selection. Nearest-neighbor distances are frequently used to estimate MI [1], 
so we could use the distances to compute MI between columns for each relation 
when running an ANALYZE command. Then, we could filter out "similar" columns in 
the optimizer phase by referring to a new threshold (e.g. 
`spark.sql.optimizer.featureSelection.mutualInfoThreshold`).

In another story, we need to consider a light-weight way to update MI when 
re-running an ANALYZE command. A recent study [2] proposed a sophisticated 
technique to compute MI for dynamic data.

[1] Dafydd Evans, A computationally efficient estimator for mutual information. 
In Proceedings of the Royal Society of London A: Mathematical, Physical
 and Engineering Sciences, Vol. 464. The Royal Society, 1203–1215, 2008.
 [2] Michael Vollmer et al., On Complexity and Efficiency of Mutual Information 
Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018.

  was:
Mutual Information (MI) is an indicator to find and quantify dependencies 
between variables, so the indicator is useful to filter out columns in feature 
selection. Nearest-neighbor distances are frequently used to estimate MI [1], 
so we could use the distances to compute MI between columns for each relation 
when running an ANALYZE command. Then, we could filter out "similar" columns in 
the optimizer phase by referring to a new threshold (e.g. 
`spark.sql.optimizer.featureSelection.mutualInfoThreshold`).

In another story, we need to consider a light-weight way to update MI when 
re-running an ANALYZE command. A recent study [2] proposed a sophisticated 
technique to compute MI for dynamic data.

[1] Dafydd Evans, A computationally efficient estimator for mutual information.
In Proceedings of the Royal Society of London A: Mathematical, Physical
and Engineering Sciences, Vol. 464. The Royal Society, 1203–1215, 2008.
[2] Michael Vollmer et al., On Complexity and Efficiency of Mutual Information
Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018.



> Add an optimizer rule to filter out columns by using Mutual Information
> ---
>
> Key: HIVEMALL-184
> URL: https://issues.apache.org/jira/browse/HIVEMALL-184
> Project: Hivemall
>  Issue Type: Sub-task
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
>  Labels: spark
>
> Mutual Information (MI) is an indicator to find and quantify dependencies 
> between variables, so the indicator is useful to filter out columns in 
> feature selection. Nearest-neighbor distances are frequently used to estimate 
> MI [1], so we could use the distances to compute MI between columns for each 
> relation when running an ANALYZE command. Then, we could filter out "similar" 
> columns in the optimizer phase by referring to a new threshold (e.g. 
> `spark.sql.optimizer.featureSelection.mutualInfoThreshold`).
> In another story, we need to consider a light-weight way to update MI when 
> re-running an ANALYZE command. A recent study [2] proposed a sophisticated 
> technique to compute MI for dynamic data.
> [1] Dafydd Evans, A computationally efficient estimator for mutual 
> information. In Proceedings of the Royal Society of London A: Mathematical, 
> Physical
>  and Engineering Sciences, Vol. 464. The Royal Society, 1203–1215, 2008.
>  [2] Michael Vollmer et al., On Complexity and Efficiency of Mutual 
> Information Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018.





[jira] [Created] (HIVEMALL-184) Add an optimizer rule to filter out columns by using Mutual Information

2018-04-03 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created HIVEMALL-184:
-

 Summary: Add an optimizer rule to filter out columns by using 
Mutual Information
 Key: HIVEMALL-184
 URL: https://issues.apache.org/jira/browse/HIVEMALL-184
 Project: Hivemall
  Issue Type: Sub-task
Reporter: Takeshi Yamamuro
Assignee: Takeshi Yamamuro


Mutual Information (MI) is an indicator to find and quantify dependencies 
between variables, so the indicator is useful to filter out columns in feature 
selection. Nearest-neighbor distances are frequently used to estimate MI [1], 
so we could use the distances to compute MI between columns for each relation 
when running an ANALYZE command. Then, we could filter out "similar" columns in 
the optimizer phase by referring to a new threshold (e.g. 
`spark.sql.optimizer.featureSelection.mutualInfoThreshold`).

In another story, we need to consider a light-weight way to update MI when 
re-running an ANALYZE command. A recent study [2] proposed a sophisticated 
technique to compute MI for dynamic data.

[1] Dafydd Evans, A computationally efficient estimator for mutual information.
In Proceedings of the Royal Society of London A: Mathematical, Physical
and Engineering Sciences, Vol. 464. The Royal Society, 1203–1215, 2008.
[2] Michael Vollmer et al., On Complexity and Efficiency of Mutual Information
Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018.






[GitHub] incubator-hivemall pull request #141: [HIVEMALL-117][SPARK] Update the insta...

2018-04-03 Thread maropu
GitHub user maropu opened a pull request:

https://github.com/apache/incubator-hivemall/pull/141

[HIVEMALL-117][SPARK] Update the installation guide for Spark

## What changes were proposed in this pull request?
This PR updates the installation guide for Spark.

## What type of PR is it?
Documentation

## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-117

## How was this patch tested?
N/A



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/maropu/incubator-hivemall HIVEMALL-117

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-hivemall/pull/141.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #141


commit 1c0eb11b3095f8891d95ba84a84019c2e0142d47
Author: Takeshi Yamamuro 
Date:   2018-04-04T01:27:27Z

Update the installation guide for Spark




---


[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections

2018-04-03 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated HIVEMALL-181:
--
Summary: Plan rewriting rules to filter meaningful training data before 
feature selections  (was: Plan rewriting rules to filter meaningful training 
data before future selections)

> Plan rewriting rules to filter meaningful training data before feature 
> selections
> 
>
> Key: HIVEMALL-181
> URL: https://issues.apache.org/jira/browse/HIVEMALL-181
> Project: Hivemall
>  Issue Type: Improvement
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
>  Labels: spark
>
> In machine learning and statistics, feature selection is a useful technique 
> for choosing a subset of relevant data in model construction, for simpler 
> models and shorter training times. scikit-learn has some APIs for feature 
> selection ([http://scikit-learn.org/stable/modules/feature_selection.html]), 
> but this selection is a very time-consuming process if the training data have 
> a large number of columns (the number can frequently go over 1,000 in 
> business use cases).
> An objective of this ticket is to add new optimizer rules in Spark to filter 
> meaningful training data before feature selection. As a pretty simple 
> example, Spark might be able to filter out columns with low variances (this 
> process corresponds to `VarianceThreshold` in scikit-learn) by implicitly 
> adding a `Project` node on top of a user plan. Then, the Spark optimizer 
> might push down this `Project` node into leaf nodes (e.g., 
> `LogicalRelation`), and the plan execution could be significantly faster. 
> Moreover, more sophisticated techniques have been proposed in [1, 2].
> I will make pull requests as sub-tasks and put relevant activities (papers 
> and other OSS functionalities) in this ticket to track them.
> References:
>  [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join 
> or Not to Join?: Thinking Twice about Joins before Feature Selection, 
> Proceedings of SIGMOD, 2016.
>  [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe 
> to avoid when learning high-capacity classifiers?, Proceedings of the VLDB 
> Endowment, Volume 11 Issue 3, Pages 366-379, 2017.





[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before future selections

2018-04-03 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated HIVEMALL-181:
--
Description: 
In machine learning and statistics, feature selection is a useful technique for 
choosing a subset of relevant data in model construction, for simpler models 
and shorter training times. scikit-learn has some APIs for feature selection 
([http://scikit-learn.org/stable/modules/feature_selection.html]), but this 
selection is a very time-consuming process if the training data have a large 
number of columns (the number can frequently go over 1,000 in business use 
cases).

An objective of this ticket is to add new optimizer rules in Spark to filter 
meaningful training data before feature selection. As a pretty simple example, 
Spark might be able to filter out columns with low variances (this process 
corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a 
`Project` node on top of a user plan. Then, the Spark optimizer might push down 
this `Project` node into leaf nodes (e.g., `LogicalRelation`), and the plan 
execution could be significantly faster. Moreover, more sophisticated 
techniques have been proposed in [1, 2].

I will make pull requests as sub-tasks and put relevant activities (papers and 
other OSS functionalities) in this ticket to track them.

References:
 [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join 
or Not to Join?: Thinking Twice about Joins before Feature Selection, 
Proceedings of SIGMOD, 2016.
 [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to 
avoid when learning high-capacity classifiers?, Proceedings of the VLDB 
Endowment, Volume 11 Issue 3, Pages 366-379, 2017.

  was:
In machine learning and statistics, feature selection is one of useful 
techniques to choose a subset of relevant data in model construction for 
simplification of models and shorter training times. scikit-learn has some APIs 
for feature selection 
([http://scikit-learn.org/stable/modules/feature_selection.html]), but this 
selection is too time-consuming process if training data have a large number of 
columns (the number could frequently go over 1,000 in business use cases).

An objective of this ticket is to add new optimizer rules in Spark to filter 
meaningful training data before feature selection. As a simple example, Spark 
might be able to filter out columns with low variances (This process is 
corresponding to `VarianceThreshold` in scikit-learn) by implicitly adding a 
`Project` node in the top of an user plan.
 Then, the Spark optimizer might push down this `Project` node into leaf nodes 
(e.g., `LogicalRelation`) and the plan execution could be significantly faster. 
Moreover, more sophisticated techniques have been proposed in [1, 2].

I will make pull requests as sub-tasks and put relevant activities (papers and 
other OSS functionalities) in this ticket to track them.

References:
 [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join 
or Not to Join?: Thinking Twice about Joins before Feature Selection, 
Proceedings of SIGMOD, 2016.
 [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to 
avoid when learning high-capacity classifiers?, Proceedings of the VLDB 
Endowment, Volume 11 Issue 3, Pages 366-379, 2017.


> Plan rewriting rules to filter meaningful training data before future 
> selections
> ---
>
> Key: HIVEMALL-181
> URL: https://issues.apache.org/jira/browse/HIVEMALL-181
> Project: Hivemall
>  Issue Type: Improvement
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
>  Labels: spark
>
> In machine learning and statistics, feature selection is a useful technique 
> for choosing a subset of relevant data in model construction, for simpler 
> models and shorter training times. scikit-learn has some APIs for feature 
> selection ([http://scikit-learn.org/stable/modules/feature_selection.html]), 
> but this selection is a very time-consuming process if the training data have 
> a large number of columns (the number can frequently go over 1,000 in 
> business use cases).
> An objective of this ticket is to add new optimizer rules in Spark to filter 
> meaningful training data before feature selection. As a pretty simple 
> example, Spark might be able to filter out columns with low variances (this 
> process corresponds to `VarianceThreshold` in scikit-learn) by implicitly 
> adding a `Project` node on top of a user plan. Then, the Spark optimizer 
> might push down this `Project` node into leaf nodes (e.g., 
> `LogicalRelation`), and the plan execution could be significantly faster. 
> Moreover, more sophisticated techniques 

[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before future selections

2018-04-03 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated HIVEMALL-181:
--
Description: 
In machine learning and statistics, feature selection is a useful technique for 
choosing a subset of relevant data in model construction, for simpler models 
and shorter training times. scikit-learn has some APIs for feature selection 
([http://scikit-learn.org/stable/modules/feature_selection.html]), but this 
selection is a very time-consuming process if the training data have a large 
number of columns (the number can frequently go over 1,000 in business use 
cases).

An objective of this ticket is to add new optimizer rules in Spark to filter 
meaningful training data before feature selection. As a simple example, Spark 
might be able to filter out columns with low variances (this process 
corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a 
`Project` node on top of a user plan. Then, the Spark optimizer might push down 
this `Project` node into leaf nodes (e.g., `LogicalRelation`), and the plan 
execution could be significantly faster. Moreover, more sophisticated 
techniques have been proposed in [1, 2].

I will make pull requests as sub-tasks and put relevant activities (papers and 
other OSS functionalities) in this ticket to track them.

References:
 [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join 
or Not to Join?: Thinking Twice about Joins before Feature Selection, 
Proceedings of SIGMOD, 2016.
 [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to 
avoid when learning high-capacity classifiers?, Proceedings of the VLDB 
Endowment, Volume 11 Issue 3, Pages 366-379, 2017.

  was:
In machine learning and statistics, feature selection is one of useful 
techniques to choose a subset of relevant data in model construction for 
simplification of models and shorter training times. scikit-learn has some APIs 
for feature selection 
([http://scikit-learn.org/stable/modules/feature_selection.html]), but this 
selection is too time-consuming process if training data have a large number of 
columns (the number could frequently go over 1,000 in business use cases).

An objective of this ticket is to add new optimizer rules in Spark to filter 
out meaningless columns before feature selection. As a simple example, Spark 
might be able to filter out columns with low variances (This process is 
corresponding to `VarianceThreshold` in scikit-learn) by implicitly adding a 
`Project` node in the top of an user plan.
 Then, the Spark optimizer might push down this `Project` node into leaf nodes 
(e.g., `LogicalRelation`) and the plan execution could be significantly faster. 
Moreover, more sophisticated techniques have been proposed in [1, 2].

I will make pull requests as sub-tasks and put relevant activities (papers and 
other OSS functionalities) in this ticket to track them.

References:
 [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join 
or Not to Join?: Thinking Twice about Joins before Feature Selection, 
Proceedings of SIGMOD, 2016.
 [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to 
avoid when learning high-capacity classifiers?, Proceedings of the VLDB 
Endowment, Volume 11 Issue 3, Pages 366-379, 2017.


> Plan rewriting rules to filter meaningful training data before future 
> selections
> ---
>
> Key: HIVEMALL-181
> URL: https://issues.apache.org/jira/browse/HIVEMALL-181
> Project: Hivemall
>  Issue Type: Improvement
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
>  Labels: spark
>
> In machine learning and statistics, feature selection is one of useful 
> techniques to choose a subset of relevant data in model construction for 
> simplification of models and shorter training times. scikit-learn has some 
> APIs for feature selection 
> ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this 
> selection is too time-consuming process if training data have a large number 
> of columns (the number could frequently go over 1,000 in business use cases).
> An objective of this ticket is to add new optimizer rules in Spark to filter 
> meaningful training data before feature selection. As a simple example, Spark 
> might be able to filter out columns with low variances (This process is 
> corresponding to `VarianceThreshold` in scikit-learn) by implicitly adding a 
> `Project` node in the top of an user plan.
>  Then, the Spark optimizer might push down this `Project` node into leaf 
> nodes (e.g., `LogicalRelation`) and the plan execution could be significantly 
> faster. Moreover, more sophisticated techniques have been proposed in 

[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter out meaningful training data before future selections

2018-04-03 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated HIVEMALL-181:
--
Description: 
In machine learning and statistics, feature selection is a useful technique for 
choosing a subset of relevant data in model construction, for simpler models 
and shorter training times. scikit-learn has some APIs for feature selection 
([http://scikit-learn.org/stable/modules/feature_selection.html]), but this 
selection is a very time-consuming process if the training data have a large 
number of columns (the number can frequently go over 1,000 in business use 
cases).

An objective of this ticket is to add new optimizer rules in Spark to filter 
out meaningless columns before feature selection. As a simple example, Spark 
might be able to filter out columns with low variances (this process 
corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a 
`Project` node on top of a user plan. Then, the Spark optimizer might push down 
this `Project` node into leaf nodes (e.g., `LogicalRelation`), and the plan 
execution could be significantly faster. Moreover, more sophisticated 
techniques have been proposed in [1, 2].

I will make pull requests as sub-tasks and put relevant activities (papers and 
other OSS functionalities) in this ticket to track them.

References:
 [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join 
or Not to Join?: Thinking Twice about Joins before Feature Selection, 
Proceedings of SIGMOD, 2016.
 [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to 
avoid when learning high-capacity classifiers?, Proceedings of the VLDB 
Endowment, Volume 11 Issue 3, Pages 366-379, 2017.

  was:
In machine learning and statistics, feature selection is a useful technique to 
choose a subset of relevant features in model construction for simplification 
of models and shorter training times. scikit-learn has some APIs for feature 
selection (http://scikit-learn.org/stable/modules/feature_selection.html), but 
this selection is a very time-consuming process if training data have a large 
number of columns (the number could frequently go over 1,000 in business use 
cases).

An objective of this ticket is to add new optimizer rules in Spark to filter 
out meaningless columns before feature selection.  As a simple example, Spark 
might be able to filter out columns with low variances (This process is 
corresponding to `VarianceThreshold` in scikit-learn) by implicitly adding a 
`Project` node in the top of an user plan.
Then, the Spark optimizer might push down this `Project` node into leaf nodes 
(e.g., `LogicalRelation`) and the plan execution could be significantly faster. 
Moreover, more sophisticated techniques have been proposed in [1, 2].

I will make pull requests as sub-tasks and put relevant activities (papers and 
other OSS functionalities) in this ticket to track them.

References:
[1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or 
Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings 
of SIGMOD, 2016.
[2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to 
avoid when learning high-capacity classifiers?, Proceedings of the VLDB 
Endowment, Volume 11 Issue 3, Pages 366-379, 2017. 


> Plan rewriting rules to filter out meaningful training data before future 
> selections
> ---
>
> Key: HIVEMALL-181
> URL: https://issues.apache.org/jira/browse/HIVEMALL-181
> Project: Hivemall
>  Issue Type: Improvement
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
>  Labels: spark
>
> In machine learning and statistics, feature selection is one of the useful 
> techniques for choosing a subset of relevant data in model construction, for 
> simplification of models and shorter training times. scikit-learn has some 
> APIs for feature selection 
> ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this 
> selection is a very time-consuming process if training data have a large number 
> of columns (the number could frequently go over 1,000 in business use cases).
> An objective of this ticket is to add new optimizer rules in Spark to filter 
> out meaningless columns before feature selection. As a simple example, Spark 
> might be able to filter out columns with low variances (this process 
> corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a 
> `Project` node on top of a user plan.
>  Then, the Spark optimizer might push down this `Project` node into leaf 
> nodes (e.g., `LogicalRelation`) and the plan execution could be significantly 
> faster. Moreover, more sophisticated techniques have been proposed in [1, 2].

[jira] [Updated] (HIVEMALL-181) Plan rewrting rules to filter meaningful training data before future selections

2018-04-03 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated HIVEMALL-181:
--
Summary: Plan rewrting rules to filter meaningful training data before 
future selections  (was: Plan rewrting rules to filter out meaningful training 
data before future selections)

> Plan rewrting rules to filter meaningful training data before future 
> selections
> ---
>
> Key: HIVEMALL-181
> URL: https://issues.apache.org/jira/browse/HIVEMALL-181
> Project: Hivemall
>  Issue Type: Improvement
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
>  Labels: spark
>
> In machine learning and statistics, feature selection is one of the useful 
> techniques for choosing a subset of relevant data in model construction, for 
> simplification of models and shorter training times. scikit-learn has some 
> APIs for feature selection 
> ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this 
> selection is a very time-consuming process if training data have a large number 
> of columns (the number could frequently go over 1,000 in business use cases).
> An objective of this ticket is to add new optimizer rules in Spark to filter 
> out meaningless columns before feature selection. As a simple example, Spark 
> might be able to filter out columns with low variances (this process 
> corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a 
> `Project` node on top of a user plan.
>  Then, the Spark optimizer might push down this `Project` node into leaf 
> nodes (e.g., `LogicalRelation`) and the plan execution could be significantly 
> faster. Moreover, more sophisticated techniques have been proposed in [1, 2].
> I will make pull requests as sub-tasks and put relevant activities (papers 
> and other OSS functionalities) in this ticket to track them.
> References:
>  [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join 
> or Not to Join?: Thinking Twice about Joins before Feature Selection, 
> Proceedings of SIGMOD, 2016.
>  [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe 
> to avoid when learning high-capacity classifiers?, Proceedings of the VLDB 
> Endowment, Volume 11 Issue 3, Pages 366-379, 2017.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVEMALL-181) Plan rewrting rules to filter out meaningful training data before future selections

2018-04-03 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated HIVEMALL-181:
--
Summary: Plan rewrting rules to filter out meaningful training data before 
future selections  (was: Plan rewrting rules to filter out meaningless columns 
before future selections)

> Plan rewrting rules to filter out meaningful training data before future 
> selections
> ---
>
> Key: HIVEMALL-181
> URL: https://issues.apache.org/jira/browse/HIVEMALL-181
> Project: Hivemall
>  Issue Type: Improvement
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
>  Labels: spark
>
> In machine learning and statistics, feature selection is a useful technique to 
> choose a subset of relevant features in model construction for simplification 
> of models and shorter training times. scikit-learn has some APIs for feature 
> selection (http://scikit-learn.org/stable/modules/feature_selection.html), 
> but this selection is a very time-consuming process if training data have a 
> large number of columns (the number could frequently go over 1,000 in 
> business use cases).
> An objective of this ticket is to add new optimizer rules in Spark to filter 
> out meaningless columns before feature selection.  As a simple example, Spark 
> might be able to filter out columns with low variances (this process 
> corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a 
> `Project` node on top of a user plan.
> Then, the Spark optimizer might push down this `Project` node into leaf nodes 
> (e.g., `LogicalRelation`) and the plan execution could be significantly 
> faster. Moreover, more sophisticated techniques have been proposed in [1, 2].
> I will make pull requests as sub-tasks and put relevant activities (papers 
> and other OSS functionalities) in this ticket to track them.
> References:
> [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join 
> or Not to Join?: Thinking Twice about Joins before Feature Selection, 
> Proceedings of SIGMOD, 2016.
> [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to 
> avoid when learning high-capacity classifiers?, Proceedings of the VLDB 
> Endowment, Volume 11 Issue 3, Pages 366-379, 2017. 
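The `Project` push-down described in the ticket can be illustrated with a toy plan rewriter. This is a deliberately simplified sketch (plans as nested tuples), not Spark's actual Catalyst API; the node names and helper are hypothetical:

```python
# Toy rewrite rule: if a column-pruning Project sits directly on a Scan,
# fold the column list into the Scan so only those columns are read.

def push_down_project(plan):
    """Rewrite ('project', cols, ('scan', table, all_cols)) into
    ('scan', table, cols); leave any other plan unchanged."""
    if plan[0] == "project" and plan[2][0] == "scan":
        _, cols, (_, table, _all_cols) = plan
        return ("scan", table, cols)
    return plan

# A pruning Project added on top of the user plan (e.g. after dropping a
# constant column), pushed into the leaf so the scan reads fewer columns.
plan = ("project", ["age", "income"],
        ("scan", "train", ["age", "income", "const_col"]))
optimized = push_down_project(plan)
# the scan now reads only the two surviving columns
```

In Spark, the analogous rule would operate on `LogicalPlan` nodes and push the implicit `Project` toward leaves such as `LogicalRelation`, so that column pruning reaches the data source.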


