[jira] [Commented] (HIVEMALL-242) Drop support for Spark 2.1
[ https://issues.apache.org/jira/browse/HIVEMALL-242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773985#comment-16773985 ]

Takeshi Yamamuro commented on HIVEMALL-242:
-------------------------------------------

Yea, I think so, too. I'll drop v2.1 and support v2.4.

> Drop support for Spark 2.1
> --------------------------
>
>              Key: HIVEMALL-242
>              URL: https://issues.apache.org/jira/browse/HIVEMALL-242
>          Project: Hivemall
>       Issue Type: Task
> Affects Versions: 0.5.2
>         Reporter: Makoto Yui
>         Assignee: Takeshi Yamamuro
>         Priority: Minor
>           Labels: spark
>          Fix For: 0.6.0
>
>
> We can drop Spark 2.1 support in Hivemall: Spark 2.1 requires Java 7, while Spark 2.2 or later requires Java 8 or later.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
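The Java-version constraint above is ultimately a build setting. As a hedged sketch (the exact plugin placement in Hivemall's pom.xml is an assumption), dropping Spark 2.1 would let the build pin Java 8 as the minimum via the standard `maven-compiler-plugin`:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <configuration>
    <!-- Spark 2.2+ requires Java 8, so 1.8 becomes the floor once 2.1 is dropped -->
    <source>1.8</source>
    <target>1.8</target>
  </configuration>
</plugin>
```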
[jira] [Updated] (HIVEMALL-225) Upgrade spark from v2.3.0 to v2.3.2
[ https://issues.apache.org/jira/browse/HIVEMALL-225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated HIVEMALL-225:
--------------------------------------
    Labels: spark  (was: )

> Upgrade spark from v2.3.0 to v2.3.2
> -----------------------------------
>
>          Key: HIVEMALL-225
>          URL: https://issues.apache.org/jira/browse/HIVEMALL-225
>      Project: Hivemall
>   Issue Type: Improvement
>     Reporter: Takeshi Yamamuro
>     Priority: Trivial
>       Labels: spark
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (HIVEMALL-224) Support brickhouse functions for hivemall-spark
[ https://issues.apache.org/jira/browse/HIVEMALL-224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated HIVEMALL-224:
--------------------------------------
    Labels: spark  (was: )

> Support brickhouse functions for hivemall-spark
> -----------------------------------------------
>
>          Key: HIVEMALL-224
>          URL: https://issues.apache.org/jira/browse/HIVEMALL-224
>      Project: Hivemall
>   Issue Type: Improvement
>     Reporter: Takeshi Yamamuro
>     Priority: Major
>       Labels: spark
>
>
> Add these functions in HivemallOps:
> https://github.com/apache/incubator-hivemall/commit/1e1b77ea4724c48f56dd1f3aa15027506558dee1

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (HIVEMALL-225) Upgrade spark from v2.3.0 to v2.3.2
Takeshi Yamamuro created HIVEMALL-225:
--------------------------------------

       Summary: Upgrade spark from v2.3.0 to v2.3.2
           Key: HIVEMALL-225
           URL: https://issues.apache.org/jira/browse/HIVEMALL-225
       Project: Hivemall
    Issue Type: Improvement
      Reporter: Takeshi Yamamuro

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (HIVEMALL-224) Support brickhouse functions for hivemall-spark
Takeshi Yamamuro created HIVEMALL-224:
--------------------------------------

       Summary: Support brickhouse functions for hivemall-spark
           Key: HIVEMALL-224
           URL: https://issues.apache.org/jira/browse/HIVEMALL-224
       Project: Hivemall
    Issue Type: Improvement
      Reporter: Takeshi Yamamuro

Add these functions in HivemallOps:
https://github.com/apache/incubator-hivemall/commit/1e1b77ea4724c48f56dd1f3aa15027506558dee1

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated HIVEMALL-181:
--------------------------------------
    Description:

In machine learning and statistics, feature selection is a useful technique for choosing a subset of relevant data during model construction, both to simplify models and to shorten training times; e.g., scikit-learn has some APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]). But this selection is a time-consuming process if the training data have a large number of columns and rows (for example, the number of columns frequently goes over 1,000 in real business use cases).

The objective of this ticket is to implement plan rewriting rules in Spark Catalyst that filter meaningful training data before feature selection. We assume the workflow below, from data extraction to model training:

!fig1.png!

In the example workflow above, one prepares raw training data, R(v1, v2, v3, v4) in the figure, by joining and projecting input data (R1, R2, and R3) from various datasources (HDFS, S3, JDBC, ...); then, to choose a relevant subset (the red box) of the raw data, sampling and feature selection are applied to it. In real business use cases, it sometimes happens that raw training data have many meaningless columns for historical reasons (e.g., redundant schema designs). So, if we could filter out these meaningless data in the data-extraction phase, both the data extraction itself and the following feature selection would run more efficiently. In the example above, we actually need not join the relation R3, because all of its columns are filtered out in feature selection. Also, the join processing should be faster if we could sample directly from the input data (R1 and R2). The optimized workflow is as follows:

!fig2.png!

This optimization might be achieved by rewriting the plan tree for data extraction as follows:

!fig3.png!

Since Spark already has a pluggable optimizer interface (extendedOperatorOptimizationRules) and a framework to collect data statistics for input data in data sources, the major task of this ticket is to add plan rewriting rules that filter meaningful training data before feature selection. As a pretty simple first step, Spark might get a rule that filters out columns with low variances (corresponding to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node at the top of a user plan. The Spark optimizer could then push this `Project` node down into leaf nodes (e.g., `LogicalRelation`), and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2, 3]. I will make pull requests as sub-tasks and track relevant activities (research and other OSS functionalities) in this ticket.

References:
[1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016.
[2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017.
[3] Z. Zhao, R. Christensen, F. Li, X. Hu, K. Yi, Random Sampling over Joins Revisited, Proceedings of SIGMOD, 2018.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
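The "simple first step" described in the ticket, dropping low-variance columns before any expensive downstream work, can be sketched outside of Catalyst. The following is a minimal, self-contained Python illustration of that variance-threshold filter (the analogue of scikit-learn's `VarianceThreshold`); the column names, sample values, and threshold are illustrative assumptions, not taken from the ticket:

```python
def column_variance(values):
    """Population variance of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def filter_low_variance_columns(table, threshold=0.0):
    """Keep only columns whose variance exceeds `threshold`.

    `table` maps a column name to its list of numeric values. In the
    ticket's setting, dropping such columns early (via an implicit
    Project node pushed down toward the data sources) is what lets the
    optimizer skip scanning, and possibly joining, relations that
    contribute nothing to feature selection.
    """
    return {
        name: values
        for name, values in table.items()
        if column_variance(values) > threshold
    }

training_data = {
    "v1": [1.0, 2.0, 3.0, 4.0],   # informative column, kept
    "v2": [5.0, 5.0, 5.0, 5.0],   # constant -> variance 0, filtered out
    "v3": [0.1, 0.2, 0.1, 0.2],   # informative column, kept
}
selected = filter_low_variance_columns(training_data, threshold=0.0)
print(sorted(selected))  # ['v1', 'v3']
```

In Catalyst terms, the surviving column names would become the projection list of the implicitly added `Project` node, which the existing pushdown rules can then move below joins and into the leaf relations.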
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated HIVEMALL-181:
--------------------------------------
    Attachment: fig3.png
                fig2.png

> Plan rewriting rules to filter meaningful training data before feature
> selections
> ----------------------------------------------------------------------
>
>          Key: HIVEMALL-181
>          URL: https://issues.apache.org/jira/browse/HIVEMALL-181
>      Project: Hivemall
>   Issue Type: Improvement
>     Reporter: Takeshi Yamamuro
>     Assignee: Takeshi Yamamuro
>     Priority: Major
>       Labels: spark
>  Attachments: fig1.png, fig2.png, fig3.png
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated HIVEMALL-181:
--------------------------------------
    Attachment: (was: fig2.png)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Attachment: fig2.png fig3.png

> Plan rewriting rules to filter meaningful training data before feature selections
> ---------------------------------------------------------------------------------
>
> Key: HIVEMALL-181
> URL: https://issues.apache.org/jira/browse/HIVEMALL-181
> Project: Hivemall
> Issue Type: Improvement
> Reporter: Takeshi Yamamuro
> Assignee: Takeshi Yamamuro
> Priority: Major
> Labels: spark
> Attachments: fig1.png, fig2.png, fig3.png
>
> In machine learning and statistics, feature selection is a useful technique for choosing a subset of relevant features during model construction, both to simplify models and to shorten training times; e.g., scikit-learn provides several APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]). However, this selection becomes a time-consuming process when the training data have a large number of columns and rows (for example, the number of columns frequently goes over 1,000 in real business use cases).
> The objective of this ticket is to implement plan rewriting rules in Spark Catalyst that filter meaningful training data before feature selection. We assume the workflow below, from data extraction to model training:
> !fig1.png!
> In the example workflow above, one prepares the raw training data, R(v1, v2, v3, v4) in the figure, by joining and projecting input data (R1, R2, and R3) from various data sources (HDFS, S3, JDBC, ...); then sampling and feature selection are applied to choose a relevant subset (the red box) of the raw data. In real business use cases, it sometimes happens that raw training data have many meaningless columns for historical reasons (e.g., redundant schema designs). So, if we could filter out these meaningless data during the data extraction phase, both the extraction itself and the following feature selection would run more efficiently.
> In the example above, we actually need not join the relation R3 because all of its columns are filtered out in feature selection. Also, the join processing should be faster if we could sample data directly from the input data (R1 and R2). The optimized workflow is as follows:
> !fig2.png!
> This optimization might be achieved by rewriting the plan tree for data extraction as follows:
> !fig3.png!
> Since Spark already has a pluggable optimizer interface (extendedOperatorOptimizationRules) and a framework to collect statistics on input data in data sources, the major task of this ticket is to add plan rewriting rules that filter meaningful training data before feature selection.
> As a pretty simple first task, Spark might have a rule that filters out columns with low variances (this process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node on top of a user plan. Then, the Spark optimizer might push this `Project` node down into leaf nodes (e.g., `LogicalRelation`), and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2].
> I will make pull requests as sub-tasks and track relevant activities (papers and other OSS functionalities) in this ticket.
>
> References:
> [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016.
> [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are Key-Foreign Key Joins Safe to Avoid when Learning High-Capacity Classifiers?, Proceedings of the VLDB Endowment, Volume 11, Issue 3, Pages 366-379, 2017.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
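The variance-threshold pruning and the resulting join avoidance described in the ticket can be sketched, outside of Spark, in a few lines of plain Python. The relation and column names below are illustrative, loosely mirroring R1/R2/R3 from the figures; this is only a sketch of the idea, not the proposed Catalyst rule itself:

```python
from statistics import pvariance

def low_variance_columns(table, threshold=0.0):
    """Names of columns whose population variance is <= threshold
    (the idea behind scikit-learn's VarianceThreshold)."""
    return {name for name, values in table.items()
            if pvariance(values) <= threshold}

def prunable_payload(table, key="k"):
    """True if every non-key column of the relation would be dropped,
    so the join against this relation can be skipped entirely."""
    payload = {c for c in table if c != key}
    return payload <= low_variance_columns(table)

# Illustrative input relations keyed on "k".
r1 = {"k": [1, 2, 3], "v1": [0.1, 0.9, 0.4]}
r2 = {"k": [1, 2, 3], "v2": [2.0, 8.0, 1.0]}
r3 = {"k": [1, 2, 3], "v3": [7.0, 7.0, 7.0],   # every payload column of R3
      "v4": [3.0, 3.0, 3.0]}                   # is constant, hence prunable

joins_needed = [name for name, t in [("R1", r1), ("R2", r2), ("R3", r3)]
                if not prunable_payload(t)]
print(joins_needed)  # -> ['R1', 'R2']; R3 never needs to be joined
```

In Catalyst terms, the first step corresponds to implicitly adding a `Project` over the user plan, and the second to letting column pruning eliminate the now-unreferenced join input.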
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Attachment: (was: fig3.png)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Attachment: (was: fig2.png)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Attachment: fig1.png
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Attachment: (was: fig1.png)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Description: (updated; full text quoted in the first notification above)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Attachment: fig3.png
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Description: (updated; full text quoted in the first notification above)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Attachment: fig2.png > Plan rewriting rules to filter meaningful training data before feature > selections > - > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > Attachments: fig1.png, fig2.png, fig3.png > > > In machine learning and statistics, feature selection is one of the useful > techniques for choosing a subset of relevant data in model construction, both to > simplify models and to shorten training times; e.g., scikit-learn has > some APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]). But this > selection is an excessively time-consuming process if training data have a large number > of columns and rows (for example, the number of columns can frequently go > over 1,000 in real business use cases). > An objective of this ticket is to implement plan rewriting rules in Spark > Catalyst to filter meaningful training data before feature selection. We > assume the workflow below, from data extraction to model training; > !fig1.png! > In the example workflow above, one prepares raw training data, R(v1, v2, v3, > v4) in the figure, by joining and projecting input data (R1, R2, and R3) in > various datasources (HDFS, S3, JDBC, ...); then, to choose a relevant subset > (the red box) of the raw data, sampling and feature selection are applied to them. > In real business use cases, it sometimes happens that raw training data have > many meaningless columns for historical reasons (e.g., redundant > schema designs). So, if we could filter out these meaningless data in the > data-extraction phase, we could make both the data extraction > itself and the following feature selection more efficient. 
In the example above, we actually > need not join the relation R3 because all the columns in that relation are > filtered out in feature selection. Also, the join processing should be faster > if we could sample data directly from the input data (R1 and R2). This > optimized workflow is as follows; > > This optimization might be achieved by rewriting a plan tree for data > extraction as follows; > > Since Spark already has a pluggable optimizer interface > (extendedOperatorOptimizationRules) and a framework to collect data > statistics for input data in data sources, the major tasks of this ticket are > to add plan rewriting rules to filter meaningful training data before feature > selections. > As a pretty simple task, Spark might have a rule to filter out columns with > low variances (this process corresponds to `VarianceThreshold` in > scikit-learn) by implicitly adding a `Project` node on top of a user > plan. Then, the Spark optimizer might push down this `Project` node into leaf > nodes (e.g., `LogicalRelation`) and the plan execution could be significantly > faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers > and other OSS functionalities) in this ticket to track them. > References: > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. > [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe > to avoid when learning high-capacity classifiers?, Proceedings of the VLDB > Endowment, Volume 11 Issue 3, Pages 366-379, 2017. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
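[Editor's illustration] The low-variance rule described above would live inside Spark Catalyst as a Scala optimizer rule; the sketch below only illustrates the column-pruning logic itself (the analogue of scikit-learn's `VarianceThreshold`) in plain Python, with hypothetical names:

```python
# Illustrative sketch only: the ticket proposes doing this inside Spark
# Catalyst as a plan rewriting rule; here the same column-pruning logic
# (the `VarianceThreshold` analogue) is shown in plain Python.

def variance(values):
    """Population variance of a numeric column."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def prune_low_variance_columns(rows, threshold=0.0):
    """Return the column names whose variance exceeds `threshold`.

    `rows` is a list of dicts (column name -> numeric value). A Catalyst
    rule would instead add a `Project` node keeping only these columns,
    which the optimizer can then push down toward `LogicalRelation`.
    """
    if not rows:
        return []
    kept = []
    for col in rows[0].keys():
        if variance([r[col] for r in rows]) > threshold:
            kept.append(col)
    return kept

rows = [
    {"v1": 1.0, "v2": 5.0, "v3": 0.0},
    {"v1": 2.0, "v2": 5.0, "v3": 0.0},
    {"v1": 3.0, "v2": 5.0, "v3": 0.0},
]
# v2 and v3 are constant, so only v1 survives the filter.
print(prune_low_variance_columns(rows))  # -> ['v1']
```

If such pruning fires before the join in the workflow above, a relation like R3 whose columns are all pruned never needs to be scanned or joined at all.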
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Attachment: fig1.png > Plan rewriting rules to filter meaningful training data before feature > selections > - > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > Attachments: fig1.png, fig2.png, fig3.png > > > In machine learning and statistics, feature selection is one of the useful > techniques for choosing a subset of relevant data in model construction, both to > simplify models and to shorten training times; e.g., scikit-learn has > some APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]). But this > selection is an excessively time-consuming process if training data have a large number > of columns and rows (for example, the number of columns can frequently go > over 1,000 in real business use cases). > An objective of this ticket is to implement plan rewriting rules in Spark > Catalyst to filter meaningful training data before feature selection. We > assume the workflow below, from data extraction to model training; > !fig1.png! > In the example workflow above, one prepares raw training data, R(v1, v2, v3, > v4) in the figure, by joining and projecting input data (R1, R2, and R3) in > various datasources (HDFS, S3, JDBC, ...); then, to choose a relevant subset > (the red box) of the raw data, sampling and feature selection are applied to them. > In real business use cases, it sometimes happens that raw training data have > many meaningless columns for historical reasons (e.g., redundant > schema designs). So, if we could filter out these meaningless data in the > data-extraction phase, we could make both the data extraction > itself and the following feature selection more efficient. 
In the example above, we actually > need not join the relation R3 because all the columns in that relation are > filtered out in feature selection. Also, the join processing should be faster > if we could sample data directly from the input data (R1 and R2). This > optimized workflow is as follows; > > This optimization might be achieved by rewriting a plan tree for data > extraction as follows; > > Since Spark already has a pluggable optimizer interface > (extendedOperatorOptimizationRules) and a framework to collect data > statistics for input data in data sources, the major tasks of this ticket are > to add plan rewriting rules to filter meaningful training data before feature > selections. > As a pretty simple task, Spark might have a rule to filter out columns with > low variances (this process corresponds to `VarianceThreshold` in > scikit-learn) by implicitly adding a `Project` node on top of a user > plan. Then, the Spark optimizer might push down this `Project` node into leaf > nodes (e.g., `LogicalRelation`) and the plan execution could be significantly > faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers > and other OSS functionalities) in this ticket to track them. > References: > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. > [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe > to avoid when learning high-capacity classifiers?, Proceedings of the VLDB > Endowment, Volume 11 Issue 3, Pages 366-379, 2017. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Description: In machine learning and statistics, feature selection is one of the useful techniques for choosing a subset of relevant data in model construction, both to simplify models and to shorten training times; e.g., scikit-learn has some APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]). But this selection is an excessively time-consuming process if training data have a large number of columns and rows (for example, the number of columns can frequently go over 1,000 in real business use cases). An objective of this ticket is to implement plan rewriting rules in Spark Catalyst to filter meaningful training data before feature selection. We assume the workflow below, from data extraction to model training; !fig1.png! In the example workflow above, one prepares raw training data, R(v1, v2, v3, v4) in the figure, by joining and projecting input data (R1, R2, and R3) in various datasources (HDFS, S3, JDBC, ...); then, to choose a relevant subset (the red box) of the raw data, sampling and feature selection are applied to them. In real business use cases, it sometimes happens that raw training data have many meaningless columns for historical reasons (e.g., redundant schema designs). So, if we could filter out these meaningless data in the data-extraction phase, we could make both the data extraction itself and the following feature selection more efficient. In the example above, we actually need not join the relation R3 because all the columns in that relation are filtered out in feature selection. Also, the join processing should be faster if we could sample data directly from the input data (R1 and R2). 
This optimized workflow is as follows; This optimization might be achieved by rewriting a plan tree for data extraction as follows; Since Spark already has a pluggable optimizer interface (extendedOperatorOptimizationRules) and a framework to collect data statistics for input data in data sources, the major tasks of this ticket are to add plan rewriting rules to filter meaningful training data before feature selections. As a pretty simple task, Spark might have a rule to filter out columns with low variances (this process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node on top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. was: In machine learning and statistics, feature selection is one of useful techniques to choose a subset of relevant data in model construction for simplification of models and shorter training times. scikit-learn has some APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this selection is too time-consuming process if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). 
An objective of this ticket is to add new optimizer rules in Spark to filter meaningful training data before feature selection. As a pretty simple example, Spark might be able to filter out columns with low variances (This process is corresponding to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node in the top of an user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. > Plan rewriting rules to filter meaningful training data
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Attachment: (was: fig1.png) > Plan rewriting rules to filter meaningful training data before feature > selections > - > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is one of the useful > techniques for choosing a subset of relevant data in model construction, both to > simplify models and to shorten training times; e.g., scikit-learn has > some APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]). But this > selection is an excessively time-consuming process if training data have a large number > of columns and rows (for example, the number of columns can frequently go > over 1,000 in real business use cases). > An objective of this ticket is to implement plan rewriting rules in Spark > Catalyst to filter meaningful training data before feature selection. We > assume the workflow below, from data extraction to model training; > !fig1.png! > In the example workflow above, one prepares raw training data, R(v1, v2, v3, > v4) in the figure, by joining and projecting input data (R1, R2, and R3) in > various datasources (HDFS, S3, JDBC, ...); then, to choose a relevant subset > (the red box) of the raw data, sampling and feature selection are applied to them. > In real business use cases, it sometimes happens that raw training data have > many meaningless columns for historical reasons (e.g., redundant > schema designs). So, if we could filter out these meaningless data in the > data-extraction phase, we could make both the data extraction > itself and the following feature selection more efficient. 
In the example above, we actually > need not join the relation R3 because all the columns in that relation are > filtered out in feature selection. Also, the join processing should be faster > if we could sample data directly from the input data (R1 and R2). This > optimized workflow is as follows; > > This optimization might be achieved by rewriting a plan tree for data > extraction as follows; > > Since Spark already has a pluggable optimizer interface > (extendedOperatorOptimizationRules) and a framework to collect data > statistics for input data in data sources, the major tasks of this ticket are > to add plan rewriting rules to filter meaningful training data before feature > selections. > As a pretty simple task, Spark might have a rule to filter out columns with > low variances (this process corresponds to `VarianceThreshold` in > scikit-learn) by implicitly adding a `Project` node on top of a user > plan. Then, the Spark optimizer might push down this `Project` node into leaf > nodes (e.g., `LogicalRelation`) and the plan execution could be significantly > faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers > and other OSS functionalities) in this ticket to track them. > References: > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. > [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe > to avoid when learning high-capacity classifiers?, Proceedings of the VLDB > Endowment, Volume 11 Issue 3, Pages 366-379, 2017. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Attachment: fig1.png > Plan rewriting rules to filter meaningful training data before feature > selections > - > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is one of the useful > techniques for choosing a subset of relevant data in model construction for > simplification of models and shorter training times. scikit-learn has some > APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this > selection is an excessively time-consuming process if training data have a large number > of columns (the number can frequently go over 1,000 in business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > meaningful training data before feature selection. As a pretty simple > example, Spark might be able to filter out columns with low variances (this > process corresponds to `VarianceThreshold` in scikit-learn) by > implicitly adding a `Project` node on top of a user plan. Then, the > Spark optimizer might push down this `Project` node into leaf nodes (e.g., > `LogicalRelation`) and the plan execution could be significantly faster. > Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers > and other OSS functionalities) in this ticket to track them. > References: > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. 
> [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe > to avoid when learning high-capacity classifiers?, Proceedings of the VLDB > Endowment, Volume 11 Issue 3, Pages 366-379, 2017. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Summary: Plan rewriting rules to filter meaningful training data before feature selections (was: Plan rewrting rules to filter meaningful training data before feature selections) > Plan rewriting rules to filter meaningful training data before feature > selections > - > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is one of the useful > techniques for choosing a subset of relevant data in model construction for > simplification of models and shorter training times. scikit-learn has some > APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this > selection is an excessively time-consuming process if training data have a large number > of columns (the number can frequently go over 1,000 in business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > meaningful training data before feature selection. As a pretty simple > example, Spark might be able to filter out columns with low variances (this > process corresponds to `VarianceThreshold` in scikit-learn) by > implicitly adding a `Project` node on top of a user plan. Then, the > Spark optimizer might push down this `Project` node into leaf nodes (e.g., > `LogicalRelation`) and the plan execution could be significantly faster. > Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers > and other OSS functionalities) in this ticket to track them. > References: > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. 
Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. > [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe > to avoid when learning high-capacity classifiers?, Proceedings of the VLDB > Endowment, Volume 11 Issue 3, Pages 366-379, 2017. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-198) Fix obsolete documentations for hivemall-on-spark
[ https://issues.apache.org/jira/browse/HIVEMALL-198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-198: -- Summary: Fix obsolete documentations for hivemall-on-spark (was: Fix documentations for hivemall-on-spark) > Fix obsolete documentations for hivemall-on-spark > - > > Key: HIVEMALL-198 > URL: https://issues.apache.org/jira/browse/HIVEMALL-198 > Project: Hivemall > Issue Type: Bug >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > Some of the documentation for hivemall-on-spark is obsolete, so we should fix > it before the next release. > https://hivemall.incubator.apache.org/userguide/spark/getting_started/ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-198) Fix obsolete documentations for hivemall-on-spark
[ https://issues.apache.org/jira/browse/HIVEMALL-198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-198: -- Labels: spark (was: ) > Fix obsolete documentations for hivemall-on-spark > - > > Key: HIVEMALL-198 > URL: https://issues.apache.org/jira/browse/HIVEMALL-198 > Project: Hivemall > Issue Type: Bug >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > Some of the documentation for hivemall-on-spark is obsolete, so we should fix > it before the next release. > https://hivemall.incubator.apache.org/userguide/spark/getting_started/ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVEMALL-198) Fix documentations for hivemall-on-spark
Takeshi Yamamuro created HIVEMALL-198: - Summary: Fix documentations for hivemall-on-spark Key: HIVEMALL-198 URL: https://issues.apache.org/jira/browse/HIVEMALL-198 Project: Hivemall Issue Type: Bug Reporter: Takeshi Yamamuro Assignee: Takeshi Yamamuro Some of the documentation for hivemall-on-spark is obsolete, so we should fix it before the next release. https://hivemall.incubator.apache.org/userguide/spark/getting_started/ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVEMALL-181) Plan rewrting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425080#comment-16425080 ] Takeshi Yamamuro commented on HIVEMALL-181: --- Great work! Next time, please give me the details of the work offline (thanks for the link; I'll check it later myself). Anyway, in this ticket, I'd like to focus on the integration of the Spark optimizer and some of the techniques for feature selection. > Plan rewrting rules to filter meaningful training data before feature > selections > > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is one of the useful > techniques for choosing a subset of relevant data in model construction for > simplification of models and shorter training times. scikit-learn has some > APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this > selection is an excessively time-consuming process if training data have a large number > of columns (the number can frequently go over 1,000 in business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > meaningful training data before feature selection. As a pretty simple > example, Spark might be able to filter out columns with low variances (this > process corresponds to `VarianceThreshold` in scikit-learn) by > implicitly adding a `Project` node on top of a user plan. Then, the > Spark optimizer might push down this `Project` node into leaf nodes (e.g., > `LogicalRelation`) and the plan execution could be significantly faster. > Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers > and other OSS functionalities) in this ticket to track them. 
> References: > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. > [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe > to avoid when learning high-capacity classifiers?, Proceedings of the VLDB > Endowment, Volume 11 Issue 3, Pages 366-379, 2017. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVEMALL-185) Add an optimizer rule to push down a Sample plan node into fact tables
Takeshi Yamamuro created HIVEMALL-185: - Summary: Add an optimizer rule to push down a Sample plan node into fact tables Key: HIVEMALL-185 URL: https://issues.apache.org/jira/browse/HIVEMALL-185 Project: Hivemall Issue Type: Sub-task Reporter: Takeshi Yamamuro Assignee: Takeshi Yamamuro Sampling is a common technique to extract part of the data in joined relations (fact tables and dimension tables) for training data. The optimizer in Spark cannot push down a Sample plan node into larger fact tables because this node is non-deterministic. But, by using RI (referential integrity) constraints, we could push this node down into fact tables in some cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
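[Editor's illustration] A hedged toy sketch of why the pushdown above can be safe: under a key-foreign-key join with referential integrity, every fact row matches exactly one dimension row, so a deterministic per-row sampling predicate commutes with the join. Plain Python with made-up data; a real Sample node is non-deterministic, which is exactly the obstacle the ticket mentions.

```python
# Toy illustration (plain Python, not Catalyst): with a key-foreign-key
# join and referential integrity, every fact row matches exactly one
# dimension row, so a deterministic per-row sampling predicate on the
# fact table commutes with the join.

def keep(row_id, rate=0.5):
    # Deterministic stand-in for Bernoulli sampling (real Sample nodes
    # are non-deterministic, which is what blocks the pushdown today).
    return (row_id * 2654435761) % 100 < rate * 100

fact = [  # (row_id, fk, value)
    (1, "a", 10), (2, "b", 20), (3, "a", 30), (4, "c", 40),
]
dim = {"a": "dim-a", "b": "dim-b", "c": "dim-c"}  # pk -> payload

# Sample after the join (what the current plan effectively does).
joined = [(rid, fk, v, dim[fk]) for rid, fk, v in fact]
sample_after = [row for row in joined if keep(row[0])]

# Push the sample below the join (the proposed rewrite): only the
# surviving fact rows are ever joined against the dimension table.
sample_before = [(rid, fk, v, dim[fk]) for rid, fk, v in fact if keep(rid)]

assert sample_after == sample_before
print(len(sample_after), "rows survive either way")
```

The win is that `sample_before` joins far fewer fact rows, which matters when the fact table dominates the join cost.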
[jira] [Updated] (HIVEMALL-184) Add an optimizer rule to filter out columns by using Mutual Information
[ https://issues.apache.org/jira/browse/HIVEMALL-184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-184: -- Labels: spark (was: ) > Add an optimizer rule to filter out columns by using Mutual Information > --- > > Key: HIVEMALL-184 > URL: https://issues.apache.org/jira/browse/HIVEMALL-184 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > Mutual Information (MI) is an indicator to find and quantify dependencies > between variables, so the indicator is useful for filtering out columns in > feature selection. Nearest-neighbor distances are frequently used to estimate > MI [1], so we could use the distances to compute MI between columns for each > relation when running an ANALYZE command. Then, we could filter out "similar" > columns in the optimizer phase by referring to a new threshold (e.g. > `spark.sql.optimizer.featureSelection.mutualInfoThreshold`). > In another story, we need to consider a light-weight way to update MI when > re-running an ANALYZE command. A recent study [2] proposed a sophisticated > technique to compute MI for dynamic data. > [1] Dafydd Evans, A computationally efficient estimator for mutual > information. > In Proceedings of the Royal Society of London A: Mathematical, Physical > and Engineering Sciences, Vol. 464. The Royal Society, 1203–1215, 2008. > [2] Michael Vollmer et al., On Complexity and Efficiency of Mutual Information > Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
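[Editor's illustration] The ticket cites a nearest-neighbor MI estimator [1] for continuous data; the sketch below is only a simple plug-in (histogram) estimator for two discrete columns, to illustrate the quantity that a threshold such as the proposed `spark.sql.optimizer.featureSelection.mutualInfoThreshold` would gate on:

```python
# Simple plug-in (histogram) MI estimator for two discrete columns.
# This is NOT the nearest-neighbor estimator referenced in [1]; it only
# illustrates the mutual-information quantity itself.
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Plug-in estimate of MI (in nats) between two equal-length columns."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n
        mi += p_xy * log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Perfectly dependent columns: MI equals the entropy of the column, so a
# duplicate column would be flagged as "similar" and filtered out.
a = [0, 0, 1, 1]
print(mutual_information(a, a))  # log(2) ~= 0.693

# Independent columns: MI is 0, so the pair shares no signal.
b = [0, 1, 0, 1]
print(mutual_information(a, b))  # 0.0
```

An ANALYZE-time implementation would compute such pairwise scores once per relation and let the optimizer consult them, rather than recomputing MI per query.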
[jira] [Updated] (HIVEMALL-184) Add an optimizer rule to filter out columns by using Mutual Information
[ https://issues.apache.org/jira/browse/HIVEMALL-184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-184: -- Description: Mutual Information (MI) is an indicator to find and quantify dependencies between variables, so the indicator is useful for filtering out columns in feature selection. Nearest-neighbor distances are frequently used to estimate MI [1], so we could use the distances to compute MI between columns for each relation when running an ANALYZE command. Then, we could filter out "similar" columns in the optimizer phase by referring to a new threshold (e.g. `spark.sql.optimizer.featureSelection.mutualInfoThreshold`). In another story, we need to consider a light-weight way to update MI when re-running an ANALYZE command. A recent study [2] proposed a sophisticated technique to compute MI for dynamic data. [1] Dafydd Evans, A computationally efficient estimator for mutual information. In Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, Vol. 464. The Royal Society, 1203–1215, 2008. [2] Michael Vollmer et al., On Complexity and Efficiency of Mutual Information Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018. was: Mutual Information (MI) is an indicator to find and quantify dependencies between variables, so the indicator is useful to filter out columns in feature selection. Nearest-neighbor distances are frequently used to estimate MI [1], so we could use the distances to compute MI between columns for each relation when running an ANALYZE command. Then, we could filter out "similar" columns in the optimizer phase by referring a new threshold (e.g. `spark.sql.optimizer.featureSelection.mutualInfoThreshold`). In another story, we need to consider a light-weight way to update MI when re-running an ANALYZE command. A recent study [2] proposed a sophisticated technique to compute MI for dynamic data. 
[1] Dafydd Evans, A computationally efficient estimator for mutual information. In Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, Vol. 464. The Royal Society, 1203–1215, 2008. [2] Michael Vollmer et al., On Complexity and Efficiency of Mutual Information Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018. > Add an optimizer rule to filter out columns by using Mutual Information > --- > > Key: HIVEMALL-184 > URL: https://issues.apache.org/jira/browse/HIVEMALL-184 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > Mutual Information (MI) is an indicator to find and quantify dependencies > between variables, so the indicator is useful for filtering out columns in > feature selection. Nearest-neighbor distances are frequently used to estimate > MI [1], so we could use the distances to compute MI between columns for each > relation when running an ANALYZE command. Then, we could filter out "similar" > columns in the optimizer phase by referring to a new threshold (e.g. > `spark.sql.optimizer.featureSelection.mutualInfoThreshold`). > In another story, we need to consider a light-weight way to update MI when > re-running an ANALYZE command. A recent study [2] proposed a sophisticated > technique to compute MI for dynamic data. > [1] Dafydd Evans, A computationally efficient estimator for mutual > information. In Proceedings of the Royal Society of London A: Mathematical, > Physical > and Engineering Sciences, Vol. 464. The Royal Society, 1203–1215, 2008. > [2] Michael Vollmer et al., On Complexity and Efficiency of Mutual > Information Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVEMALL-184) Add an optimizer rule to filter out columns by using Mutual Information
Takeshi Yamamuro created HIVEMALL-184: - Summary: Add an optimizer rule to filter out columns by using Mutual Information Key: HIVEMALL-184 URL: https://issues.apache.org/jira/browse/HIVEMALL-184 Project: Hivemall Issue Type: Sub-task Reporter: Takeshi Yamamuro Assignee: Takeshi Yamamuro Mutual Information (MI) is an indicator to find and quantify dependencies between variables, so the indicator is useful for filtering out columns in feature selection. Nearest-neighbor distances are frequently used to estimate MI [1], so we could use the distances to compute MI between columns for each relation when running an ANALYZE command. Then, we could filter out "similar" columns in the optimizer phase by referring to a new threshold (e.g. `spark.sql.optimizer.featureSelection.mutualInfoThreshold`). In another story, we need to consider a light-weight way to update MI when re-running an ANALYZE command. A recent study [2] proposed a sophisticated technique to compute MI for dynamic data. [1] Dafydd Evans, A computationally efficient estimator for mutual information. In Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, Vol. 464. The Royal Society, 1203–1215, 2008. [2] Michael Vollmer et al., On Complexity and Efficiency of Mutual Information Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-181) Plan rewrting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Summary: Plan rewrting rules to filter meaningful training data before feature selections (was: Plan rewrting rules to filter meaningful training data before future selections) > Plan rewrting rules to filter meaningful training data before feature > selections > > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is one of the useful > techniques for choosing a subset of relevant data in model construction for > simplification of models and shorter training times. scikit-learn has some > APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this > selection is an excessively time-consuming process if training data have a large number > of columns (the number can frequently go over 1,000 in business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > meaningful training data before feature selection. As a pretty simple > example, Spark might be able to filter out columns with low variances (this > process corresponds to `VarianceThreshold` in scikit-learn) by > implicitly adding a `Project` node on top of a user plan. Then, the > Spark optimizer might push down this `Project` node into leaf nodes (e.g., > `LogicalRelation`) and the plan execution could be significantly faster. > Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers > and other OSS functionalities) in this ticket to track them. > References: > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. 
Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. > [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe > to avoid when learning high-capacity classifiers?, Proceedings of the VLDB > Endowment, Volume 11 Issue 3, Pages 366-379, 2017. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
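The low-variance pruning the ticket describes can be illustrated outside Spark. Below is a minimal Python sketch of the idea: compute per-column variances and keep only columns above a threshold, which is what scikit-learn's `VarianceThreshold` does and what an implicitly inserted `Project` node would express in a Spark plan. Function names and the threshold value are illustrative, not Hivemall or Spark APIs.

```python
# Sketch of the low-variance column filter described in HIVEMALL-181
# (analogous to scikit-learn's VarianceThreshold). Names are illustrative.

def column_variances(rows):
    """Population variance of each column in a row-major table."""
    n = len(rows)
    ncols = len(rows[0])
    means = [sum(r[c] for r in rows) / n for c in range(ncols)]
    return [sum((r[c] - means[c]) ** 2 for r in rows) / n
            for c in range(ncols)]

def prune_low_variance(rows, threshold=0.0):
    """Keep only columns whose variance exceeds `threshold`.

    This mimics the `Project` node the optimizer rule would insert on top
    of the user plan: downstream feature selection sees fewer columns.
    """
    variances = column_variances(rows)
    keep = [c for c, v in enumerate(variances) if v > threshold]
    return keep, [[r[c] for c in keep] for r in rows]

rows = [
    [1.0, 5.0, 0.0],
    [2.0, 5.0, 0.0],
    [3.0, 5.0, 1.0],
]
# Column 1 is constant (zero variance), so it is dropped.
keep, pruned = prune_low_variance(rows, threshold=0.05)
```

In Spark, the payoff comes from the optimizer pushing the resulting projection down to the scan (e.g., `LogicalRelation`), so the pruned columns are never materialized at all.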
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before future selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Description: In machine learning and statistics, feature selection is one of the useful techniques to choose a subset of relevant data in model construction for simplification of models and shorter training times. scikit-learn has some APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this selection is too time-consuming if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter meaningful training data before feature selection. As a pretty simple example, Spark might be able to filter out columns with low variances (this process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node at the top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. was: In machine learning and statistics, feature selection is one of the useful techniques to choose a subset of relevant data in model construction for simplification of models and shorter training times. 
scikit-learn has some APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this selection is too time-consuming if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter meaningful training data before feature selection. As a simple example, Spark might be able to filter out columns with low variances (this process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node at the top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. > Plan rewriting rules to filter meaningful training data before future > selections > --- > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is one of the useful > techniques to choose a subset of relevant data in model construction for > simplification of models and shorter training times. 
scikit-learn has some > APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this > selection is too time-consuming if training data have a large number > of columns (the number could frequently go over 1,000 in business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > meaningful training data before feature selection. As a pretty simple > example, Spark might be able to filter out columns with low variances (this > process corresponds to `VarianceThreshold` in scikit-learn) by > implicitly adding a `Project` node at the top of a user plan. Then, the > Spark optimizer might push down this `Project` node into leaf nodes (e.g., > `LogicalRelation`) and the plan execution could be significantly faster. > Moreover, more sophisticated techniques
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before future selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Description: In machine learning and statistics, feature selection is one of the useful techniques to choose a subset of relevant data in model construction for simplification of models and shorter training times. scikit-learn has some APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this selection is too time-consuming if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter meaningful training data before feature selection. As a simple example, Spark might be able to filter out columns with low variances (this process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node at the top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. was: In machine learning and statistics, feature selection is one of the useful techniques to choose a subset of relevant data in model construction for simplification of models and shorter training times. 
scikit-learn has some APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this selection is too time-consuming if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter out meaningless columns before feature selection. As a simple example, Spark might be able to filter out columns with low variances (this process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node at the top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. > Plan rewriting rules to filter meaningful training data before future > selections > --- > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is one of the useful > techniques to choose a subset of relevant data in model construction for > simplification of models and shorter training times. 
scikit-learn has some > APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this > selection is too time-consuming if training data have a large number > of columns (the number could frequently go over 1,000 in business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > meaningful training data before feature selection. As a simple example, Spark > might be able to filter out columns with low variances (this process > corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a > `Project` node at the top of a user plan. > Then, the Spark optimizer might push down this `Project` node into leaf > nodes (e.g., `LogicalRelation`) and the plan execution could be significantly > faster. Moreover, more sophisticated techniques have been proposed in
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter out meaningful training data before future selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Description: In machine learning and statistics, feature selection is one of the useful techniques to choose a subset of relevant data in model construction for simplification of models and shorter training times. scikit-learn has some APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this selection is too time-consuming if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter out meaningless columns before feature selection. As a simple example, Spark might be able to filter out columns with low variances (this process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node at the top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. was: In machine learning and statistics, feature selection is a useful technique to choose a subset of relevant features in model construction for simplification of models and shorter training times. 
scikit-learn has some APIs for feature selection (http://scikit-learn.org/stable/modules/feature_selection.html), but this selection is too time-consuming if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter out meaningless columns before feature selection. As a simple example, Spark might be able to filter out columns with low variances (this process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node at the top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. > Plan rewriting rules to filter out meaningful training data before future > selections > --- > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is one of the useful > techniques to choose a subset of relevant data in model construction for > simplification of models and shorter training times. 
scikit-learn has some > APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this > selection is too time-consuming if training data have a large number > of columns (the number could frequently go over 1,000 in business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > out meaningless columns before feature selection. As a simple example, Spark > might be able to filter out columns with low variances (this process > corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a > `Project` node at the top of a user plan. > Then, the Spark optimizer might push down this `Project` node into leaf > nodes (e.g., `LogicalRelation`) and the plan execution could be significantly > faster. Moreover, more sophisticated techniques have been proposed in
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before future selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Summary: Plan rewriting rules to filter meaningful training data before future selections (was: Plan rewriting rules to filter out meaningful training data before future selections) > Plan rewriting rules to filter meaningful training data before future > selections > --- > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is one of the useful > techniques to choose a subset of relevant data in model construction for > simplification of models and shorter training times. scikit-learn has some > APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this > selection is too time-consuming if training data have a large number > of columns (the number could frequently go over 1,000 in business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > out meaningless columns before feature selection. As a simple example, Spark > might be able to filter out columns with low variances (this process > corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a > `Project` node at the top of a user plan. > Then, the Spark optimizer might push down this `Project` node into leaf > nodes (e.g., `LogicalRelation`) and the plan execution could be significantly > faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers > and other OSS functionalities) in this ticket to track them. > References: > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. 
Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. > [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe > to avoid when learning high-capacity classifiers?, Proceedings of the VLDB > Endowment, Volume 11 Issue 3, Pages 366-379, 2017. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter out meaningful training data before future selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Summary: Plan rewriting rules to filter out meaningful training data before future selections (was: Plan rewriting rules to filter out meaningless columns before future selections) > Plan rewriting rules to filter out meaningful training data before future > selections > --- > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is a useful technique to > choose a subset of relevant features in model construction for simplification > of models and shorter training times. scikit-learn has some APIs for feature > selection (http://scikit-learn.org/stable/modules/feature_selection.html), > but this selection is too time-consuming if training data have a > large number of columns (the number could frequently go over 1,000 in > business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > out meaningless columns before feature selection. As a simple example, Spark > might be able to filter out columns with low variances (this process > corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a > `Project` node at the top of a user plan. > Then, the Spark optimizer might push down this `Project` node into leaf nodes > (e.g., `LogicalRelation`) and the plan execution could be significantly > faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers > and other OSS functionalities) in this ticket to track them. > References: > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. 
Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. > [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to > avoid when learning high-capacity classifiers?, Proceedings of the VLDB > Endowment, Volume 11 Issue 3, Pages 366-379, 2017. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVEMALL-183) Add an optimizer rule to prune joins without significantly reducing ML accuracy
[ https://issues.apache.org/jira/browse/HIVEMALL-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421824#comment-16421824 ] Takeshi Yamamuro commented on HIVEMALL-183: --- Spark currently does not support FK constraints, so we need to track a Spark Jira ticket to support RIC functionalities in https://issues.apache.org/jira/browse/SPARK-19842 > Add an optimizer rule to prune joins without significantly reducing ML > accuracy > > > Key: HIVEMALL-183 > URL: https://issues.apache.org/jira/browse/HIVEMALL-183 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > An objective of this ticket is to implement the proposed technique in the > paper [1] below; > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
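The join-pruning idea in Kumar et al. [1] can be caricatured as a tuple-ratio heuristic: when the training table has many more rows than the dimension table its foreign key references, the key alone tends to carry the dimension table's signal, so the join can be skipped. The sketch below is a loose, hypothetical illustration of that decision shape, not the paper's exact decision rule; the function name and threshold are illustrative only.

```python
# Hypothetical sketch of a join-avoidance check, loosely inspired by the
# tuple-ratio heuristic in Kumar et al. (SIGMOD 2016). The threshold is
# an assumed placeholder, not the paper's derived bound.

def can_avoid_join(n_fact_rows, n_dim_rows, tuple_ratio_threshold=20.0):
    """Return True if joining the dimension table is unlikely to help accuracy.

    n_fact_rows: rows in the table holding the foreign key (training data).
    n_dim_rows:  rows in the referenced dimension table.
    """
    return (n_fact_rows / n_dim_rows) >= tuple_ratio_threshold

# e.g. 1,000,000 training rows referencing a 10,000-row dimension table:
safe = can_avoid_join(1_000_000, 10_000)
```

An optimizer rule built on such a check would prune the join from the logical plan and keep only the foreign-key column as a feature; the referential-integrity metadata it needs is what SPARK-19842 tracks.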
[jira] [Updated] (HIVEMALL-183) Add an optimizer rule to prune joins without significantly reducing ML accuracy
[ https://issues.apache.org/jira/browse/HIVEMALL-183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-183: -- Description: An objective of this ticket is to implement the proposed technique in a paper [1] below; [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. was: An objective of this ticket is to implement the proposed technique in a paper below; without significantly reducing ML accuracy > Add an optimizer rule to prune joins without significantly reducing ML > accuracy > > > Key: HIVEMALL-183 > URL: https://issues.apache.org/jira/browse/HIVEMALL-183 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > An objective of this ticket is to implement the proposed technique in a paper > [1] below; > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-183) Add an optimizer rule to prune joins without significantly reducing ML accuracy
[ https://issues.apache.org/jira/browse/HIVEMALL-183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-183: -- Description: An objective of this ticket is to implement the proposed technique in the paper [1] below; [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. was: An objective of this ticket to implement the proposed technique in a paper [1] below; [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. > Add an optimizer rule to prune joins without significantly reducing ML > accuracy > > > Key: HIVEMALL-183 > URL: https://issues.apache.org/jira/browse/HIVEMALL-183 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > An objective of this ticket is to implement the proposed technique in the > paper [1] below; > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-183) Add an optimizer rule to prune joins without significantly reducing ML accuracy
[ https://issues.apache.org/jira/browse/HIVEMALL-183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-183: -- Labels: spark (was: ) > Add an optimizer rule to prune joins without significantly reducing ML > accuracy > > > Key: HIVEMALL-183 > URL: https://issues.apache.org/jira/browse/HIVEMALL-183 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVEMALL-183) Add an optimizer rule to prune joins without significantly reducing ML accuracy
Takeshi Yamamuro created HIVEMALL-183: - Summary: Add an optimizer rule to prune joins without significantly reducing ML accuracy Key: HIVEMALL-183 URL: https://issues.apache.org/jira/browse/HIVEMALL-183 Project: Hivemall Issue Type: Sub-task Reporter: Takeshi Yamamuro -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-182) Add an optimizer rule to filter out columns with low variances
[ https://issues.apache.org/jira/browse/HIVEMALL-182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-182: -- Labels: spark (was: ) > Add an optimizer rule to filter out columns with low variances > -- > > Key: HIVEMALL-182 > URL: https://issues.apache.org/jira/browse/HIVEMALL-182 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (HIVEMALL-182) Add an optimizer rule to filter out columns with low variances
[ https://issues.apache.org/jira/browse/HIVEMALL-182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro reassigned HIVEMALL-182: - Assignee: Takeshi Yamamuro > Add an optimizer rule to filter out columns with low variances > -- > > Key: HIVEMALL-182 > URL: https://issues.apache.org/jira/browse/HIVEMALL-182 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter out meaningless columns before future selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Description: In machine learning and statistics, feature selection is a useful technique to choose a subset of relevant features in model construction for simplification of models and shorter training times. scikit-learn has some APIs for feature selection (http://scikit-learn.org/stable/modules/feature_selection.html), but this selection is too time-consuming if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter out meaningless columns before feature selection. As a simple example, Spark might be able to filter out columns with low variances (this process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node at the top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. was: In machine learning and statistics, feature selection is a useful technique to choose a subset of relevant features in model construction for simplification of models and shorter training times. 
scikit-learn has some APIs for feature selection (http://scikit-learn.org/stable/modules/feature_selection.html), but this selection is too time-consuming if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter out meaningless columns before feature selection. As a simple example, Spark might be able to filter out columns with low variances (this process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node at the top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. > Plan rewriting rules to filter out meaningless columns before future selections > -- > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is a useful technique to > choose a subset of relevant features in model construction for simplification > of models and shorter training times. 
scikit-learn has some APIs for feature > selection (http://scikit-learn.org/stable/modules/feature_selection.html), > but this selection is too time-consuming if training data have a > large number of columns (the number could frequently go over 1,000 in > business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > out meaningless columns before feature selection. As a simple example, Spark > might be able to filter out columns with low variances (this process > corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a > `Project` node at the top of a user plan. > Then, the Spark optimizer might push down this `Project` node into leaf nodes > (e.g., `LogicalRelation`) and the plan execution could be significantly > faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter out meaningless columns before future selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Labels: spark (was: ) > Plan rewriting rules to filter out meaningless columns before future selections > -- > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is a useful technique to > choose a subset of relevant features > in model construction for simplification of models and shorter training times. > scikit-learn has some APIs for feature selection > (http://scikit-learn.org/stable/modules/feature_selection.html), but > this selection is too time-consuming if training data have a large > number of columns > (the number could frequently go over 1,000 in business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > out meaningless columns before feature selection. > As a simple example, Spark might be able to filter out columns with low > variances (this process corresponds to `VarianceThreshold` in > scikit-learn) > by implicitly adding a `Project` node at the top of a user plan. > Then, the Spark optimizer might push down this `Project` node into leaf nodes > (e.g., `LogicalRelation`) and > the plan execution could be significantly faster. > Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers > and other OSS functionalities) > in this ticket to track them. > References: > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. 
> [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to > avoid when learning high-capacity classifiers?, Proceedings of the VLDB > Endowment, Volume 11 Issue 3, Pages 366-379, 2017. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
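The HIVEMALL-181 description above proposes pruning low-variance columns before the expensive feature-selection step, in the spirit of scikit-learn's `VarianceThreshold`. The following pure-Python sketch illustrates that filtering rule on a list-of-dicts stand-in for a table; the function names and data layout are hypothetical, not Hivemall's or Spark's API.

```python
# Illustration of the pruning idea in HIVEMALL-181 (hypothetical helpers):
# drop columns whose variance is at or below a threshold before running a
# more expensive feature-selection step, like scikit-learn's VarianceThreshold.

def column_variance(values):
    """Population variance of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def prune_low_variance(rows, threshold=0.0):
    """Return the column names whose variance exceeds `threshold`.

    `rows` is a list of dicts sharing the same keys, standing in for the
    columns a `Project` node would keep in a Spark plan. In practice a small
    positive threshold is safer against floating-point noise.
    """
    kept = []
    for col in rows[0].keys():
        if column_variance([r[col] for r in rows]) > threshold:
            kept.append(col)
    return kept

rows = [
    {"f1": 1.0, "f2": 5.0, "f3": 2.0},
    {"f1": 1.0, "f2": 3.0, "f3": 2.0},
    {"f1": 1.0, "f2": 4.0, "f3": 2.0},
]
print(prune_low_variance(rows))  # f1 and f3 are constant, so only f2 survives
```

In the Spark setting sketched in the ticket, the surviving column list would become an implicit `Project` node that the optimizer can then push down toward `LogicalRelation` leaves.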
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter out meaningless columns before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Description: In machine learning and statistics, feature selection is a useful technique to choose a subset of relevant features in model construction for simplification of models and shorter training times. scikit-learn has some APIs for feature selection (http://scikit-learn.org/stable/modules/feature_selection.html), but this selection is too time-consuming a process if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter out meaningless columns before feature selection. As a simple example, Spark might be able to filter out columns with low variances (This process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node on top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. was: In machine learning and statistics, feature selection is a useful technique to choose a subset of relevant features in model construction for simplification of models and shorter training times. 
scikit-learn has some APIs for feature selection (http://scikit-learn.org/stable/modules/feature_selection.html), but this selection is too time-consuming a process if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter out meaningless columns before feature selection. As a simple example, Spark might be able to filter out columns with low variances (This process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node on top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. > Plan rewriting rules to filter out meaningless columns before feature selections > -- > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is a useful technique to > choose a subset of relevant features in model construction for simplification > of models and shorter training times. 
scikit-learn has some APIs for feature > selection (http://scikit-learn.org/stable/modules/feature_selection.html), > but this selection is too time-consuming a process if training data have a > large number of columns (the number could frequently go over 1,000 in > business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > out meaningless columns before feature selection. > As a simple example, Spark might be able to filter out columns with low > variances (This process corresponds to `VarianceThreshold` in > scikit-learn) > by implicitly adding a `Project` node on top of a user plan. > Then, the Spark optimizer might push down this `Project` node into leaf nodes > (e.g., `LogicalRelation`) and > the plan execution could be significantly faster. > Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them.
[jira] [Created] (HIVEMALL-181) Plan rewriting rules to filter out meaningless columns before feature selections
Takeshi Yamamuro created HIVEMALL-181: - Summary: Plan rewriting rules to filter out meaningless columns before feature selections Key: HIVEMALL-181 URL: https://issues.apache.org/jira/browse/HIVEMALL-181 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro Assignee: Takeshi Yamamuro In machine learning and statistics, feature selection is a useful technique to choose a subset of relevant features in model construction for simplification of models and shorter training times. scikit-learn has some APIs for feature selection (http://scikit-learn.org/stable/modules/feature_selection.html), but this selection is too time-consuming a process if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter out meaningless columns before feature selection. As a simple example, Spark might be able to filter out columns with low variances (This process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node on top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-180) Drop the Spark-2.0 support
[ https://issues.apache.org/jira/browse/HIVEMALL-180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-180: -- Issue Type: Improvement (was: Sub-task) Parent: (was: HIVEMALL-152) > Drop the Spark-2.0 support > -- > > Key: HIVEMALL-180 > URL: https://issues.apache.org/jira/browse/HIVEMALL-180 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > Fix For: 0.5.2 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-180) Drop the Spark-2.0 support
[ https://issues.apache.org/jira/browse/HIVEMALL-180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-180: -- Labels: spark (was: ) > Drop the Spark-2.0 support > -- > > Key: HIVEMALL-180 > URL: https://issues.apache.org/jira/browse/HIVEMALL-180 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Priority: Major > Labels: spark > Fix For: 0.5.2 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (HIVEMALL-180) Drop the Spark-2.0 support
[ https://issues.apache.org/jira/browse/HIVEMALL-180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro reassigned HIVEMALL-180: - Assignee: Takeshi Yamamuro > Drop the Spark-2.0 support > -- > > Key: HIVEMALL-180 > URL: https://issues.apache.org/jira/browse/HIVEMALL-180 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > Fix For: 0.5.2 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVEMALL-180) Drop the Spark-2.0 support
Takeshi Yamamuro created HIVEMALL-180: - Summary: Drop the Spark-2.0 support Key: HIVEMALL-180 URL: https://issues.apache.org/jira/browse/HIVEMALL-180 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro Fix For: 0.5.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-179) Support Spark 2.3
[ https://issues.apache.org/jira/browse/HIVEMALL-179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-179: -- Labels: spark (was: ) > Support Spark 2.3 > - > > Key: HIVEMALL-179 > URL: https://issues.apache.org/jira/browse/HIVEMALL-179 > Project: Hivemall > Issue Type: Improvement >Reporter: Makoto Yui >Assignee: Takeshi Yamamuro >Priority: Blocker > Labels: spark > Fix For: 0.5.2 > > > Support Spark 2.3 (while deprecating old spark support?) > https://spark.apache.org/releases/spark-release-2-3-0.html -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVEMALL-136) Support train_classifier and train_regressor for Spark
Takeshi Yamamuro created HIVEMALL-136: - Summary: Support train_classifier and train_regressor for Spark Key: HIVEMALL-136 URL: https://issues.apache.org/jira/browse/HIVEMALL-136 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro This ticket is to support GeneralRegressorUDTF and GeneralClassifierUDTF. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVEMALL-134) Create Standalone API for Scala/Java
[ https://issues.apache.org/jira/browse/HIVEMALL-134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086880#comment-16086880 ] Takeshi Yamamuro commented on HIVEMALL-134: --- You mean we run HiveUDF w/o hive? > Create Standalone API for Scala/Java > --- > > Key: HIVEMALL-134 > URL: https://issues.apache.org/jira/browse/HIVEMALL-134 > Project: Hivemall > Issue Type: Wish >Reporter: Makoto Yui > > A standalone API of Hivemall would be useful for a standalone application with > enough local memory. > A good example of a standalone API is Smile's Scala API. > https://haifengl.github.io/smile/ > https://github.com/haifengl/smile/tree/master/scala/src/main/scala/smile -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Closed] (HIVEMALL-116) Add documentation about SQL in Spark
[ https://issues.apache.org/jira/browse/HIVEMALL-116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro closed HIVEMALL-116. - Resolution: Fixed > Add documentation about SQL in Spark > > > Key: HIVEMALL-116 > URL: https://issues.apache.org/jira/browse/HIVEMALL-116 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro > Labels: documentation, spark > > We currently have documentation about DataFrame in Spark. So, we need to add > documentation for SQL in Spark. > https://hivemall.incubator.apache.org/userguide/spark/binaryclass/a9a_df.html > https://hivemall.incubator.apache.org/userguide/spark/regression/e2006_df.html -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVEMALL-133) Support spark-v2.2 in the hivemall-spark module
Takeshi Yamamuro created HIVEMALL-133: - Summary: Support spark-v2.2 in the hivemall-spark module Key: HIVEMALL-133 URL: https://issues.apache.org/jira/browse/HIVEMALL-133 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro Since Spark-v2.2 is available now, we should support it in the /spark module. https://databricks.com/blog/2017/07/11/introducing-apache-spark-2-2.html?utm_campaign=Engineering%20Blog_content=57373960_medium=social_source=twitter -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HIVEMALL-133) Support spark-v2.2 in the hivemall-spark module
[ https://issues.apache.org/jira/browse/HIVEMALL-133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-133: -- Labels: spark (was: ) > Support spark-v2.2 in the hivemall-spark module > > > Key: HIVEMALL-133 > URL: https://issues.apache.org/jira/browse/HIVEMALL-133 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro > Labels: spark > > Since Spark-v2.2 is available now, we should support it in the /spark module. > https://databricks.com/blog/2017/07/11/introducing-apache-spark-2-2.html?utm_campaign=Engineering%20Blog_content=57373960_medium=social_source=twitter -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVEMALL-129) Support wrapper implementation for python in pyspark
Takeshi Yamamuro created HIVEMALL-129: - Summary: Support wrapper implementation for python in pyspark Key: HIVEMALL-129 URL: https://issues.apache.org/jira/browse/HIVEMALL-129 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro Priority: Minor The master only supports a wrapper implementation for Scala, but most users use pyspark in Spark. So, it might help to implement the wrapper for python. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HIVEMALL-129) Support wrapper implementation for python in pyspark
[ https://issues.apache.org/jira/browse/HIVEMALL-129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-129: -- Labels: spark (was: ) > Support wrapper implementation for python in pyspark > > > Key: HIVEMALL-129 > URL: https://issues.apache.org/jira/browse/HIVEMALL-129 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Priority: Minor > Labels: spark > > The master only supports a wrapper implementation for Scala, but most users use > pyspark in Spark. So, it might help to implement the wrapper for python. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HIVEMALL-117) Add hivemall in SparkPackages
[ https://issues.apache.org/jira/browse/HIVEMALL-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-117: -- Labels: spark (was: ) > Add hivemall in SparkPackages > - > > Key: HIVEMALL-117 > URL: https://issues.apache.org/jira/browse/HIVEMALL-117 > Project: Hivemall > Issue Type: Bug >Reporter: Takeshi Yamamuro > Labels: spark > > We might add hivemall in SparkPackages after it is released in Apache: > https://spark-packages.org/ -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVEMALL-117) Add hivemall in SparkPackages
Takeshi Yamamuro created HIVEMALL-117: - Summary: Add hivemall in SparkPackages Key: HIVEMALL-117 URL: https://issues.apache.org/jira/browse/HIVEMALL-117 Project: Hivemall Issue Type: Bug Reporter: Takeshi Yamamuro We might add hivemall in SparkPackages after it is released in Apache: https://spark-packages.org/ -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HIVEMALL-116) Add documentation about SQL in Spark
[ https://issues.apache.org/jira/browse/HIVEMALL-116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-116: -- Description: We currently have documentation about DataFrame in Spark. So, we need to add documentation for SQL in Spark. https://hivemall.incubator.apache.org/userguide/spark/binaryclass/a9a_df.html https://hivemall.incubator.apache.org/userguide/spark/regression/e2006_df.html was: We currently have documentation about DataFrame in Spark. So, it helps to add documentation for SQL in Spark. https://hivemall.incubator.apache.org/userguide/spark/binaryclass/a9a_df.html https://hivemall.incubator.apache.org/userguide/spark/regression/e2006_df.html > Add documentation about SQL in Spark > > > Key: HIVEMALL-116 > URL: https://issues.apache.org/jira/browse/HIVEMALL-116 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro > Labels: documentation, spark > > We currently have documentation about DataFrame in Spark. So, we need to add > documentation for SQL in Spark. > https://hivemall.incubator.apache.org/userguide/spark/binaryclass/a9a_df.html > https://hivemall.incubator.apache.org/userguide/spark/regression/e2006_df.html -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HIVEMALL-116) Add documentation about SQL in Spark
[ https://issues.apache.org/jira/browse/HIVEMALL-116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-116: -- Labels: documentation spark (was: ) > Add documentation about SQL in Spark > > > Key: HIVEMALL-116 > URL: https://issues.apache.org/jira/browse/HIVEMALL-116 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro > Labels: documentation, spark > > We currently have documentation about DataFrame in Spark. So, it helps to add > documentation for SQL in Spark. > https://hivemall.incubator.apache.org/userguide/spark/binaryclass/a9a_df.html > https://hivemall.incubator.apache.org/userguide/spark/regression/e2006_df.html -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVEMALL-116) Add documentation about SQL in Spark
Takeshi Yamamuro created HIVEMALL-116: - Summary: Add documentation about SQL in Spark Key: HIVEMALL-116 URL: https://issues.apache.org/jira/browse/HIVEMALL-116 Project: Hivemall Issue Type: Sub-task Reporter: Takeshi Yamamuro We currently have documentation about DataFrame in Spark. So, it helps to add documentation for SQL in Spark. https://hivemall.incubator.apache.org/userguide/spark/binaryclass/a9a_df.html https://hivemall.incubator.apache.org/userguide/spark/regression/e2006_df.html -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HIVEMALL-104) Support deterministic sampling in HivemallOps
[ https://issues.apache.org/jira/browse/HIVEMALL-104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-104: -- Labels: spark (was: ) > Support deterministic sampling in HivemallOps > - > > Key: HIVEMALL-104 > URL: https://issues.apache.org/jira/browse/HIVEMALL-104 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro > Labels: spark > > This feature seems to be beneficial in terms of plan optimization: > https://issues.apache.org/jira/browse/SPARK-14166 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVEMALL-104) Support deterministic sampling in HivemallOps
Takeshi Yamamuro created HIVEMALL-104: - Summary: Support deterministic sampling in HivemallOps Key: HIVEMALL-104 URL: https://issues.apache.org/jira/browse/HIVEMALL-104 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro This feature seems to be beneficial in terms of plan optimization: https://issues.apache.org/jira/browse/SPARK-14166 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVEMALL-103) Upgrading spark-v2.1.0 to v2.1.1
Takeshi Yamamuro created HIVEMALL-103: - Summary: Upgrading spark-v2.1.0 to v2.1.1 Key: HIVEMALL-103 URL: https://issues.apache.org/jira/browse/HIVEMALL-103 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro v2.1.1 has been released: https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (HIVEMALL-103) Upgrading spark-v2.1.0 to v2.1.1
[ https://issues.apache.org/jira/browse/HIVEMALL-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-103: -- Labels: spark (was: ) > Upgrading spark-v2.1.0 to v2.1.1 > > > Key: HIVEMALL-103 > URL: https://issues.apache.org/jira/browse/HIVEMALL-103 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro > Labels: spark > > v2.1.1 has been released: > https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (HIVEMALL-102) Support upcoming Spark v2.2.0
[ https://issues.apache.org/jira/browse/HIVEMALL-102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-102: -- Labels: spark (was: ) > Support upcoming Spark v2.2.0 > - > > Key: HIVEMALL-102 > URL: https://issues.apache.org/jira/browse/HIVEMALL-102 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro > Labels: spark > > The Spark community is currently voting on a v2.2 release: > http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Apache-Spark-2-2-0-RC2-td21497.html -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVEMALL-102) Support upcoming Spark v2.2.0
Takeshi Yamamuro created HIVEMALL-102: - Summary: Support upcoming Spark v2.2.0 Key: HIVEMALL-102 URL: https://issues.apache.org/jira/browse/HIVEMALL-102 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro The Spark community is currently voting on a v2.2 release: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Apache-Spark-2-2-0-RC2-td21497.html -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (HIVEMALL-99) Cross-compilation of XGBoost using Docker
[ https://issues.apache.org/jira/browse/HIVEMALL-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15986583#comment-15986583 ] Takeshi Yamamuro commented on HIVEMALL-99: -- go for it! > Cross-compilation of XGBoost using Docker > - > > Key: HIVEMALL-99 > URL: https://issues.apache.org/jira/browse/HIVEMALL-99 > Project: Hivemall > Issue Type: Improvement >Reporter: Makoto Yui >Assignee: ITO Ryuichi >Priority: Minor > > hivemall-xgboost jar should include native libraries such as x86-64 and else. > (cc: [~maropu], [~amaya]) > We can use dockcross [1] following the way in Xerial [2]. > [1] https://github.com/dockcross/dockcross > [2] https://github.com/xerial/snappy-java/tree/master/docker -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (HIVEMALL-44) Support Top-K joins for DataFrame/Spark
[ https://issues.apache.org/jira/browse/HIVEMALL-44?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved HIVEMALL-44. -- Resolution: Fixed Assignee: Takeshi Yamamuro > Support Top-K joins for DataFrame/Spark > --- > > Key: HIVEMALL-44 > URL: https://issues.apache.org/jira/browse/HIVEMALL-44 > Project: Hivemall > Issue Type: New Feature >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Blocker > Labels: Spark > > In Hivemall, `each_top_k` is useful for practical use cases. On the other > hand, there are some cases where we need to join tables and then compute Top-K > entries. You know we can compute this query by using regular joins + > `each_top_k`. However, we have space to improve this query more; that is, we > compute Top-K entries while processing joins. This optimization avoids a > substantial amount of I/O for joins. > An example query is as follows; > {code} > val inputDf = Seq( > ("user1", 1, 0.3, 0.5), > ("user2", 2, 0.1, 0.1), > ("user3", 3, 0.8, 0.0), > ("user4", 1, 0.9, 0.9), > ("user5", 3, 0.7, 0.2), > ("user6", 1, 0.5, 0.4), > ("user7", 2, 0.6, 0.8) > ).toDF("userId", "group", "x", "y") > val masterDf = Seq( > (1, "pos1-1", 0.5, 0.1), > (1, "pos1-2", 0.0, 0.0), > (1, "pos1-3", 0.3, 0.3), > (2, "pos2-3", 0.1, 0.3), > (2, "pos2-3", 0.8, 0.8), > (3, "pos3-1", 0.1, 0.7), > (3, "pos3-1", 0.7, 0.1), > (3, "pos3-1", 0.9, 0.0), > (3, "pos3-1", 0.1, 0.3) > ).toDF("group", "position", "x", "y") > // Compute top-1 rows for each group > val distance = sqrt( > pow(inputDf("x") - masterDf("x"), lit(2.0)) + > pow(inputDf("y") - masterDf("y"), lit(2.0)) > ) > val top1Df = inputDf.join_top_k( > lit(1), masterDf, inputDf("group") === masterDf("group"), > distance.as("score") > ) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
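The HIVEMALL-44 ticket above computes Top-K entries while processing the join instead of joining first and ranking afterwards. The pure-Python sketch below shows the core idea with a hash join whose probe side keeps only a size-K heap of the best-scoring matches per row; it follows the data shape of the ticket's example but is an illustration of the technique, not Hivemall's actual `ShuffledHashJoinTopKExec`.

```python
import heapq
from math import sqrt

def top_k_join(input_rows, master_rows, k):
    """Join on `group` and keep only the k closest master rows per input row."""
    # Build a hash table on the join key, as a shuffled hash join would.
    table = {}
    for m in master_rows:
        table.setdefault(m["group"], []).append(m)
    results = []
    for row in input_rows:
        # Min-heap of (-distance, position): the root is the worst match kept,
        # so the heap always holds the k smallest distances seen so far.
        heap = []
        for m in table.get(row["group"], []):
            d = sqrt((row["x"] - m["x"]) ** 2 + (row["y"] - m["y"]) ** 2)
            if len(heap) < k:
                heapq.heappush(heap, (-d, m["position"]))
            else:
                heapq.heappushpop(heap, (-d, m["position"]))
        for neg_d, pos in heap:
            results.append((row["userId"], pos, -neg_d))
    return results

input_rows = [{"userId": "user1", "group": 1, "x": 0.3, "y": 0.5}]
master_rows = [
    {"group": 1, "position": "pos1-1", "x": 0.5, "y": 0.1},
    {"group": 1, "position": "pos1-2", "x": 0.0, "y": 0.0},
    {"group": 1, "position": "pos1-3", "x": 0.3, "y": 0.3},
]
# pos1-3 is the closest master row to user1, so it is the top-1 match.
print(top_k_join(input_rows, master_rows, k=1))
```

Because non-top-K pairs are discarded as the join runs, the full per-group cross product is never materialized, which is the I/O saving the ticket describes.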
[jira] [Commented] (HIVEMALL-47) Support codegen for ShuffledHashJoinTopKExec
[ https://issues.apache.org/jira/browse/HIVEMALL-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928278#comment-15928278 ] Takeshi Yamamuro commented on HIVEMALL-47: -- Resolved by https://github.com/apache/incubator-hivemall/pull/37 > Support codegen for ShuffledHashJoinTopKExec > > > Key: HIVEMALL-47 > URL: https://issues.apache.org/jira/browse/HIVEMALL-47 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro > Labels: spark > > https://github.com/apache/incubator-hivemall/blob/master/spark/spark-2.1/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinTopKExec.scala#L32 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (HIVEMALL-65) Update define-all.spark and import-packages.spark
[ https://issues.apache.org/jira/browse/HIVEMALL-65?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved HIVEMALL-65. -- Resolution: Fixed > Update define-all.spark and import-packages.spark > - > > Key: HIVEMALL-65 > URL: https://issues.apache.org/jira/browse/HIVEMALL-65 > Project: Hivemall > Issue Type: Bug >Reporter: Takeshi Yamamuro > Labels: Spark > > Some declarations in define-all.spark and import-packages.spark are incorrect > and duplicated. > e.g. train_arowh: > https://github.com/maropu/incubator-hivemall/blob/AddScriptForSparkShell/resources/ddl/define-all.spark#L32-L36 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVEMALL-89) Support to_csv/from_csv in HivemallOps
Takeshi Yamamuro created HIVEMALL-89: Summary: Support to_csv/from_csv in HivemallOps Key: HIVEMALL-89 URL: https://issues.apache.org/jira/browse/HIVEMALL-89 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro It is useful to support to_csv/from_csv for Spark (See SPARK-15463 for related discussion) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (HIVEMALL-26) Add documentation about Hivemall on Apache Spark
[ https://issues.apache.org/jira/browse/HIVEMALL-26?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880620#comment-15880620 ] Takeshi Yamamuro commented on HIVEMALL-26: -- We will keep this ticket open until all the documentation for spark is filled in. > Add documentation about Hivemall on Apache Spark > > > Key: HIVEMALL-26 > URL: https://issues.apache.org/jira/browse/HIVEMALL-26 > Project: Hivemall > Issue Type: Sub-task >Reporter: Makoto Yui >Assignee: Takeshi Yamamuro > Labels: Documentation, Spark > > Our user guide should have entries about Hivemall on Spark on > http://hivemall.incubator.apache.org/userguide/ -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (HIVEMALL-65) Update define-all.spark and import-packages.spark
[ https://issues.apache.org/jira/browse/HIVEMALL-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15865530#comment-15865530 ] Takeshi Yamamuro commented on HIVEMALL-65: -- We also need to check spark versions and load proper functions in these scripts. > Update define-all.spark and import-packages.spark > - > > Key: HIVEMALL-65 > URL: https://issues.apache.org/jira/browse/HIVEMALL-65 > Project: Hivemall > Issue Type: Bug >Reporter: Takeshi Yamamuro > Labels: Spark > > Some declarations in define-all.spark and import-packages.spark are incorrect > and duplicated. > e.g. train_arowh: > https://github.com/maropu/incubator-hivemall/blob/AddScriptForSparkShell/resources/ddl/define-all.spark#L32-L36 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVEMALL-68) Use TaskContext in RowIdUDF
Takeshi Yamamuro created HIVEMALL-68: Summary: Use TaskContext in RowIdUDF Key: HIVEMALL-68 URL: https://issues.apache.org/jira/browse/HIVEMALL-68 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro Priority: Minor We could use TaskContext via Java reflection for generating unique IDs. https://github.com/apache/incubator-hivemall/pull/44#issuecomment-279294472 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (HIVEMALL-68) Use TaskContext in RowIdUDF
[ https://issues.apache.org/jira/browse/HIVEMALL-68?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-68: - Labels: Spark (was: ) > Use TaskContext in RowIdUDF > --- > > Key: HIVEMALL-68 > URL: https://issues.apache.org/jira/browse/HIVEMALL-68 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Priority: Minor > Labels: Spark > > We could use TaskContext via Java reflection for generating unique IDs. > https://github.com/apache/incubator-hivemall/pull/44#issuecomment-279294472 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
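The HIVEMALL-68 ticket above suggests using TaskContext to generate unique row IDs. One common scheme, sketched here in plain Python under the assumption that a partition id is available (in Spark it would come from `TaskContext.get().partitionId()`), packs the partition id into the high bits and a per-task counter into the low bits so that tasks never need to coordinate. The helper names and bit split are hypothetical, not RowIdUDF's actual code.

```python
def make_rowid_generator(partition_id, counter_bits=40):
    """Return a closure yielding IDs of the form (partition_id << counter_bits) | counter.

    Hypothetical sketch of the idea behind RowIdUDF: the high bits identify
    the task/partition, the low bits count rows emitted by that task, so IDs
    are unique cluster-wide without any locking or coordination.
    """
    counter = 0

    def next_id():
        nonlocal counter
        rowid = (partition_id << counter_bits) | counter
        counter += 1
        return rowid

    return next_id

gen0 = make_rowid_generator(0)  # e.g. the task running on partition 0
gen1 = make_rowid_generator(1)  # e.g. the task running on partition 1
print(gen0(), gen0(), gen1())  # partition 1's IDs start at 1 << 40
```

The choice of 40 counter bits caps each task at 2^40 rows while leaving room for 2^23 partitions in a 63-bit positive long; any split works as long as both sides stay within their budgets.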
[jira] [Created] (HIVEMALL-66) Remove wrapper classes for Hive UDFs
Takeshi Yamamuro created HIVEMALL-66: Summary: Remove wrapper classes for Hive UDFs Key: HIVEMALL-66 URL: https://issues.apache.org/jira/browse/HIVEMALL-66 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro Since the latest Spark does not support Map and List as return types in Hive UDFs, we have some GenericUDF wrapper classes in the spark module. But the Spark community has started discussing support for these types. If these types are supported in Spark, we can remove these wrapper classes. Reference: https://github.com/apache/spark/pull/16886 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVEMALL-65) Update define-all.spark and import-packages.spark
Takeshi Yamamuro created HIVEMALL-65: Summary: Update define-all.spark and import-packages.spark Key: HIVEMALL-65 URL: https://issues.apache.org/jira/browse/HIVEMALL-65 Project: Hivemall Issue Type: Bug Reporter: Takeshi Yamamuro Some declarations in define-all.spark and import-packages.spark are incorrect and duplicated. e.g. train_arowh: https://github.com/maropu/incubator-hivemall/blob/AddScriptForSparkShell/resources/ddl/define-all.spark#L32-L36 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVEMALL-62) Support a function to convert a comma-separated string into typed data and vice versa
Takeshi Yamamuro created HIVEMALL-62: Summary: Support a function to convert a comma-separated string into typed data and vice versa Key: HIVEMALL-62 URL: https://issues.apache.org/jira/browse/HIVEMALL-62 Project: Hivemall Issue Type: New Feature Reporter: Takeshi Yamamuro Priority: Minor Currently, Spark does not have this feature (IMO it will not appear as a first-class one in Spark), but it is useful for ETL before ML processing. e.g.)
{code}
scala> val ds1 = Seq("""1,abc""").toDS()
ds1: org.apache.spark.sql.Dataset[String] = [value: string]

scala> val schema = new StructType().add("a", IntegerType).add("b", StringType)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,true), StructField(b,StringType,true))

scala> val ds2 = ds1.select(from_csv($"value", schema))
ds2: org.apache.spark.sql.DataFrame = [csvtostruct(value): struct<a:int,b:string>]

scala> ds2.printSchema
root
 |-- csvtostruct(value): struct (nullable = true)
 |    |-- a: integer (nullable = true)
 |    |-- b: string (nullable = true)

scala> ds2.show
+------------------+
|csvtostruct(value)|
+------------------+
|           [1,abc]|
+------------------+
{code}
A related discussion is here: https://github.com/apache/spark/pull/13300#issuecomment-261962773 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (HIVEMALL-48) Support codegen for EachTopK
[ https://issues.apache.org/jira/browse/HIVEMALL-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859668#comment-15859668 ] Takeshi Yamamuro commented on HIVEMALL-48: -- A prototype is here: https://github.com/apache/incubator-hivemall/compare/master...maropu:HIVEMALL-48 > Support codegen for EachTopK > > > Key: HIVEMALL-48 > URL: https://issues.apache.org/jira/browse/HIVEMALL-48 > Project: Hivemall > Issue Type: New Feature >Reporter: Takeshi Yamamuro > Labels: spark > > https://github.com/apache/incubator-hivemall/blob/master/spark/spark-2.1/src/main/scala/org/apache/spark/sql/catalyst/expressions/EachTopK.scala#L124 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (HIVEMALL-55) Drop off the Spark v1.6 support before next Hivemall GA release
[ https://issues.apache.org/jira/browse/HIVEMALL-55?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-55: - Labels: Spark (was: ) > Drop off the Spark v1.6 support before next Hivemall GA release > --- > > Key: HIVEMALL-55 > URL: https://issues.apache.org/jira/browse/HIVEMALL-55 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro > Labels: Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVEMALL-55) Drop off the Spark v1.6 support before next Hivemall GA release
Takeshi Yamamuro created HIVEMALL-55: Summary: Drop off the Spark v1.6 support before next Hivemall GA release Key: HIVEMALL-55 URL: https://issues.apache.org/jira/browse/HIVEMALL-55 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (HIVEMALL-54) Add an easy-to-use script for spark-shell
[ https://issues.apache.org/jira/browse/HIVEMALL-54?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-54: - Labels: Spark (was: ) > Add an easy-to-use script for spark-shell > > > Key: HIVEMALL-54 > URL: https://issues.apache.org/jira/browse/HIVEMALL-54 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro > Labels: Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] (HIVEMALL-46) Make it simpler to upgrade Spark versions
Takeshi Yamamuro created an issue Hivemall / HIVEMALL-46 Make it simpler to upgrade Spark versions Issue Type: Improvement Assignee: Unassigned Created: 31/Jan/17 12:14 Priority: Major Reporter: Takeshi Yamamuro To support upcoming Spark releases, we currently need to copy many files from `spark/spark-2.X` to `spark/spark-2.Y` and then fix the compile errors that happen there. Although this works, the copying makes the amount of code blow up. So we need to clean up the source code structure (e.g., APIs) to easily follow upcoming Spark releases. This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)